Saturday, August 29, 2009

SQL madness

Tonight I'm working on a side project that involves creating a web service to return JavaScript objects for an Ajax application. Part of the joy is serializing the data from all the various tables into a single JSON object.

The data I'm working with involves a talent agency, so there are people and each person has multiple photos in their profile. Several other bits of data, of course, but this is what's causing the problem.

As with everything programming-related, you can take your pick of several methods to get the job done. Usually I'd write a script on the back end to traverse the dataset and concatenate everything together. For this project, I'm working with legacy PHP code and gradually transitioning to ASP. Because of the transition, I'm trying to abstract the data representation into a view so that I don't have to rewrite the JSON serialization in two languages. Thus, a view!

Now... the view has several joins in it, and I only want one row per person in the view. To accomplish this, the view uses several GROUP_CONCAT operations. For each person in the database, the data looks something like this:

{"id": 1234, "photos": [{"url": "abc.jpg"}, {"url": "def.jpg"}]}
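For illustration, the view looks something like this sketch (table and column names here are made up, not the real schema):

```sql
-- Hypothetical sketch of the view: one row per person, with GROUP_CONCAT
-- gluing the per-photo JSON fragments into an array.
CREATE VIEW person_json AS
SELECT p.id,
       CONCAT('{"id":', p.id, ', "photos":[',
              GROUP_CONCAT(CONCAT('{"url":"', ph.url, '"}')
                           ORDER BY ph.sort_order SEPARATOR ','),
              ']}') AS json
FROM person p
JOIN photo ph ON ph.person_id = p.id
GROUP BY p.id;
```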

Some people have a LOT of profile photos. Like over 30. I hate these people.

GROUP_CONCAT has a default result-length limit of 1,024 bytes (the group_concat_max_len variable), and the results for these people exceed that limit. The query truncates the result at that limit, and the resulting JSON ends up broken-- which breaks the entire Ajax app.

Curse these people.

That being said, here are my options:
  1. Delete excess photos for people who have too many photos.
  2. Only return enough results to fill the 1024 characters without going over.
  3. Set the group_concat_max_len to be a larger value.
  4. Create a stored procedure to return the concatenated results.
Option 1 is by far the easiest. The trouble is that quantity isn't the only issue-- if people put long descriptions in their photos I could hit the limit again with 4-5 photos. Easy, but not really acceptable.

Option 2 is kind of what I've settled on as an intermediate solution. I'm only returning one photo as the headshot, then I'll return all the photos when the details for that individual are returned in a separate request. For those 4-5 people who are going to break the system, I'm just going to let it break. It's only 0.1% of the dataset, and if somebody complains I'll go back to Option 1 and nuke their excess photos.
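The one-headshot version boils down to a correlated subquery instead of a GROUP_CONCAT-- something like this (again, assumed table and column names):

```sql
-- One photo per person: just grab the first one by sort order.
SELECT p.id,
       (SELECT ph.url
        FROM photo ph
        WHERE ph.person_id = p.id
        ORDER BY ph.sort_order
        LIMIT 1) AS headshot_url
FROM person p;
```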

Option 3 is the ideal solution. For reasons unknown to me, I can't get the variable to change value! Isn't that why they're called variables instead of statics?! Sheesh.

SHOW VARIABLES WHERE Variable_name = 'group_concat_max_len'; /* 1024 */
SHOW VARIABLES WHERE Variable_name = 'max_allowed_packet';   /* 16,777,216. Plenty big, no? */
SET group_concat_max_len = @@max_allowed_packet;             /* SHOULD assign the value, right? */
SHOW VARIABLES WHERE Variable_name = 'group_concat_max_len'; /* Value is STILL 1024! */
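My best guess (which I can't verify tonight): a plain SET only changes the *session* value, and a lot of web admin tools open a fresh connection for every statement, so the change evaporates before the next query runs. If that's the cause, setting it globally-- or setting it in the same connection that runs the actual query-- should stick:

```sql
-- Session scope: must run in the SAME connection as the query that follows.
SET SESSION group_concat_max_len = @@max_allowed_packet;

-- Global scope: applies to new connections; requires the SUPER privilege.
SET GLOBAL group_concat_max_len = 1048576; /* 1 MB, for example */
```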

Tech support is closed at the moment, which is why I'm not pursuing this option harder.

Option 4 is my intermediate punt. I haven't made a stored function in SQL in about 5 years, and the syntax is giving me headaches at 1:15 in the morning. Which is why I'm blogging this rather than tackling it.
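For the record, the stored-function version would look roughly like this-- a cursor walking the photos and concatenating as it goes. Names are assumptions and I haven't actually run this, which at 1:15 AM is the whole point:

```sql
DELIMITER //
-- Hypothetical: build the photos array for one person, free of the
-- GROUP_CONCAT length limit.
CREATE FUNCTION photos_json(p_person_id INT) RETURNS TEXT
READS SQL DATA
BEGIN
  DECLARE done INT DEFAULT 0;
  DECLARE v_url VARCHAR(255);
  DECLARE result TEXT DEFAULT '';
  DECLARE cur CURSOR FOR
    SELECT url FROM photo WHERE person_id = p_person_id ORDER BY sort_order;
  DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;

  OPEN cur;
  read_loop: LOOP
    FETCH cur INTO v_url;
    IF done THEN LEAVE read_loop; END IF;
    IF result <> '' THEN SET result = CONCAT(result, ','); END IF;
    SET result = CONCAT(result, '{"url":"', v_url, '"}');
  END LOOP;
  CLOSE cur;
  RETURN CONCAT('[', result, ']');
END //
DELIMITER ;
```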

Once the kinks are worked out of the system, I may change things up a bit further. The view takes about 3 seconds to calculate, and in a web app that's an eternity. What really needs to happen is a table with all the values pre-computed. For every update to one of the other tables it would recompute the JSON text, then everything would be ready to go for the next query.
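The recompute-on-update idea could be triggers on the photo table, something like this sketch (untested, assumed names, and matching UPDATE/DELETE triggers would be needed too):

```sql
-- Cache table holding the pre-built JSON, one row per person.
CREATE TABLE person_json_cache (
  person_id INT PRIMARY KEY,
  json      TEXT NOT NULL
);

DELIMITER //
-- Rebuild one person's cached JSON whenever a photo is added.
-- (Same group_concat_max_len caveat applies here, of course.)
CREATE TRIGGER photo_after_insert AFTER INSERT ON photo
FOR EACH ROW
BEGIN
  REPLACE INTO person_json_cache (person_id, json)
  SELECT NEW.person_id,
         CONCAT('{"id":', NEW.person_id, ', "photos":[',
                GROUP_CONCAT(CONCAT('{"url":"', url, '"}') SEPARATOR ','),
                ']}')
  FROM photo
  WHERE person_id = NEW.person_id;
END //
DELIMITER ;
```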

A quick performance comparison:
  • select *, using all the joins of the view: 5 seconds
  • select the id column, using all the joins: 0.7 seconds
  • select -any- of the columns from the view: 3 seconds
So the view's cost is pretty flat: faster than doing the raw joins, but it doesn't speed up if I'm only after a subset of the columns. I believe the precomputed table would speed this up considerably, at the expense of slower updates.

The database is updated infrequently enough that I could probably get away with this. Or, I could run a script every few hours to update the JSON table.
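The every-few-hours version could even live inside MySQL itself, via the event scheduler added in 5.1. Here `refresh_person_json()` is a hypothetical procedure that rebuilds the cache table:

```sql
-- The scheduler must be on: SET GLOBAL event_scheduler = ON;
CREATE EVENT refresh_json_cache
ON SCHEDULE EVERY 3 HOUR
DO CALL refresh_person_json();
```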

So many options!
