Jul 26, 2012

I did a little bit of work using mongo recently, and I stumbled across some of those “lessons learned the hard way” things that I thought I’d share. If you haven’t used mongo or aren’t familiar with it, I recommend their official site and Wikipedia for a quick introduction to what mongo is and how it works.

As tempting as it is to kick off with some good old-fashioned NoSQL vs. SQL ranting, I’ll save that for some other time. Right now I want to focus on the random things I ran into and wound up having to deal with. First on the list: filtering on related dates and times that are stored in separate fields requires ISO-formatted date strings. Here’s the scenario I had: I was getting data that had dates and times stored as separate values. In hindsight, it may have made my life simpler if I had grouped the dates and times together (they were related, after all) and put them into the database as a single datetime. However, I was trying to do as little manipulation of the data as possible, so that thought didn’t cross my mind. The problem I ran into was with the time field. If you store a time in a datetime object, the time is automatically associated with a date, and there’s no getting around that. Specifically, it’s associated with the date the record is created, not the date in the other field. Whoopsie.
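To make the trap concrete, here’s a minimal sketch (the values and variable names are made up). BSON has no time-only type, so a bare time has to be wrapped in a datetime before it can be stored, and whatever date you wrap it with sticks:

import datetime

event_time = datetime.time(11, 24)
# Wrapping the time in a datetime stamps it with today's date,
# not the date sitting in the record's other field
stored = datetime.datetime.combine(datetime.date.today(), event_time)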

So after Googling around, I stumbled on storing dates and times as separate strings in an object. OK, that’s a good start. The other catch is to make sure each string is an ISO-formatted date (or time) string. Do this, and the strings compare correctly, because lexicographic order matches chronological order (e.g. you can do things like less than [or equal to], greater than [or equal to], or equals, and get the answers you’d expect). Since you’re using strings, you can actually skip the separate object altogether, but that part’s purely optional.
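For instance, a range query over ISO time strings would look something like this (a sketch; the connection, database, and collection names are mine, and I’m using a current pymongo API rather than the pre-2.0 one I was on at the time):

from pymongo import MongoClient

client = MongoClient()  # hypothetical local instance
events = client.mydb.events  # hypothetical database and collection

# Store the date and time as separate ISO-formatted strings
events.insert_one({"date": "2012-07-26", "time": "11:24:00"})

# ISO strings sort lexicographically in chronological order,
# so relational operators like $lt behave as expected
morning_events = events.find({"date": "2012-07-26", "time": {"$lt": "12:00:00"}})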

Another thing I learned the hard way is this: if your application needs to run on a mongo replica set, develop it on a mongo replica set. The specific problem I wasted a lot of time on was that we were periodically seeing an error in the logs about a save not happening correctly, even though the record was still showing up when we looked in the database itself. Worse, the write attempts that were supposed to occur immediately after the failed save weren’t happening, leaving us with bad data. I’ll skip the play-by-play of the hours I spent digging around in the code trying to figure out just what the hell was going wrong, partly because it’s boring and most of it was useless (after all, it didn’t work), and partly because it’s been a few days and I just plain don’t remember.

Anyways, the main cause of this issue was me writing a record to the database and immediately trying to query for that record. When you’re doing this with just one mongo instance, like I was when I was developing the application, this is no problem: make sure the safe flag is set to True and keep on trucking. When you’re doing this with a replica set, it doesn’t work quite as well. Writes happen on the primary mongo instance and then eventually get propagated to the other servers. You can set the minimum number of servers that need to have the data written to them before the write operation returns, but the only way you can guarantee that a write followed by a read will succeed is to force the write to all the mongo instances. This ruins the speediness people use mongo for, because a) you have to know exactly how many mongo instances are up and running at the time of the write (“all the instances up at the moment” is not a valid option for that flag, it seems), which means stopping and counting them, and b) you have to stop and wait for the data to be written to every single mongo server. Mongo replica sets have eventual consistency by default, and unless you’re writing super critical, vitally important data to the database, don’t bother (and if you are, why aren’t you using a SQL database with built-in atomicity?).
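For reference, here’s roughly what tuning that write acknowledgement looks like (a sketch with made-up host and collection names, using the modern write-concern API rather than the old safe flag):

from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

# Hypothetical three-member replica set
client = MongoClient("mongodb://host1,host2,host3/?replicaSet=rs0")

# w=1 (the default) only waits for the primary; w="majority" waits for
# most members; w=3 would wait for all three -- and you'd have to know
# the member count up front to ask for "all"
coll = client.mydb.events.with_options(write_concern=WriteConcern(w="majority"))
coll.insert_one({"status": "saved"})

# Even after this returns, an immediate read isn't guaranteed to hit a
# member that has the write unless the read goes to the primary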

If you’re going to have objects (like Python’s datetime) in the database that you intend to convert to and from JSON, use pymongo’s json_util module to go between JSON and BSON. If you’re not familiar with mongo, looking over the database you’d think that mongo stores JSON documents. Close, but wrong: it stores documents in a format called BSON (binary JSON). For all intents and purposes the two are the same, including document structure and syntax, unless you’re dealing with objects. json_util (it’s part of pymongo) can handle those distinctions for you. You use it with simplejson to dump/load mongo data as JSON via the default/object_hook keyword arguments (respectively). For instance, to create JSON from a mongo record, you’d type:

import simplejson
from bson import json_util

# json_util.default serializes the BSON-specific types (ObjectId, datetime, etc.)
json_data = simplejson.dumps(mongo_record, default=json_util.default)

And vice versa, to create a Python object from a JSON document that has objects in it, use:

import simplejson
from bson import json_util

# json_util.object_hook reverses the conversion, rebuilding the BSON types
python_object = simplejson.loads(json_data, object_hook=json_util.object_hook)

Now that you have all the messy object conversion between BSON and JSON being handled on your behalf, what about converting from BSON to Python? After all, you may actually want to do something with the data before sending it on to whoever asked for it. Well, you’re in luck, because the code I showed you earlier is exactly what you need; specifically, both lines of it together. simplejson.loads() will create a Python dictionary-style object you can play with, but you can’t call it directly on records as they come out of the database (not if they have objects in them, at least). To get around this, dumps() the mongo record into JSON, then loads() it back into Python. So your code would look like:

import simplejson
from bson import json_util

# Round-trip through JSON to turn BSON-specific types into plain Python values
json_data = simplejson.dumps(mongo_record, default=json_util.default)
python_data = simplejson.loads(json_data, object_hook=json_util.object_hook)

Lastly, the (I’ve come to dread it) AutoReconnect error. Before I go on, let me disclaim that I was using older versions of mongo and pymongo (both pre-2.0). Having said that, this tip boils down to this: you are responsible for handling your own AutoReconnect issues. During some of our testing, QA would kill the master in a mongo replica set. No problem: mongo promptly elects a new master, so far so good. Pymongo also seemed to handle this just fine, which meant the problem was us, not our libraries or database.

We had a library to manage connections to replica sets, including periodically checking slave connections and refreshing those connections as necessary. When we killed a master, a slave became the new master, but it was still in our library’s list of slave nodes (which makes sense, seeing as how the slaves hadn’t yet been polled to discover that one of them was now in charge). Refreshing that connection killed the new master, leaving the replica set with no master at all. This throws a pymongo AutoReconnect error, which caused a lot of stress, distress, and massive pains in my butt to figure out and fix. We solved the issue by wrapping the database calls in a loop that retries until the call either succeeds or we’ve gone through all the nodes, at which point, yes, there’s an error.
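The retry wrapper looked roughly like this (a sketch from memory; the function name, attempt count, and delay are made up):

import time
from pymongo.errors import AutoReconnect

def with_retry(operation, max_attempts=5, delay=0.5):
    # Retry a database call until it succeeds or we run out of attempts
    for attempt in range(max_attempts):
        try:
            return operation()
        except AutoReconnect:
            if attempt == max_attempts - 1:
                raise  # out of attempts, so surface the error
            time.sleep(delay)  # give the replica set time to elect a new master

# Usage:
# record = with_retry(lambda: collection.find_one({"_id": record_id}))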

Hopefully, these tips come in handy should you ever write anything involving Python and mongo. Well, hopefully you won’t run into these issues at all, but having a useful reference can be your backup plan. Either way, the more you know and some such G.I. Joe-ism.

Posted at 11:24 AM