Feb 23, 2016

Amazon’s DynamoDB service is a managed NoSQL database that promises great speeds that allow it to be “…a great fit for mobile, web, gaming, ad tech, IoT, and many other applications.” That claim is pretty much just a pipe dream. The reality is that DynamoDB is a terrible fit for most applications, and your best bet would be to prefer regular NoSQL databases and manage the machines yourself.

It’s not all terrible

First, there are a couple of good things that DynamoDB offers, and it’s worth noting them in the interest of thoroughness. Most importantly, it’s managed hardware, so you get things like resilience, guaranteed availability, and replication without having to directly manage the servers for them. As someone who’s not into spending a lot of time or resources managing physical servers, that’s a pretty big benefit. Most of the headaches of maintaining your own datastore are settled, and from a server perspective, the database is being set up and administered properly.

DynamoDB also promises incredible performance and linear scaling, which is fantastic. Between that kind of speed and scalability and having managed servers, DynamoDB certainly seems like a compelling option. However, DynamoDB gets these performance benefits by enforcing a very prescriptive format on the documents in the datastore. As a result, the benefits of DynamoDB all assume that you can fit it to your application’s datastore needs, which in my experience is a pretty big ask.

The case study

We use several AWS services at my job, and when Amazon announced their own managed document-based datastore, there was a lot of excitement to try it out. We’d used it for a few internal proof-of-concept projects, but never as the datastore for a full-fledged application. During our engineering department’s hack days event, we built a fun little application for automating most of the company’s rotating kitchen duty responsibilities, and decided that this would be the perfect time to get a sense of using DynamoDB for real. It’s an application that’s only going to be used by us, so if it’s problematic, customers never see it. It’s about kitchen duty, so if it goes down, nobody’s going to care that much. It’s a small and fairly simple application, so we’re not testing this while trying to do something crazy. In other words, it’s the perfect low-risk opportunity to play with a shiny new toy.

Why it sucks

As I mentioned, we were already using DynamoDB internally for a couple of proof-of-concepts, and the thing that made the biggest impression on me is also my biggest gripe with DynamoDB – you can’t just query for data using arbitrary criteria. Instead, you can retrieve documents using their hash key, optionally limiting the results with a range key (both of which need to be specified when you create the table). DynamoDB also lets you scan the table and apply additional filters to the pages that come back, but that is much more read-intensive (which costs more), and behaves counter-intuitively.

I don’t care what your database claims to offer the world – being able to get my data back out using any criteria I want is a core feature of a database, and not supporting arbitrary queries means your database is virtually unusable. DynamoDB requires that I either already know my primary key, or that it be something calculable in order to even have a chance of getting my data back out, because no version of querying DynamoDB works without me being able to provide it. Getting documents out of DynamoDB is more lookup based on a known key than query, which makes it seem more like a file system than a database. Amazon’s problem is that they already have S3, which is a fantastic file system for which DynamoDB is a rather awkward replacement.

By the way, I’m using the term “primary key” loosely here. I’m not referring to a field that has to be unique amongst all records in the collection. Multiple documents can have the same primary key. You can specify a secondary range sort field to help narrow queries down. You can also iterate through the data in code after a query returns to narrow things down if you want. In our case, there are always at least 2 records in 1 of our collections with the same primary key value, but the secondary sort key was always enough if we needed to limit our results to just 1 record.
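To make that key model concrete, here is a minimal in-memory sketch of it. It is not the AWS SDK, and the class, method names, and sample data are all invented for illustration; it only models the idea that the hash key is mandatory, the range key narrows the result, and anything else means walking the whole table.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Predicate;

// In-memory illustration only; this is NOT the AWS SDK and all names are hypothetical.
final class KeyModelSketch {
    // hash key -> (range key -> item payload), with range keys kept in sorted order
    private final Map<String, TreeMap<String, String>> partitions = new HashMap<>();

    void put(String hashKey, String rangeKey, String payload) {
        partitions.computeIfAbsent(hashKey, k -> new TreeMap<>()).put(rangeKey, payload);
    }

    // "Query": the hash key is required; every item sharing it comes back in range-key order.
    List<String> query(String hashKey) {
        TreeMap<String, String> items = partitions.get(hashKey);
        return items == null ? Collections.<String>emptyList() : new ArrayList<>(items.values());
    }

    // Narrowing by the range key pins the result down to at most one item.
    String get(String hashKey, String rangeKey) {
        TreeMap<String, String> items = partitions.get(hashKey);
        return items == null ? null : items.get(rangeKey);
    }

    // "Scan": with no key to go on, every item in the table has to be examined.
    List<String> scan(Predicate<String> filter) {
        List<String> matches = new ArrayList<>();
        for (TreeMap<String, String> items : partitions.values()) {
            for (String payload : items.values()) {
                if (filter.test(payload)) {
                    matches.add(payload);
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        KeyModelSketch table = new KeyModelSketch();
        table.put("2016-02-22", "morning", "alice");   // two items share one hash key
        table.put("2016-02-22", "afternoon", "bob");
        System.out.println(table.query("2016-02-22"));
        System.out.println(table.get("2016-02-22", "morning"));
    }
}
```

Notice that nothing in this model can answer “which days is alice on duty?” without going through scan, which is the whole complaint.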

Updating data that’s already in DynamoDB is a massive pain in the crack too. There’s a very specific way you have to format the data going into DynamoDB, including mapping the field names you’re updating to variables and the field values to other variables, and then writing an update expression using those variables. It’s overly complicated if you’re not using an annotated DynamoDB POJO – you end up building out this counter-intuitive, maddeningly complex update document just to ultimately increment a counter in the record at hand. At worst, a database like Mongo lets you just build a raw document with the new data (or use the $set operator to just update the fields you need).
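For a sense of the moving parts, here is a rough sketch of the three pieces you assemble for a simple counter increment. It uses plain Java maps rather than the AWS SDK, the class name and the timesServed field are made up, and the values map is simplified: in the real API each value is a typed attribute value, not a bare string.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the pieces behind one counter increment. Plain maps stand in for the SDK's
// request types; the class name and the field being incremented are hypothetical.
final class IncrementUpdateSketch {
    // Placeholder -> real field name ("#field" stands in for the attribute name).
    final Map<String, String> expressionAttributeNames = new HashMap<>();
    // Placeholder -> value. Simplified: the real API wants a typed number attribute here.
    final Map<String, String> expressionAttributeValues = new HashMap<>();
    // The expression itself only ever refers to the placeholders.
    final String updateExpression;

    IncrementUpdateSketch(String fieldName, int amount) {
        expressionAttributeNames.put("#field", fieldName);
        expressionAttributeValues.put(":amount", Integer.toString(amount));
        updateExpression = "SET #field = #field + :amount";
    }

    public static void main(String[] args) {
        IncrementUpdateSketch update = new IncrementUpdateSketch("timesServed", 1);
        System.out.println(update.updateExpression);
        System.out.println(update.expressionAttributeNames);
        System.out.println(update.expressionAttributeValues);
    }
}
```

That is two lookup tables and an expression string to add 1 to a number, which Mongo handles with a single $inc operator.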

What makes the headaches around updates so much worse is that modifying a document in the DynamoDB web console is incredibly easy (the live, production DynamoDB web console, not the local shell you can use for development). Maybe Amazon abstracted away all the complications in their console, but that just raises the question of why they couldn’t do it in their libraries. Oh, by the way, if you make a mistake in your update object, don’t expect the error message to be of any use. It’s a lovely little generic number:

"An operand in the update expression has an incorrect data type"

Enlightening, I know. I bet you know exactly what the issue is in my hypothetical update and could immediately point it out to me if I pasted it here.

Another thing, if you want to get the most out of DynamoDB’s performance, you’re going to have to tune the read and write capacities. So now you need to know the size of the data going into your database, and how fast you’re going to need to be able to add to it or get it out, and this math changes based on whether you want strongly or eventually consistent operations. Remember how this was supposed to be easier because Amazon was managing this?
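Here is roughly what that math looks like, as a sketch. It assumes the documented unit sizes for provisioned throughput: one read unit covers one strongly consistent read per second of an item up to 4 KB (an eventually consistent read costs half that), and one write unit covers one write per second of an item up to 1 KB. The class and method names are my own, and rounding the halved read figure up is a simplification.

```java
// Back-of-the-envelope capacity math; class and method names are hypothetical.
final class CapacitySketch {
    private static final long READ_UNIT_BYTES = 4 * 1024;  // item size covered by one read unit
    private static final long WRITE_UNIT_BYTES = 1024;     // item size covered by one write unit

    static long readCapacityUnits(long itemSizeBytes, long readsPerSecond, boolean stronglyConsistent) {
        // Item size is rounded up to the next 4 KB boundary for each read.
        long stronglyConsistentUnits = ceilDiv(itemSizeBytes, READ_UNIT_BYTES) * readsPerSecond;
        // Eventually consistent reads cost half as much; rounding up here is a simplification.
        return stronglyConsistent ? stronglyConsistentUnits : ceilDiv(stronglyConsistentUnits, 2);
    }

    static long writeCapacityUnits(long itemSizeBytes, long writesPerSecond) {
        // Item size is rounded up to the next 1 KB boundary for each write.
        return ceilDiv(itemSizeBytes, WRITE_UNIT_BYTES) * writesPerSecond;
    }

    private static long ceilDiv(long a, long b) {
        return (a + b - 1) / b;
    }

    public static void main(String[] args) {
        // Hypothetical workload: 6,000-byte items, 10 reads and 5 writes per second.
        System.out.println(readCapacityUnits(6000, 10, true));
        System.out.println(readCapacityUnits(6000, 10, false));
        System.out.println(writeCapacityUnits(6000, 5));
    }
}
```

None of it is hard, but you need real numbers for item size and request rate before you can fill in two boxes on a form.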

DynamoDB behaves differently when you’re running it locally than in production. That puts testing somewhere in between unreliable and pointless. During our app development, I had a DynamoDB running in a Docker container on my laptop and was able to see code execute flawlessly, only to have it fail when pushed online. Either the DynamoDB jar Amazon gives developers to run it locally is considerably out-of-date with the production version, or there’s something seriously wrong with it, and regardless of which it is I’d expect better out of an Amazon library. The only way to test an application using DynamoDB on the back-end is apparently to have another DynamoDB up on AWS for you to use for testing. Sounds ridiculous, right? As popular as that “I don’t always test code, but when I do, I do it in production” meme is, we all know it’s a terrible practice, but it’s also all that you’re left with when you run DynamoDB.

By the way, when you’re creating new objects to store in DynamoDB, you had better initialize every member variable to some reasonable default (e.g. Lists should be empty List objects, integers should be 0, etc.). This is the part where we saw different behavior running DynamoDB locally vs. in production. Locally, DynamoDB had no problem with uninitialized fields and used the plain old Java default values when we tried to increment/decrement an integer, or empty objects when we were altering a list. In production, we got NullPointerExceptions. Now, if we had seen the same behavior in both places, this would have been something nobody but the person who initially wrote the code noticed. As it was, everyone was painfully aware of this as we were field-testing the app just before turning the company loose on it. This led to repeated emergency fixes to the code that should have been easily avoided.
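As a sketch of what “initialize everything” means in practice, here is a hypothetical item class. It is a stand-in rather than our actual code, and the class and field names are invented; the point is only that every field has a usable value from the moment the object exists, so an increment or a list append never touches a null.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical item class; every field gets an explicit, usable default.
final class KitchenDutyRecord {
    private String employeeId = "";
    private Integer timesServed = 0;                           // a boxed Integer would otherwise be null
    private List<String> completedTasks = new ArrayList<>();   // empty list instead of null

    void recordShift(String task) {
        timesServed = timesServed + 1;   // unboxing a null Integer here would throw a NullPointerException
        completedTasks.add(task);        // and so would calling add on a null list
    }

    String getEmployeeId() {
        return employeeId;
    }

    int getTimesServed() {
        return timesServed;
    }

    List<String> getCompletedTasks() {
        return completedTasks;
    }

    public static void main(String[] args) {
        KitchenDutyRecord record = new KitchenDutyRecord();
        record.recordShift("dishes");
        System.out.println(record.getTimesServed() + " " + record.getCompletedTasks());
    }
}
```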

There are some libraries like DynamoDBMapper that probably would have cut some of these problems off at the pass, but we weren’t using them for this project. Given that we wanted to see if DynamoDB could be a viable database for our applications, we probably should have been.

Really though, I would have been much happier if Amazon had just started offering managed Mongo clusters. Just let me set a minimum size for the cluster, a cap on how big I’m willing to pay for it to be, and that’s it. I can query it any way I want, and I don’t have to configure read and write capacities. This is Amazon we’re talking about, so I figure they know how to handle auto-scaling. There’s tons of tooling for Mongo, because it’s a pretty standard database, and pretty much all the quirks are well-known, well-documented, and have known work-arounds. The performance of Mongo is still good, so it’s not like you’re going to be missing anything on that front. It’d basically be like having the benefits of the DynamoDB managed hardware, without all the hassles that come from using the DynamoDB database itself.

DynamoDB offers a lot of speed and managed resources, which sounds like something that’s too good to be true, and it is. To do that, they committed the greatest sin a database could make – they made a database that you can’t just query. Couple that with the inconsistent behavior we saw running a jar locally vs. running in production, and the overly prescriptive steps involved in performing actions on the tables, and DynamoDB is just far more trouble than it’s worth. It sounds great in theory, but in practice it falls painfully short. My personal advice: if you need a good NoSQL database, just use Mongo.
