How do you solve a problem like sample data?

Dec 17, 2015

It’s a tale as old as development – you make an application, and now you need to sell it. That means you need a demo, and a demo needs data in it. The dilemma is how you get that data. Do you run a demo app in a sandboxed environment, do you add it to your regular production database, do you take some screenshots of what the app looks like with data from a development environment, or do you do something else entirely? It seems like a stupid thing to worry about until you’re actually trying to figure it out. Then it becomes really important, because whatever you decide to do about sample data, you’re going to have to live with it forever.

Not having any demo data isn’t really a good option. You have to sell this software, which means people are going to want to see it in action, and it really helps people to see all the features your product offers when there’s already stuff there for potential customers to interact with. You could start a demo with no application data and then walk customers through everything from scratch, but that doesn’t make for very good promotional screenshots or videos. There’s a psychological component to some of these things – even if some of the numbers your app is showing are unrealistically good, and even if the customer knows they’re unrealistically large, they’re still good, and that provides a warmer, fuzzier feeling than numbers that are bad, or 0.

You could create a sandboxed demo environment. This gives you the ability to demo your application without having to worry about the sample data, since you can always just destroy the database and re-create it from scratch with pristine data. It also gives you a place to beta test your latest changes, either internally or with a small group of beta customers, or even just to test things like database migrations. Nobody has to worry about their data getting jacked up because it’s a sandbox – just wipe the whole thing and start over from scratch like it never happened. This works particularly well for a web application that you offer to the general public. You stand up a demo environment that’s periodically reset and give the link out to people to play with. You can see an example of this philosophy in action with Discourse’s sandbox. In fact, you could have multiple sandboxes – a production mirror running the same version of your software that’s live in production, and a staging sandbox where people can try new features if they want, all at 0 risk to their existing application data.
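To make the "wipe it and start over" idea concrete, here is a minimal sketch of a sandbox reset, assuming the demo database is SQLite. The table names, the seed rows, and the `reset_sandbox` function are all made up for illustration; a real application would reuse its own schema and migration tooling instead.

```python
import sqlite3

# Hypothetical schema and seed rows, for illustration only.
SCHEMA = """
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id),
    total_cents INTEGER NOT NULL
);
"""
SEED_CUSTOMERS = [(1, "Acme Corp"), (2, "Globex")]
SEED_ORDERS = [(1, 1, 125000), (2, 1, 98000), (3, 2, 410000)]


def reset_sandbox(conn):
    """Drop every user table, rebuild the schema, and reload the pristine seed data."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
        )
    ]
    for table in tables:
        conn.execute(f'DROP TABLE IF EXISTS "{table}"')
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO customers VALUES (?, ?)", SEED_CUSTOMERS)
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", SEED_ORDERS)
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    reset_sandbox(conn)
    # Simulate a visitor leaving junk behind, then reset again.
    conn.execute("INSERT INTO customers VALUES (99, 'asdf')")
    conn.commit()
    reset_sandbox(conn)
    print(conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0])
```

Run on a schedule (a cron job, for instance), something like this is what keeps a public sandbox from accumulating whatever visitors typed into it.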

The downside to having a sandbox environment is that you now have multiple environments that need to be maintained with every release. If you’re using release automation, that’s no big deal, but if your release process is manual, this is a hassle. Even if you do have fully automated releases, it’s still 2 sets of infrastructure, 1 of which is only going to be used sporadically at best. If your sales team wants to customize the data for the customer they’re targeting, then you’re going to be spending a lot of time tweaking your sandbox, instead of your application.

Lastly, there’s always the “just put it in production” option. I have to be honest here: I’m not totally convinced this is actually better than nothing. It does have some things going for it. There’s only 1 version of the application that you ever have to maintain, and you only have 1 set of application infrastructure to keep up and running. Again, if your deployments are automated, maintaining a second environment is a non-issue, so that advantage doesn’t count for much. It also gives you the most realistic demo of your application, since it’s the live application.

The downside to having all your demo data in production is that you have a lot of crap data clogging up your application. If you’re trying to process stats about your users and what they’re doing, this data is going to throw it all off. It’s one thing to put this data into a production environment to load-test your application pre-release, but after that it’s doing nobody any favors. If you need to update the schema of your data, that means you have to migrate junk data. Since that data was probably side-loaded rather than entered the way regular data gets into your application, there are probably some strange interactions between the raw data and data added legitimately through the app. In fact, there may be a few chunks missing, which makes the sample data harder to find and purge later. For instance, if you have automated data processing jobs, you may have deliberately left out of the sample data the pieces needed to run or even trigger those jobs, so you would at least avoid wasting those resources.

So let’s review: having demo data in production means you’re wasting storage, and either the data is slowing down every batch operation (because it’s more data you have to churn through), or it is deliberately incomplete so as to avoid doing work on stuff you know is junk. If your sales team is also using the same account with demo data internally, there’s a very good chance you’ll see some strange interactions between data that was added legitimately through actual app usage and data that you put in for sales calls. Long story short, demo data in the live application is a bad idea.

Personally, I think this is a scenario where something like Docker could really shine. You build a container that runs the latest version of your application along with your sample data generation script, pointing either to a local copy of the database or to another container running the database. You run your script, which loads the data into the containerized application, and demo that. Then when you’re done, kill the container(s). This makes it easy to customize the demo for different clients: your sales engineers can just update the sample data source with data specific to the demo, without messing up anybody else’s demos (they’re running in separate containers). The crap data is out of your application, sales people still have data they can show off to potential customers, and all you have to do is make sure your deployment process posts an updated Dockerfile, which is trivial if you’re using Docker in production or development already. Even if you’re not, it’s not that hard to add to your existing deployment pipeline. Everybody wins.
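Here is a rough sketch of what the per-client piece of that workflow might look like, under some loud assumptions: the image name `example/app:latest`, container port 80, the `/demo/seed.json` mount path, and the shape of the generated data are all hypothetical, and how your application actually ingests the seed file is up to you. The script only builds the `docker` command lines; it doesn’t run them.

```python
import json
import random
from pathlib import Path


def build_demo_data(client_name, products, n_orders=20):
    """Generate repeatable sample orders; seeding with the client name
    means the same client always gets the same demo data."""
    rng = random.Random(client_name)
    orders = [
        {"id": i, "product": rng.choice(products), "quantity": rng.randint(1, 50)}
        for i in range(1, n_orders + 1)
    ]
    return {"client": client_name, "orders": orders}


def write_seed_file(data, path):
    """Write the generated data where the demo container can mount it."""
    path = Path(path)
    path.write_text(json.dumps(data, indent=2))
    return path.resolve()


def demo_commands(client_slug, seed_path, image="example/app:latest", host_port=8080):
    """Build (but do not run) the commands to start and stop one client's demo.

    The image name, container port 80, and /demo/seed.json are assumptions.
    """
    name = f"demo-{client_slug}"
    start = [
        "docker", "run", "--rm", "-d",
        "--name", name,
        "-p", f"{host_port}:80",
        "-v", f"{Path(seed_path).resolve()}:/demo/seed.json:ro",
        image,
    ]
    # Because the container was started with --rm, stopping it also removes it.
    stop = ["docker", "stop", name]
    return start, stop


if __name__ == "__main__":
    data = build_demo_data("Acme Corp", ["Widget", "Gadget", "Gizmo"])
    seed = write_seed_file(data, "acme-seed.json")
    start, stop = demo_commands("acme", seed)
    print(" ".join(start))
    print(" ".join(stop))
```

Each client gets its own container name and its own seed file, so two sales engineers can prep two different demos without stepping on each other. To actually launch a demo you would pass the `start` list to `subprocess.run`, and pass `stop` when the call is over.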

While I understand the need to have demo data for your application, I don’t particularly like having it in the live application itself. I’d much prefer it in some sort of outside environment explicitly designed for demos. Since most applications don’t need a permanent, dedicated environment for demos, Docker containers would be a great fit for this sort of thing. Just spin up a container (or set of containers) with a dedicated datastore holding data that’s custom-built for the demo in question, knock the potential customer’s socks off with a presentation that was tailor-made for them, and then just kill the containers, all without cluttering up the live application or wasting production resources. If you’re automating your deployments, having a separate demo environment should be trivial. If you aren’t, then hopefully this will be 1 more piece of motivation to start.