Aug 31, 2023
 

If you spend any time on any system that involves importing data, you’ll likely have heard the phrase “garbage in, garbage out” (GIGO). It’s usually muttered (obscenities optional) right around the point a developer has given up on trying to “fix” the data coming into a system and is willing to abandon users to whatever crappy, unusable, incorrect data they have floating around on their system. As long as whatever junk users are importing doesn’t actually break your code, let them suffer.

My first introduction to GIGO was working at a media metadata company. We were reading in a CSV file of data about clips of a show, to be used to populate a web tool where editors could browse those clips and note which ones should be used for show promos (if you remember watching TV and seeing commercials for sitcoms where the voice-over would be intercut with characters delivering punchlines or a quick back-and-forth, then you’ve seen those types of clips).

The engineer who originally worked on the project before handing it off to me had spent a lot of time going back and forth with the data and the company supplying it (who owned and produced the show), working to get clip times to line up with what was in the transcript, making sure the episode information was correct, that the transcript matched what was actually said, etc. Technically, everything was working pretty quickly. It didn’t take long for him to build something that would take a CSV record tagging season 2, episode 3 from 02:38.012 to 02:40.155 with a transcript of what was being said in the scene, and index it so it would be searchable later. The problems came when it turned out the scene in question was in episode 4, or started at 02:41.938, or that what the characters were saying didn’t match the transcript we had. Eventually, he decided that if the company couldn’t be bothered to make sure their data was actually correct in the first place, they could deal with a tool that was always wrong, because it wasn’t worth the time and effort for him to do a quality control check for them. So long as the clip being shown matched the show data (season/episode/timestamp) from the file, the tool was working.

My next experience was working at an email marketing company on a tool for importing contact data in bulk: essentially a list of people who signed up for an email newsletter at the checkout counter of a retail store, but also bulk updates of contact data (say, someone changed customer details in the commerce part of the system). This was both particularly eye-opening and infuriating. When configuring the tool, we let our users map CSV fields to fields in our system, complete with a preview. They could see that column 3 held first names, column 5 held last names, column 1 was the email address, etc. Surely, with this level of convenience and visibility, errors should be minimal, right?

Well, that depended. Some companies made a best effort to keep their data clean, but nobody’s 100% correct. Some companies just dumped everything they had and threw it at the wall to see what imported. When the app was first written, we tried to be nice by storing the errors from your most recent few imports in a MongoDB database so we could show them to you in a nice table in the application: one row per error, with the line number, the specific error message, what we parsed from the line, and the raw line itself, all to make it as simple as possible to fix the file and re-run the import. One larger company just ran all their data through on a weekly basis (basically dumping their database every 7 days, instead of just what changed, which is what the app was built for), and apparently didn’t make the slightest effort to make sure the data they were dumping matched how they had configured the import job. As a result, our error database exploded. It seemed half or more of each run couldn’t be imported. When someone from our professional services team who was contracted to help them with all their email marketing needs reached out to ask them to maybe make an effort to provide importable data, they were told “We’re {large company X} – you figure it out.” Ultimately we had to update our code to just dump failed rows to a downloadable CSV file in S3 and only have the app show unique errors, the first line where we encountered each one, a count of how many times it happened, and a link to the downloadable file for full details.

So given the gamut from “tries their best and does a pretty good job” to “f*** it, we don’t care,” what’s the right approach to dealing with situations where there’s a bunch of data outside your system that needs to come in? Surely GIGO can’t be the best possible approach, right? Well…sadly, yes.

We may care about the quality of our code and data, but we can’t control the correctness of data coming into our systems, only its validity against business rule constraints. And if the customer is the one who’s going to be using the data being ingested, then all we can really do is shrug our shoulders and let them shoot themselves in the foot. Sure, it’s tempting to say “we should try to fix this automatically during the import process,” but the problem with that is it involves making decisions about other people’s data. How certain are you that you’re capable of guessing what the customer wants, every time? Because if you’re wrong, you’ve ticked off a paying customer instead of building a useful feature. Personally, I’ve found that outside of setting reasonable defaults in my getters and setters (doable in cases where I understand the business context and know what represents the best “when in doubt, do {X}” scenario), the safest thing to do is to let customers be in charge of their own data. If you’re ingesting the data for your own use, that’s different, and you should feel free to do any clean-up that’s safe for your system.

Just because making guesses about user intent and changing their data accordingly is a bad idea doesn’t mean there’s nothing you can do about ingesting external data into your system. Like I mentioned earlier, define reasonable defaults (including null if need be) and use those. Figure out any validation rules around the fields you’re importing, and bake those into the objects (yes, you can put logic in your setVariable() methods; in fact, you should).
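As a rough sketch of what that can look like (the class, fields, and constraints here are hypothetical stand-ins for your actual business rules), the validation and defaults live right in the setters, so they apply no matter how the object gets built:

```java
public class ContactRecord {
    // Hypothetical constraint; substitute whatever your business rules actually say.
    private static final int MAX_NAME_LENGTH = 100;

    private String email;
    private String firstName = ""; // reasonable default: empty string rather than null

    public void setEmail(String email) {
        // Validation lives in the setter, so it runs for imports, API calls, and the UI alike.
        if (email == null || !email.contains("@")) {
            throw new IllegalArgumentException("Invalid email address: " + email);
        }
        this.email = email.trim().toLowerCase();
    }

    public void setFirstName(String firstName) {
        // Fall back to the default instead of accepting null.
        if (firstName == null) {
            this.firstName = "";
            return;
        }
        if (firstName.length() > MAX_NAME_LENGTH) {
            throw new IllegalArgumentException("First name exceeds " + MAX_NAME_LENGTH + " characters");
        }
        this.firstName = firstName.trim();
    }
}
```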

Validate any values that do come in, and decide what you want to do with invalid records. Do you want to fail the whole import (if the entire process needs to be atomic), or just the individual record? Can you maybe fall back to your default values (for example, if a numeric field has a minimum and maximum value, and the entered value is too big, can you safely fall back to the maximum supported value)? Effectively, the best way to do data ingestion (and most CRUD writes, to be honest) is to assume the data coming in is invalid until proven otherwise.
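Here’s a sketch of both decisions, assuming a hypothetical RowError record and an importRow callback that parses, validates, and saves a single row; the clampToRange() helper is the “fall back to a supported value” option, and the atomic flag switches between aborting the whole run and just skipping the bad record:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class CsvImport {
    // Hypothetical holder for a failed row.
    public record RowError(int lineNumber, String rawLine, String message) {}

    // One safe fallback for an out-of-range numeric field: clamp it to the supported range.
    static int clampToRange(int value, int min, int max) {
        return Math.max(min, Math.min(max, value));
    }

    // Process every row, choosing between "fail the whole import" and "skip the bad record".
    static List<RowError> process(List<String> rows, Consumer<String> importRow, boolean atomic) {
        List<RowError> errors = new ArrayList<>();
        for (int i = 0; i < rows.size(); i++) {
            String raw = rows.get(i);
            try {
                importRow.accept(raw); // parse, validate, and save a single row
            } catch (IllegalArgumentException e) {
                if (atomic) {
                    // The entire process needs to be atomic, so one bad record sinks the run.
                    throw new IllegalStateException("Import aborted at line " + (i + 1), e);
                }
                // Otherwise record the failure and keep going.
                errors.add(new RowError(i + 1, raw, e.getMessage()));
            }
        }
        return errors;
    }
}
```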

You also need to figure out how to report errors to the user. Ideally you’re doing this in a format that makes it easy to re-submit the records after the user fixes them. In the case of our email marketing contact ingestion, errors were put into a CSV consisting of the original row with notes on what failed appended as extra columns. That way, after users fixed their file, they could just delete the extra columns and re-submit the file as-is. If you’re processing records out of an error queue, fixing the issue should leave you with something you can drop back into the ingestion queue immediately. Exactly how this is displayed to users depends on your app, but retry-ability is something that generally gets left out when dealing with import errors.
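A sketch of that kind of error file, assuming a hypothetical FailedRow type that carries the raw line and the error message (the exact layout is an example, not a prescription):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class ErrorReport {
    // Hypothetical shape of a failed row: the original CSV line plus what went wrong.
    public record FailedRow(String rawLine, String message) {}

    // Append the error message as an extra column so the user can fix the values,
    // delete the extra column, and re-submit the same file.
    static void writeFailedRows(Path outputFile, List<FailedRow> failures) throws IOException {
        List<String> lines = new ArrayList<>();
        for (FailedRow failure : failures) {
            String quotedMessage = "\"" + failure.message().replace("\"", "\"\"") + "\"";
            lines.add(failure.rawLine() + "," + quotedMessage);
        }
        Files.write(outputFile, lines);
    }
}
```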

Another practice that I have never regretted is making the ingestion code a wrapper around existing APIs and methods. There’s a temptation, when given a CSV file or told you need to do some bulk operation on all {condition} records in a database, to just write a script that updates the database directly. It’s quick, it’s efficient, and it’s probably a reaction to the fact that you never wrote your code to do this in bulk. It’s also a time bomb for problems down the line, including data problems (yes, even if you validated all the fields). That’s because there are business rules and logic around making these changes, and they’re not captured in the database; they’re captured in the code. The code that you’re bypassing to write directly to the database. Don’t do that. That logic and those rules are there for a reason, and that reason is so everything works the way it’s supposed to. Even if your import script is nothing more than a loop that calls your service once per record, do it that way.
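In code, that can be as boring as the sketch below (ContactService and updateContact() are hypothetical stand-ins for whatever single-record API your application already exposes); the point is that every business rule that applies to one update applies to all of them:

```java
import java.util.List;

public class BulkContactImport {
    // Hypothetical existing service; in reality this is whatever your app already uses
    // to update a single contact, business rules and all.
    public interface ContactService {
        void updateContact(String email, String firstName, String lastName);
    }

    private final ContactService contactService;

    public BulkContactImport(ContactService contactService) {
        this.contactService = contactService;
    }

    // The "bulk" operation is nothing more than a loop over the existing service call.
    public void importAll(List<String[]> parsedRows) {
        for (String[] row : parsedRows) {
            contactService.updateContact(row[0], row[1], row[2]);
        }
    }
}
```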

Other than the odd one-off, I haven’t had to operate on data in bulk for a while now. That said, remember that data you’re not creating and managing directly in your systems is a wild card. It could be largely well-maintained, clean, and correct. It could be so bad that God only knows how people get anything done with it. And this applies to all data coming into your systems, not just the stuff people are trying to import: API calls, even your own application if there’s a bug in the front-end validation.

We spend a lot of time building front-ends for our applications to make sure that data is safe, clean, and valid. The downside is that we build these front-ends assuming people will only ever work on one thing at a time, and will only ever compile this data in our applications. Then we ship and realize that users have a lot of these things they need to submit, and that a one-at-a-time approach isn’t going to do the trick. That results in people coming up with some sort of quick import solution for the immediate term, occasionally with a thought towards parallelizing our validation/processing later in preparation for a “proper” bulk endpoint.

Another issue is that over time, and in aggregate, data gets messy. Usually this isn’t intentional: time creates lots of opportunities for validation to fail and let something slip through that you never caught, for data to get into an inconsistent state (something changed but never got propagated elsewhere, like into your system), or some other problem. You may not even realize it until it’s used as input to another system. Building validation in at the object level, defining workable defaults, and assuming data is bad until validated otherwise (during object creation, when each field is set, and with annotations like @Valid in Java) protects your code no matter how data comes into your system. It also means you can add simplistic (albeit not performant) bulk operations simply by calling this existing code in a loop. The big thing to remember about dealing with multiple input records is this: it’s not paranoia if it’s not your data. Assume the data will break your system until proven otherwise. If it’s valid but wrong…well…“garbage in, garbage out.”
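For the annotation-driven side of that, here’s a minimal sketch using the Bean Validation (Jakarta Validation) API; the fields and constraints are made up for illustration, you’d need a provider like Hibernate Validator on the classpath, and in most frameworks @Valid on a method parameter triggers this same check for you:

```java
import java.util.Set;

import jakarta.validation.ConstraintViolation;
import jakarta.validation.Validation;
import jakarta.validation.Validator;
import jakarta.validation.constraints.Email;
import jakarta.validation.constraints.NotBlank;
import jakarta.validation.constraints.Size;

public class AnnotatedContact {
    // Constraints declared on the object itself, so they travel with the data
    // no matter how it enters the system.
    @NotBlank
    @Email
    private String email;

    @Size(max = 100)
    private String firstName;

    public AnnotatedContact(String email, String firstName) {
        this.email = email;
        this.firstName = firstName;
    }

    public static void main(String[] args) {
        Validator validator = Validation.buildDefaultValidatorFactory().getValidator();
        AnnotatedContact contact = new AnnotatedContact("not-an-email", "Ada");
        Set<ConstraintViolation<AnnotatedContact>> violations = validator.validate(contact);
        violations.forEach(v -> System.out.println(v.getPropertyPath() + " " + v.getMessage()));
    }
}
```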

Posted at 11:45 AM