Feb 28, 2022
 

I was working on a hotfix that turned hotter than I wanted, all because we missed an important test case and wound up scrambling mid-deployment to resolve an issue we should have caught earlier. We got lucky and fixed it inside the deployment window, but we shouldn’t have been in that position: the scenario that failed was a known requirement, which means we should have tested it. So what happened? Well, point blank, we forgot a test case.

The problem

Our application lets customers pull a PDF file based on a document ID. Like many basic IDs, this one was sequential, and like many customer-specific endpoints, this one returned customer-specific data that shouldn’t be shown to anyone else. Our endpoint took the document ID, looked up the document, and returned it. Because the endpoint wasn’t really an API endpoint, but rather just a back-end call, we skipped security measures we should have had on the request (starting with: don’t accept incremental ID values). As a result, as long as you had logged into a user account, you could just adjust the document ID and pull data you shouldn’t see. Obviously, that’s a problem.

Lessons learned

This one is more like a painful reminder: just because your endpoints aren’t really an API doesn’t absolve you of securing them like they are. REST calls can easily be reverse-engineered into API calls, so “this will only be used from the app” isn’t going to cut it. While you’re at it, once again, don’t accept raw ID values for data lookups, especially when you’re dealing with sensitive data. Use something that can’t simply be incremented and sent back to the server. And validate the request: either link the ID to the active session or make the user confirm their identity before you return data.
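To make that last point concrete, here’s a minimal sketch of an ownership check tied to the authenticated session, in the same JAX-RS style as the code later in this post. The documentRepository, resolveDocumentId, and getOwnerId names are hypothetical stand-ins for illustration, not our actual code:

@GET
@Path("/documents")
@Produces(MediaType.APPLICATION_JSON)
public Response getDocument(@QueryParam("id") String documentToken,
                            @Context SecurityContext securityContext) {

    // Require an authenticated session before doing anything else.
    if (securityContext.getUserPrincipal() == null) {
        return Response.status(401).build();
    }
    String userId = securityContext.getUserPrincipal().getName();

    // The token in the request is opaque (not a raw, incrementable ID);
    // resolveDocumentId() turns it back into the real document ID.
    Document document = documentRepository.findById(resolveDocumentId(documentToken));

    // Tie the lookup to the caller: if the document isn't theirs, respond
    // exactly as if it doesn't exist.
    if (document == null || !document.getOwnerId().equals(userId)) {
        return Response.status(404).build();
    }

    return Response.ok()
        .entity(document)
        .build();
}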

The fix

OK, so mistakes were made and now we need to clean up our act. We had an internal service that offered an encryption endpoint, but in the process of confirming it would work for this scenario, we noticed a key management flaw that also needed to be patched. The basics are pretty straightforward: update how we handle keys, generate a new key to use with the updated key management, and then start encrypting our document IDs for use across the wire. That part went pretty well; the new changes worked in the context where we originally found the issue. During QA, though, we stumbled across a regression in the existing services that used the old version of the encryption logic, which is annoying, but that’s why we test.

Regression #1

The regression occurred when trying to confirm an (encrypted) ID on a record from this other system against the ID from ours. It’s important to note that this was happening in two places: inside our app and when clicking a link from a confirmation email our system sent.

At the time (we were working on a system we didn’t have much experience with, in order to spread expertise amongst scrum teams), we thought the problem was that the value we were confirming against was saved as an encrypted value in the database (spoiler alert: it wasn’t). This led me to believe we were looking at a backwards-compatibility problem: existing data was encrypted under the old key, new data was encrypted under the new key, and the changes had replaced the old key. That didn’t seem like a big deal; just add the original encryption/decryption key as a fallback to support existing data.

Regression #1 fix

The reality was that, in adding backwards compatibility, I wound up fixing a bug in the ID comparison logic, but only in our application (don’t forget, this encrypted ID was being emailed too). Basically, under our new key management/encryption scheme, additional metadata is included in each encryption operation, so encrypting the same string twice produces different ciphertext each time. Our old key management/encryption logic didn’t do this, so we had been able to just compare ciphertext. That wasn’t a big deal to fix: just decrypt the values and compare the plaintext.
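If it’s not obvious why randomized encryption breaks ciphertext comparison, here’s a small, self-contained sketch of the general idea. It uses AES-GCM with a random IV standing in for the “metadata”; that’s an assumption for illustration, not our internal encryption service:

import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import java.util.Arrays;

public class CiphertextComparisonDemo {

    public static void main(String[] args) throws Exception {
        KeyGenerator keyGen = KeyGenerator.getInstance("AES");
        keyGen.init(256);
        SecretKey key = keyGen.generateKey();

        String documentId = "12345";

        // Encrypting the same ID twice with a random IV produces
        // different ciphertext each time...
        byte[] first = encrypt(documentId, key);
        byte[] second = encrypt(documentId, key);
        System.out.println(Arrays.equals(first, second));                     // false

        // ...so the only safe comparison is on the decrypted plaintext.
        System.out.println(decrypt(first, key).equals(decrypt(second, key))); // true
    }

    static byte[] encrypt(String plaintext, SecretKey key) throws Exception {
        byte[] iv = new byte[12];
        new SecureRandom().nextBytes(iv);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
        byte[] ciphertext = cipher.doFinal(plaintext.getBytes(StandardCharsets.UTF_8));

        // Prepend the IV so the decrypt side can read it back.
        byte[] out = new byte[iv.length + ciphertext.length];
        System.arraycopy(iv, 0, out, 0, iv.length);
        System.arraycopy(ciphertext, 0, out, iv.length, ciphertext.length);
        return out;
    }

    static String decrypt(byte[] data, SecretKey key) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, key,
            new GCMParameterSpec(128, Arrays.copyOfRange(data, 0, 12)));
        byte[] plaintext = cipher.doFinal(Arrays.copyOfRange(data, 12, data.length));
        return new String(plaintext, StandardCharsets.UTF_8);
    }
}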

When we re-tested in our application, everything worked. At this point, we assumed the email links would work too, because we thought the encrypted ID being sent for confirmation and the value from the database would always be encrypted using the same keys.

Lesson learned #1

Rather than assuming how this system behaved, I should have looked up the code that pulled the data from this other service (just because another sprint team did most of the work on it didn’t mean the rest of our larger development team couldn’t read the code) and confirmed where the encrypted value we were comparing against came from. Then I would have seen that it was being encrypted in transit, not in the database. That mistake is going to bite me in the butt later.

Regression #2 – a deploying to production story

Just before the deployment, an engineer created a record, with an encrypted ID, on the other system to smoke test our backwards compatibility in production. After the deployment, we checked everything in the application, and it worked. Hooray, the process succeeded…right? We thought so too, until that engineer clicked the link from the email he got just before the deployment started…and it didn’t work.

So what happened? After digging into values from our logs and going through the code line by line, we realized our mistake. Remember back in Regression #1, where I thought the encrypted value we were using for comparison was saved as an encrypted value in the database, which led me to assume in the Regression #1 fix that I’d either be dealing with two values encrypted with the old key or two values encrypted with the new key? And how I said in Lesson learned #1 that was going to bite me in the butt? It turns out those IDs were encrypted in transit. Everything worked in the application because even records created before the update had their IDs encrypted under the new key during the validation between systems. But links in emails don’t get updated when you update your web services; they’re static. That created a scenario where the system I was checking against had an ID encrypted with the new key, while the link had the ID encrypted with the old key, and my logic fell apart. The original code looked something like this:

@GET
@Path("/documents")
@Produces(MediaType.APPLICATION_JSON)
public Response getDocument(@QueryParam("encryptedId") String encryptedId) {

    // encryptedId is a required parameter.
    if (StringUtils.isNullOrEmpty(encryptedId)) {
        return Response.status(400)
            .build();
    }

    Document document = getDocumentFromExternalSystem(encryptedId);

    /* 
     * Check the IDs to be sure they match, otherwise return a 404 
     * since we couldn't find a document associated with the user.
     */
    String suppliedId = decryptIdV2(encryptedId);
    String documentId = decryptIdV2(document.getEncryptedId());

    // Maybe these were encrypted with the old key, double-check before failing. 
    if(!suppliedId.equals(documentId)) {
        suppliedId = decryptId(encryptedId);
        documentId = decryptId(document.getEncryptedId());

        /*
         * If this fails, return 404 because we couldn't find a document based on
         * supplied ID
         */
        if (!suppliedId.equals(documentId)) {
            return Response.status(404)
                .build();
        }
    }

    return Response.ok()
        .entity(document)
        .build();
}

It’s worth noting that when a decryption method failed (the key types changed when we updated key management, so decrypting old ciphertext with the new key failed and vice versa), it returned text that we could easily check for. What I should have done was check each decryption result individually for a failure and, if it failed, fall back to the old key for that ID. So, this:

@GET
@Path("/documents")
@Produces(MediaType.APPLICATION_JSON)
public Response getDocument(@QueryParam("encryptedId") String encryptedId) {

    // encryptedId is a required parameter. 
    if (StringUtils.isNullOrEmpty(encryptedId)) { 
        return Response.status(400)
            .build(); 
    } 

    Document document = getDocumentFromExternalSystem(encryptedId); 
    
    /* 
     * Check the IDs to be sure they match, otherwise return a 404 
     * since we couldn't find a document associated with the user. 
     */ 
    String suppliedId = decryptIdV2(encryptedId); 

    // TODO: ENCAPSULATE THIS FALLBACK LOGIC IN 1 DECRYPTION 
    // FUNCTION NOW THAT WE'RE DOING THIS AT THE INDIVIDUAL ITEM LEVEL.
    /* Make sure the V2 key was the right choice, if not, use the V1 key. */
    if (decryptionFailed(encryptedId, suppliedId)) {
        suppliedId = decryptId(encryptedId);
    }

    String encryptedDocumentId = document.getEncryptedId();
    String documentId = decryptIdV2(encryptedDocumentId); 
  
    /* Make sure the V2 key was the right choice, if not, use the V1 key. */
    if (decryptionFailed(encryptedDocumentId, documentId)) {
        documentId = decryptId(encryptedDocumentId);
    }

    /* 
     * If this fails, return 404 because we couldn't find a document based on 
     * supplied ID 
     */ 
    if (!suppliedId.equals(documentId)) { 
        return Response.status(404)
            .build(); 
    } 
 
    return Response.ok() 
        .entity(document) 
        .build(); 

}
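The TODO in that block is worth following up on: once the fallback happens per value, it wants to live in a single helper. A sketch of what that could look like, assuming the decryptIdV2/decryptId/decryptionFailed helpers behave as described above:

/*
 * Try the V2 key first; if decryption reports a failure, fall back to
 * the V1 key for that specific value.
 */
private String decryptIdWithFallback(String encryptedId) {
    String decrypted = decryptIdV2(encryptedId);
    if (decryptionFailed(encryptedId, decrypted)) {
        decrypted = decryptId(encryptedId);
    }
    return decrypted;
}

With that in place, the endpoint just compares decryptIdWithFallback(encryptedId) against decryptIdWithFallback(document.getEncryptedId()).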

 

Luckily, this was a quick change to make, and we were able to test it and have a new build ready to deploy inside the deployment window. Outside of a very stressful hour or so, this bug wasn’t an issue for customers.

Lesson learned #2

We didn’t test the email link because we assumed it was going to be the same as the link in the application. In reality, the link in the application was generated on the fly, using the new encrypted ID after the update, while the link in the email was static and guaranteed not to change post-deployment, which is exactly why we should have tested it. The moral of this part of the story: just because something should behave exactly the same as another part of the application isn’t a guarantee that it will. Confirming that is literally the point of testing.

Lesson learned #3

What would have been really helpful, across the board, would have been a list of scenarios that have to be checked whenever someone touches the service I touched. That introduces its own problem of writing good documentation and keeping it up to date, but such a list is vital when you have people like me, who aren’t familiar with a system, trying to make sure their code changes work. A formal list of tests to run for any given service would not only have told me how to test code I wasn’t familiar with, it’s also a good checklist to follow even on code I am familiar with, similar to how airline pilots use checklists regardless of how familiar they are with the plane.

I mentioned that keeping documentation up to date is its own problem, and I don’t want to let that be just a passing comment. Keeping documentation current is something developers are so notoriously bad at that it’s a common target of automation. I think the best approach may be to work with the product owners to build product-wide lists of acceptance criteria. Acceptance criteria are the canonical description of how the product is supposed to work, and because the list comes from product owners and forms the acceptance criteria of our tickets, it encourages us to make sure those are thorough and complete during grooming. And, not to sound lazy, but product owners are likely going to be a lot better at keeping the “official list” of product-wide acceptance criteria up to date than developers are.

As far as screw-ups go, this is fairly minor. It got fixed during the deployment window, which is about as good as you can hope for once you fail the “don’t have any bugs in the first place” test. Largely, this incident was a good reminder of how important up-to-date documentation is, especially testing plans. Even better would be automating those plans, so this could have been discovered with a simple mvn clean test or an automated integration test, anything other than relying on us to remember every possible scenario to check. Going forward, as we tackle technical debt, that’s definitely something we’re going to keep in mind.
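For the curious, the missing test is easy to describe: take an ID encrypted under the old key (the way it is in a pre-deployment email link) and confirm the document lookup still resolves it. Here’s a rough sketch of what that could look like with JerseyTest; DocumentResource and encryptIdV1 are hypothetical stand-ins for our actual resource class and old-key test helper:

import static org.junit.Assert.assertEquals;

import javax.ws.rs.core.Application;
import javax.ws.rs.core.Response;

import org.glassfish.jersey.server.ResourceConfig;
import org.glassfish.jersey.test.JerseyTest;
import org.junit.Test;

// Hypothetical regression test for the email-link scenario: an ID encrypted
// under the old (V1) key must still resolve after the key rollover.
public class DocumentLookupBackwardsCompatibilityTest extends JerseyTest {

    @Override
    protected Application configure() {
        // DocumentResource stands in for whatever resource class hosts getDocument().
        return new ResourceConfig(DocumentResource.class);
    }

    @Test
    public void acceptsIdEncryptedUnderOldKey() {
        // encryptIdV1() stands in for however old-key test data gets produced.
        String oldKeyId = encryptIdV1("12345");

        Response response = target("/documents")
            .queryParam("encryptedId", oldKeyId)
            .request()
            .get();

        assertEquals(200, response.getStatus());
    }
}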
