Mar 31, 2021
 

During some recent performance testing at work, we noticed an endpoint showing slower than usual response times. Digging into the calls, we found that on slow responses, virtually all the time was spent waiting for a specific lambda function to return data. Digging into that function, we found several calls that appeared to be timing out, and many others that were taking 9-10 seconds. This led to a day or two spent reading up on how AWS lambdas work and how they behave with Java. What was probably most striking is that while the solution to our problem was eventually findable with some Googling – and the “fix” involved practices that are mentioned in the official AWS documentation – they’re mentioned almost in passing, and the documentation doesn’t call out the performance impact they have on your billed runtime.

The root cause of this problem was that as our usage scaled up, so did the number of concurrent lambda invocations. We figured out pretty early on that the 9-10 second durations were cold starts, but that didn’t explain our (30-second) timeouts. Investigating further, we found both issues happening in production, going back multiple releases. That wasn’t particularly encouraging, but it did explain why errors were showing up in code that hadn’t been changed for several months. What it didn’t explain was how support wasn’t getting flooded with calls (this is a call that gets made in the background immediately upon logging in to populate the main dashboard, not to mention it’s related to 1 of the most commonly accessed parts of our application). Our working hypothesis is that from the user’s perspective the failure is intermittent and goes away with a page refresh, so there’s no need to call support.

So with recent changes conclusively ruled out, we were left with 2 longstanding problems – the slow executions that eventually completed successfully, and the timeouts. Checking the CloudWatch logs showed the timeouts had 2 log entries: the default entry stating that the lambda started, and then the error message that it timed out, so…not helpful at all. 1 quick Python script later, we were able to reproduce the behavior in a development sandbox and start analyzing the issue. Fleshing out the logging revealed 2 pain points: the first object we initialized (which took about 5 seconds during a cold start), and the first DynamoDB query we ran (about 4 seconds in a cold start). Just for thoroughness, we switched the order of object instantiation and added an initial DynamoDB query, and confirmed that it was always the first instance of each. Fixing these 2 pain points would take our cold start runs down to milliseconds, just like when the lambda was already “warm.” Cue a lot of Googling on slow Lambda startups during a cold start, which eventually led to the StackOverflow answer I linked above.
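
To give a rough idea of what “fleshing out the logging” looked like, here’s a minimal sketch (not our actual code – the table name and client setup are just placeholders borrowed from the examples further down). We simply wrapped the suspect steps in timers and logged the results:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.ScanRequest;

public class ColdStartTimings {

    private static final Logger logger = LoggerFactory.getLogger(ColdStartTimings.class);

    public static void measure() {
        // Time the first object we construct (the DynamoDB client stands in for
        // whatever the first "expensive" object in your handler happens to be).
        long start = System.currentTimeMillis();
        DynamoDbClient dynamoDb = DynamoDbClient.builder()
            .region(Region.US_EAST_1)
            .build();
        logger.info("Client construction took {} ms", System.currentTimeMillis() - start);

        // Time the first call that actually opens a connection to DynamoDB.
        start = System.currentTimeMillis();
        dynamoDb.scan(ScanRequest.builder().tableName("MyDynamoDBTable").limit(1).build());
        logger.info("First DynamoDB call took {} ms", System.currentTimeMillis() - start);
    }
}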

I haven’t found any explicit evidence to confirm this, but my working assumption is that AWS starts sending requests to the lambda as soon as the JVM is running and your handler function is callable. What I did learn is that running a lambda consists of 2 stages: an initialization stage and an invocation stage (the latter is what shows up as your billed duration). I also learned that the initialization phase generally has access to more memory than what you may have allocated for your lambda (more memory means a more expensive lambda, so think of this stage as a free upgrade you should be taking advantage of).
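
As a quick illustration of that split (a minimal sketch – the class and handler names here are made up), anything in a static field or static block runs during the initialization phase, before the first request ever reaches the handler method:

import com.amazonaws.services.lambda.runtime.Context;

public class PhaseDemoHandler {

    // Runs during the initialization phase, i.e. the cold start.
    private static final long initStartedAt = System.currentTimeMillis();

    static {
        // Anything here also runs during initialization, before the first invocation.
    }

    // Runs during the invocation phase, which is what shows up as your billed duration.
    public String handleIt(Object input, Context context) {
        long waited = System.currentTimeMillis() - initStartedAt;
        return "Initialization started " + waited + " ms before this invocation";
    }
}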

Taking this back to our problem, initializing all our generic, reusable objects (DynamoDB client, other utility classes, etc.) in a static block before the handler method sped that part of the lambda runtime up by about 5 seconds (that’s right, 5 whole seconds). For a service that bills by the millisecond, that’s significant savings. While in the static block, we added a “pre-query” for DynamoDB. “Pre-query,” by the way, is the highly-technical term for just running a scan on any table for 1 item. Doing that saved approximately 4 seconds (yes, another whole-number figure). It’s worth noting that, just like the order in which the objects were instantiated was irrelevant, it doesn’t really matter which table you do the single-item scan on. All that matters is that you connect to your database and perform an operation.

As extreme as those time savings sound, keep in mind we only had this type of slowness during a cold start; lambdas running in a “warm” container ran in milliseconds consistently. Another thing to keep in mind is that we weren’t doing anything out of the ordinary in terms of the function behavior (query DynamoDB, do a little bit of processing on the results, and return), or in how we set things up (here’s the handler method, declare the objects needed to run, and start doing things). That takes us back to the problem I mentioned at the end of the opening paragraph – there are simple steps you can take that have massive benefits for your lambda’s runtime, and the documentation doesn’t walk you through “Here’s how you structure a lambda in Java for the best performance.”

When you start writing lambda handlers in Java, they generally look like this:

import java.util.HashMap;
import java.util.Map;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.amazonaws.services.lambda.runtime.Context;

import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.DynamoDbException;
import software.amazon.awssdk.services.dynamodb.model.QueryRequest;
import software.amazon.awssdk.services.dynamodb.model.QueryResponse;

// MyInput and MyOutput are placeholder classes standing in for your function's request and response types.
public class MyLambdaHandler {

    private static final Logger logger = LoggerFactory.getLogger(MyLambdaHandler.class);

    /*
     * Example adapted from https://github.com/awsdocs/aws-doc-sdk-examples/blob/master/javav2/example_code/dynamodb/src/main/java/com/example/dynamodb/Query.java,
     * mostly because it's faster and easier to look up AWS's examples for dealing with DynamoDB than to try to remember everything going through the Javadocs.
     */
    public MyOutput handleIt(MyInput myInput, Context context) {

        // Building the client here means every cold start pays for this inside the handler.
        DynamoDbClient dynamoDb = DynamoDbClient.builder()
            .region(Region.US_EAST_1)
            .build();

        Map<String, String> placeholderToAttributeName = new HashMap<>();
        placeholderToAttributeName.put("#keyField", "myKey");

        Map<String, AttributeValue> placeholderToAttributeValue = new HashMap<>();
        placeholderToAttributeValue.put(":keyValue", AttributeValue.builder().s(myInput.getKeyValue()).build());

        QueryRequest queryRequest = QueryRequest.builder()
            .tableName("MyDynamoDBTable")
            .keyConditionExpression("#keyField = :keyValue")
            .expressionAttributeNames(placeholderToAttributeName)
            .expressionAttributeValues(placeholderToAttributeValue)
            .build();

        try {
            QueryResponse response = dynamoDb.query(queryRequest);
            MyOutput myOutput = new MyOutput();
            myOutput.processData(response);
            return myOutput;
        } catch (DynamoDbException e) {
            logger.error("Error querying DynamoDB!", e);
            // Just returning null would cause problems in the real world, but error handling is situation-specific, so I'm just leaving the placeholder here.
            return null;
        }
    }
}

 

Compare this to the examples in the official Lambda documentation and things are pretty close. Sure, they initialize some constants outside the handler, and there’s a blurb saying you can initialize things outside the handler that are re-used between invocations, but there’s no mention of this being a best practice. In fact, our hypothetical code should look more like this:

import java.util.HashMap;
import java.util.Map;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.amazonaws.services.lambda.runtime.Context;

import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.DynamoDbException;
import software.amazon.awssdk.services.dynamodb.model.QueryRequest;
import software.amazon.awssdk.services.dynamodb.model.QueryResponse;
import software.amazon.awssdk.services.dynamodb.model.ScanRequest;

// MyInput and MyOutput are placeholder classes standing in for your function's request and response types.
public class MyLambdaHandler {

    private static final Logger logger = LoggerFactory.getLogger(MyLambdaHandler.class);

    private static final DynamoDbClient dynamoDb;

    static {
        dynamoDb = DynamoDbClient.builder()
            .region(Region.US_EAST_1)
            .build();

        // I don't care about the response, I just want to connect to the database before this function starts actually handling requests.
        ScanRequest warmupScan = ScanRequest.builder()
            .tableName("MyDynamoDBTable")
            .limit(1)
            .build();
        try {
            dynamoDb.scan(warmupScan);
        } catch (DynamoDbException e) {
            logger.warn("Warm-up scan failed", e);
        }
    }

    /*
     * Example adapted from https://github.com/awsdocs/aws-doc-sdk-examples/blob/master/javav2/example_code/dynamodb/src/main/java/com/example/dynamodb/Query.java,
     * mostly because it's faster and easier to look up AWS's examples for dealing with DynamoDB than to try to remember everything going through the Javadocs.
     */
    public MyOutput handleIt(MyInput myInput, Context context) {

        Map<String, String> placeholderToAttributeName = new HashMap<>();
        placeholderToAttributeName.put("#keyField", "myKey");

        Map<String, AttributeValue> placeholderToAttributeValue = new HashMap<>();
        placeholderToAttributeValue.put(":keyValue", AttributeValue.builder().s(myInput.getKeyValue()).build());

        QueryRequest queryRequest = QueryRequest.builder()
            .tableName("MyDynamoDBTable")
            .keyConditionExpression("#keyField = :keyValue")
            .expressionAttributeNames(placeholderToAttributeName)
            .expressionAttributeValues(placeholderToAttributeValue)
            .build();

        try {
            QueryResponse response = dynamoDb.query(queryRequest);
            MyOutput myOutput = new MyOutput();
            myOutput.processData(response);
            return myOutput;
        } catch (DynamoDbException e) {
            logger.error("Error querying DynamoDB!", e);
            // Just returning null would cause problems in the real world, but error handling is situation-specific, so I'm just leaving the placeholder here.
            return null;
        }
    }
}

Note the static block where I’m not only creating the DynamoDbClient, but also doing a scan on the table for 1 item, without any regard for what comes back. Remember, that’s because the DynamoDbClient isn’t going to actually connect to your database until you try to perform an operation with it. So, while we’re getting everything set up before our function first runs, why not ping the database and get the SSL handshake that happens on that first call out of the way? Now by the time we enter the function handler, we’ve got a live database client with a hot connection, even though our lambda was booting up from a cold start. The results were incredible – our performance with a cold start was similar to that from a warm start.

Disclaimers you want to keep in mind about all of this

First things first – premature optimization is the root of all evil, but in this case I think we can safely say it’s worth it. I’m basing this on 2 facts – the first being that these changes improved the performance of the code by seconds (not milliseconds, seconds). The other is that, as you can see from the example, it’s pretty easy to do in your code. Even refactoring existing lambda functions would likely be more time-consuming than it is difficult or risky. The levels of payoff, effort needed, and risk involved all combine to qualify this sort of refactoring, in my opinion, as low-hanging fruit.

While we’re certainly improving the billed duration metric, we shouldn’t kid ourselves that we’re actually saving nearly that much “clock on the wall” time. It still takes time to get the JVM up and running enough to instantiate objects, and establishing the connection to a remote service like DynamoDB is going to take a few seconds just because you have all the associated back-and-forth that comes with an initial connection. All these changes do is push that time farther back in the process, to the initialization stage (at this point, the lambda function generally has access to more memory than the execution context would have been allotted, so there is some improvement). Effectively, we’re making the fact that this part of the code is slower Amazon’s problem.

It’s worth noting there are other ways to solve the cold start problem – specifically, provisioning enough capacity for your lambda function that you’re guaranteed to cover your baseline usage. That costs extra, but depending on how often your function needs to run, it may be worth it. Similarly, if your lambda is invoked often enough that it only cold starts during traffic spikes or when you update the lambda code, you can mostly ignore this problem (especially if you have fairly consistent loads on your lambda functions). Either way, though, cold starts are still lurking whenever you update the lambda code or need to scale up to address a spike in usage.
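
If you do want to go the provisioned capacity route, it’s configuration rather than a code change. As a rough sketch (assuming the AWS SDK for Java v2 Lambda client, with a made-up function name and a placeholder version qualifier – provisioned concurrency applies to a published version or alias, not $LATEST), it looks something like this:

import software.amazon.awssdk.services.lambda.LambdaClient;
import software.amazon.awssdk.services.lambda.model.PutProvisionedConcurrencyConfigRequest;

public class ProvisionedConcurrencyExample {

    public static void main(String[] args) {
        try (LambdaClient lambda = LambdaClient.create()) {
            // Keep 10 execution environments initialized for version "1" of my-function.
            lambda.putProvisionedConcurrencyConfig(PutProvisionedConcurrencyConfigRequest.builder()
                .functionName("my-function")
                .qualifier("1")
                .provisionedConcurrentExecutions(10)
                .build());
        }
    }
}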

All of this is probably fairly standard stuff, but in Googling for tips to improve our cold start times, I never really saw anything that explicitly said “do this – it saves a lot of execution time.” I’m sure there are similar steps that can be taken for the other lambda runtimes, but at the time we were focused on the 1 function causing problems. The biggest lesson in all of this isn’t really “here’s how to make lambdas run faster during a cold start,” but rather “here’s why it’s valuable to understand the infrastructure that’s running your code.” Nothing about this was clever coding or a particularly innovative approach to a common problem; it was just spending more time Googling about my execution environment. The same thing applies even if you’re not using managed infrastructure like lambda. You can’t always rely on an operations or infrastructure team to solve these problems for you, so if you ever find yourself in a situation where doing some research into the framework or infrastructure you’re using seems like it may be useful, take the opportunity. Even if you don’t glean anything now, it’ll make you a better developer going forward.
