Nov 07 2014

One of the last projects I worked on at my previous job involved aggregating, storing, and querying log data into and from Elasticsearch (yes, I know that Logstash does that – and in reality I should have gone that route). That, along with some lookups on the data outside of the code, gave me a chance to start playing with Elasticsearch. After my brief experience with it, I can tell you there’s a lot of power in Elasticsearch, but it’s going to take you surprisingly long to figure out how to tap it.

Elasticsearch gives you a lot of options for querying and filtering data. Looking at Elasticsearch’s documentation takes you to notes on 39 different types of queries and 27 different types of filters. That’s not counting Elasticsearch’s aggregations, which are basically Lucene faceting with the ability to perform other operations on the faceted data. Point blank, however you want to slice and dice your data, odds are Elasticsearch has something for it.
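To give a feel for the query/filter split, here’s a minimal sketch of a 1.x-era request body – a scoring “match” query wrapped together with a non-scoring “term” filter in a “filtered” query. The index layout and field names (“message”, “level”) are made up for illustration:

```python
import json

# Hypothetical search body: full-text match on "message", scored,
# plus a "term" filter on "level" that narrows results without
# affecting the score. This mirrors the ES 1.x "filtered" query shape.
query = {
    "query": {
        "filtered": {
            "query": {"match": {"message": "timeout"}},
            "filter": {"term": {"level": "error"}},
        }
    },
    "size": 10,
}

# The dict serializes directly into the JSON body of a search request.
body = json.dumps(query)
```

The nice part of the DSL being plain JSON is exactly this: a query is just a nested dictionary, so building one programmatically is trivial.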

In fact, Elasticsearch has a very large JSON-based query DSL powering it. It’s very verbose, with plenty of options for tweaking each part of your query as you see fit. On the one hand, this makes adding, altering, or removing components from queries easy, since each type of query is its own distinct object in the DSL. The downside is that you’re submitting these queries via an HTTP POST body, rather than a GET. Elasticsearch does provide a URI-based search API, but compared to the DSL it’s extremely limited. In other words, unless you’ve built a tool for testing out queries, or you only need something as simple as “does this value exist” or “does this field contain this value” (there’s stuff like limits and sorting available too), you’re going to need to use something like the Advanced REST Client extension from the Chrome Web Store, write a script with your Elasticsearch query hard-coded in, or stop and write a tool that builds and runs queries for you in order to really browse your Elasticsearch data.
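The contrast looks something like this sketch – the full DSL body you’d POST versus the cramped URI search you can fit in a browser’s address bar. The host, index, and field names here are hypothetical:

```python
import json
from urllib.parse import urlencode

# Full DSL query: arbitrary nesting, but it has to ride in the request body.
dsl_body = json.dumps({"query": {"match": {"status": "active"}}})

# The much more limited URI search: everything crammed into a query string
# using the "q=field:value" Lucene syntax, plus a few knobs like size/sort.
uri_params = urlencode({"q": "status:active", "size": 5, "sort": "created:desc"})
url = "http://localhost:9200/myindex/_search?" + uri_params
```

The URI form is the one you can paste into a browser tab, which is exactly why its limitations sting so much.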

It’s one thing if you have an app that allows you to just browse your data, but coming from an entirely GET-based API that could run all of its queries in a browser window, this limitation was frustrating. I had been conditioned to being able to just open a browser tab and run any query I could think of on the data we had, and Elasticsearch doesn’t really make this possible. Basically, to be able to answer any question about the data you have in Elasticsearch, you’re going to need to write fully-functioning search code, even if your main use case is just a very simple query.

One area where Elasticsearch truly shines is with its aggregations. They’re an evolution of faceting that offers a lot of power when it comes to examining your data. While the base aggregations are your typical facets, there are also aggregations to find minimums, maximums, averages, percentiles, and a variety of other statistical funness. There are also aggregations that let you aggregate data from parent documents, and even aggregate documents together based on a common field value (essentially grouping – see this StackOverflow question to see how to make it work).
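A sketch of what such a request body might look like – one plain terms facet alongside a few of the metric aggregations mentioned above. The field names (“status”, “latency_ms”) are invented for the example:

```python
# Hypothetical aggregations-only request: "size": 0 suppresses the hits
# themselves since we only care about the aggregation results.
agg_query = {
    "size": 0,
    "aggs": {
        "by_status": {"terms": {"field": "status"}},          # classic facet
        "avg_latency": {"avg": {"field": "latency_ms"}},      # average
        "max_latency": {"max": {"field": "latency_ms"}},      # maximum
        "latency_pcts": {"percentiles": {"field": "latency_ms"}},  # percentiles
    },
}
```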

Another wonderful bit about aggregations in Elasticsearch is that you can nest them together. I don’t mean you can aggregate on two different criteria; I mean you can aggregate on a field, and for every bucket in that aggregation, run another aggregation. However, there is something you need to be aware of if you’re going to do a lot of aggregating in Elasticsearch. By default, Elasticsearch breaks string field values up into tokens that can be easily searched. That means things like spaces or dashes create distinct items in your aggregations. In other words, if you aggregate on a field that has the value “This is my text” or “my-server-name” (that last one mirrors what I ran into playing around with Elasticsearch aggregations in Java), you get aggregations with the values “This”, “is”, “my”, and “text”, or “my”, “server”, and “name”. There’s a way to configure Elasticsearch so it doesn’t break those fields up into tokens for aggregating purposes, but it’s not the default, and you’re going to have to set it explicitly when you first create the index.
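Both points can be sketched together. The mapping below uses the 1.x-era `"index": "not_analyzed"` setting to keep a string field from being tokenized (so “my-server-name” stays one bucket), and the query nests an average inside a terms aggregation. The index, type, and field names are all hypothetical:

```python
# Hypothetical index-creation body: "not_analyzed" must be set up front,
# when the index is created, so "hostname" values stay intact for aggregating.
mapping = {
    "mappings": {
        "logline": {
            "properties": {
                "hostname": {"type": "string", "index": "not_analyzed"},
                "latency_ms": {"type": "long"},
            }
        }
    }
}

# Nested aggregation: bucket documents by hostname, then compute the
# average latency separately inside every hostname bucket.
nested_agg = {
    "size": 0,
    "aggs": {
        "per_host": {
            "terms": {"field": "hostname"},
            "aggs": {"avg_latency": {"avg": {"field": "latency_ms"}}},
        }
    },
}
```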

One area of frustration I had was Elasticsearch’s learning curve. Put succinctly, it’s steep. If you want to do anything in Elasticsearch, you’re going to have to do it the Elasticsearch way, and there is no tolerance for error. Granted, once you do things the Elasticsearch way, it works very well. In the meantime, you’re going to have a lot of queries that error out on you, and Elasticsearch’s error responses are less than helpful – just that the query was bad and couldn’t be parsed. No “I saw this here, and I was expecting that”, no “Hey, you forgot a required field”, just failure. Elasticsearch query errors are just like C/C++ segmentation faults, and I for one don’t appreciate it, at all.

Here’s an example of how getting used to Elasticsearch can be irritating – try querying for documents matching against two different fields. The best I could do was a query and a subsequent filter. Stuff like this is what I mean about Elasticsearch being very unforgiving about how you query your data. Trying to match two fields seems like an easy thing to do, until you try to do it in Elasticsearch. There’s no logical AND operator, and I couldn’t get the boolean query to require two items in my “must” clause – anything beyond the first kept ending up in a “should” clause, which helps Elasticsearch rank the results, but matching a “should” clause isn’t a requirement. By the way, this is (what should be) a fairly common problem with a fairly simple solution. I’m sure there’s a way to do it in Elasticsearch, and I’m sure it makes complete sense once you get used to the “Elasticsearch way” of doing things, but figuring out that way can drive you crazy.
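For the record, the bool query’s “must” clause does accept a list, and every entry in it is required – it’s the “should” clause that only influences ranking. Here’s the shape that does the two-field match, with made-up field names:

```python
# Two required conditions in one bool query: a document must match
# BOTH "must" entries to be returned. "should" entries, by contrast,
# only boost scoring unless minimum_should_match is set.
two_field_match = {
    "query": {
        "bool": {
            "must": [
                {"match": {"status": "error"}},
                {"match": {"hostname": "my-server-name"}},
            ]
        }
    }
}
```

Simple once you see it – which is exactly the problem: nothing in the error messages points you toward this shape.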

When you’re developing something using Elasticsearch, you’ll have some good libraries at your disposal to help. Both the Java and Python libraries do a good job of letting you form Elasticsearch queries as close to the query DSL as possible. Python’s Elasticsearch library also has some terrific helper code for things like scanning through all the results of a query and bulk operations, and it handles all the tricky bits for you.
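A sketch of the bulk-helper pattern from the Python library: you hand `elasticsearch.helpers.bulk()` an iterable of action dicts and it deals with batching and the bulk wire format for you. The index/type/document values below are invented, and the actual call against a cluster is left commented out since it needs a running Elasticsearch instance:

```python
# Hypothetical bulk-indexing sketch: each action dict names the target
# index and type and carries the document body in "_source".
def make_actions(docs, index="logs", doc_type="logline"):
    for doc in docs:
        yield {"_index": index, "_type": doc_type, "_source": doc}

actions = list(make_actions([{"msg": "boot"}, {"msg": "shutdown"}]))

# With a live cluster, the helper consumes the generator directly:
# from elasticsearch import Elasticsearch, helpers
# helpers.bulk(Elasticsearch(), make_actions(my_docs))
```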

Overall, Elasticsearch is very powerful, very fast, and very well-suited for web applications. It’s also very particular, and you need to be willing to spend the time getting your head around how Elasticsearch thinks you’re supposed to do things. Once you get used to that way of doing things, though, Elasticsearch is great. In the meantime, there are plenty of great libraries out there to help you run your queries. I never got the chance to really get over Elasticsearch’s learning curve, but I managed to get into it enough to appreciate the potential that’s there.

Posted at 12:28 PM