Weekends are when we all want to sleep in. After a long week of work, we need time to relax, take our minds off things, and recharge. So when I got the call early on a Saturday morning telling me we were sending HTTP 500 Server Error responses to our end users, I wasn’t thrilled.
We’d recently started using the elasticsearch-py library to talk to ElasticSearch, as part of rewriting our Catalogue APIs in Falcon. Unlike many libraries, this one comes with very sensible defaults. It’s resilient out of the box: it randomizes the list of ElasticSearch nodes to split the load evenly, and it retries queries if a selected node is unreachable. For the first few weeks all ran well. Until Saturday morning, of course.
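The behaviour described above can be sketched in plain Python. This is a simplified stand-in, not elasticsearch-py’s actual implementation; the node names and the `send_query` callable are made up for illustration:

```python
import random

def query_with_retry(nodes, send_query, max_retries=3):
    """Shuffle the node list to spread load, then retry the query
    on the next node if the chosen one is unreachable."""
    candidates = list(nodes)
    random.shuffle(candidates)          # split load evenly across nodes
    last_error = None
    for node in candidates[:max_retries + 1]:
        try:
            return send_query(node)     # success: return the response
        except ConnectionError as exc:  # node unreachable: try the next one
            last_error = exc
    raise last_error                    # every node we tried failed

# Example: one node is down, yet the query still succeeds elsewhere.
def fake_send(node):
    if node == "es-node-2":
        raise ConnectionError("node unreachable")
    return {"node": node, "hits": 42}

result = query_with_retry(["es-node-1", "es-node-2", "es-node-3"], fake_send)
```

The key point is that retries only help when a bad node fails *loudly*; as we were about to learn, a node that hangs instead of erroring defeats this mechanism.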
Zdenek Styblik and I started analyzing the issues to identify the fault. It turns out that our service provider had a network partition the day of the outage. We found that some ElasticSearch nodes could not talk to each other, while some of the clients were still able to talk to the ElasticSearch nodes which were out of sync with the rest of the cluster.
We decided to fix the service for the customers and investigate the issue later. We removed the problematic ElasticSearch nodes from DNS, and the day was saved. But temporarily defusing the situation isn’t enough, of course.
To reproduce the issue, Zdenek and I set up a three-node ElasticSearch cluster. Then we cut communication to one of the nodes using iptables. The node that got cut off actually realized that something was wrong and even reported the error in its logs. So far, so good.
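Cutting a node off can be done with rules along these lines (a sketch: 9300 is ElasticSearch’s default transport port, and the exact rules we used on our test cluster may have differed):

```shell
# Drop all inter-node (transport) traffic to and from this node.
# 9300 is the default ElasticSearch transport port.
iptables -A INPUT  -p tcp --dport 9300 -j DROP
iptables -A OUTPUT -p tcp --dport 9300 -j DROP
```

Client traffic on port 9200 is left untouched, which mirrors the outage: clients could still reach the node even though the cluster couldn’t.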
Now you would expect that if the node “knows” it is not in a healthy state, it would either stop accepting new connections or start responding with an error message. Unfortunately, what happened is that queries against the affected node were hanging indefinitely (or at least longer than we were willing to wait).
After further research we found a configuration option called discovery.zen.no_master_block, which by default fails only on writes, not on reads. I have to admit, I find that a little short-sighted. But at least it’s easy to fix.
Stretching the Defaults
After changing the value from write to all, the node started responding with errors (as expected). And because elasticsearch-py automatically retries the query on another node if the first one responds with an error, everything kept working nicely even with one node broken.
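For reference, the change is a single line in each node’s elasticsearch.yml:

```yaml
# Block both reads and writes when the node has no master,
# instead of the default "write" (block writes only).
discovery.zen.no_master_block: all
```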
We’ve now rolled out this new setting to all nodes, and we haven’t served any HTTP 500 errors caused by this misconfiguration since.
What did we learn, aside from the fact that problems always happen at the most inconvenient time?
First, this shows even the best tools and libraries aren’t infallible. A simple default that you’d never otherwise notice could make all the difference.
Second, it may take some time for problems to manifest themselves after changing to a new tool. It pays to be vigilant.
Finally, this experience helped me appreciate how valuable sharing with the community actually is. I might have taken many days to find this on my own, but the fact that others had posted their solutions made our work far easier. (This is part of why I am sharing my experiences here.)