Migrating Showmax Catalogue API from Django to Falcon - Part II (Learning from mistakes)

At Showmax we try to avoid doing stupid things. But we’re only human, so sometimes we make errors anyway. In this article I will show you how we make sure that mistakes are caught early and how we learn from them.

Something’s fishy

After we decided that we actually want to proceed with rewriting the Showmax catalogue APIs from Django to Falcon, we had to write a bunch of instrumentation to make Falcon a first-class citizen of the Showmax infrastructure.

At Showmax we’re obsessed with logging and monitoring, so obviously one of the tasks was to make Falcon do remote logging of requests and exceptions. If you want to learn more about our logging infrastructure, see our previous article on tracing distributed systems.

To enable remote logging, I wrote subclasses of logging.Formatter and logging.StreamHandler and then a short helper for getting a logger instance configured to use them.

Original implementation of the get_logger helper looked like this

 def get_logger():
    logger = logging.getLogger('showmax')
    handler = ShowmaxHandler()
    return logger

At first sight there’s nothing obviously wrong with the code. So it happily passed a code review and was deployed to staging. I let it sit there for a while and then checked the monitoring. I noticed two depressing things. First, the API latency increased dramatically. And second, the number of logs in Kibana was much higher than the number of requests hitting the API.

Simple mistake, big impact

After looking at the code once again and verifying my working theory in the world’s best engineering knowledge base, I immediately realized the stupid mistake I made in the helper. Calling logger.addHandler every time the helper was invoked caused the ShowmaxHandler to be added to the logger over and over again. This led to requests being logged as many times as the helper was invoked in the current thread. As a result, the API requests were slow and we were spamming our logging infrastructure with duplicate information.

I quickly implemented a fix (based on the knowledge base article above) and redeployed the service.

 def get_logger():
    logger = logging.getLogger('showmax')
    if not logger.handlers:
        handler = ShowmaxHandler()
    return logger

After deploying the above to staging, the logging spam immediately stopped. However the request latency remained higher than before the remote logging was implemented.

Not quite out of the woods

Obviously when you do remote logging, there is always some overhead. And if logging is performed synchronously while handling the API request, it negatively affects the latency. Since at Showmax we care about the user experience and we want to make the service as pleasant as possible, it was time to get rid of the extra latency.

As I have mentioned in the previous article, we are running Falcon on uwsgi with gevent. This allowed me to easily perform the logging asynchronously. It was just a matter of wrapping the logging statement with gevent.spawn. So instead of calling the logger directly:

 logger.log(level, (req, resp))

I just told gevent to run it in a coroutine:

 gevent.spawn(logger.log, level, (req, resp))

That was easy, huh?

After deploying this change to staging and checking the graphs again, I felt happy about what I saw there. The latency was back to normal.

Catalogue latency

Latency drop after resolving the issues with remote logging.

Culture makes the difference between mistake and disaster

By having extensive monitoring, and a culture that verifies changes before they go live, we caught the issues with the remote logging implementation in the staging environment and we didn’t let them go live.

By fixing this issue I made sure that customer experience will not be negatively affected by introducing the remote logging and that we will have sensible logs in case something goes wrong.

This is what makes Showmax different. We don’t just talk about being careful. We actively look for problems and potential solutions, but we also go the extra step and ask if there’s a problem with our new ideas. So instead of just implementing and hoping for the best, it’s everyone’s responsibility to check what we’re doing, and how we’re doing it. The result is that we avoid making dumb mistakes while still having the freedom to innovate and improve. At Showmax we own our work…especially when we make a mistake we can learn from.

Please check the original version of this article at