One Small Bug For Man, One Giant Failure For CMS-kind

This is a tale of several small issues turning into an absolute clusterf^@k. It’s never just the one bug that causes a catastrophe, it’s always a synergy of multiple small issues - small issues that are so stupid, and so easy to overlook, that it’s almost unbelievable when they happen. But they do happen.

To b or not to b

One day, we were deploying our Content Management System (CMS) to production and, considering the massive changes that went into that release (all CMS services were upgraded to run on Python 3.6, large Elasticsearch index re-juggling), it went really smooth! Or so we thought initially…

After a while, we started getting a lot of ElastAlert emails for 500s on our Image API. Two minutes later we identified the root cause of these alerts and came up with a fix.

Python 3 made a massive change into its type system, and strings are no longer the same beasts as bytes. Here’s an example:

In Python 2:

>>> 'screw_up' == b'screw_up'
True

However in Python 3 the situation is quite different (the opposite, in fact):

>>> 'screw_up' == b'screw_up'
False

The reason for this significant change is because Python 3 now handles all of its strings as Unicode, whereas Python 2 handled strings as a bunch of bytes with their interpretation controlled by the system defaults (quite often insane, I’m looking at you ASCII). Strings were strings, no matter their source (files, pipes, sockets, etc.), who cares about their encoding? It’s all ASCII in the developer world anyway, right?

Well, no:

>>> import sys
>>> sys.getdefaultencoding()
'utf-8'

And, you might even get:

>>> import sys
>>> sys.getdefaultencoding()
'ascii'

That means that certain strings are unparsable/undecodable depending on the encoding setting. For example, coming in contact with “piękna dziewczyna” could give you a headache!

In Python 3, however, there is a distinction between bytes and strings. If you have a string (Unicode), you first need to figure out how you want to encode it to be representable by bytes. Think of composite Chinese characters that cannot be represented with an integer value of 0-255 (8 bits) - one byte.

Enter the world of UTF-8, an encoding of Unicode characters (code points) that can represent complex characters with sequences of just 8 bits:

>>> 'piękna dziewczyna'.encode('utf-8')
b'pi\xc4\x99kna dziewczyna'

Notice that the result of the encode is surrounded by b’…‘. This means that we have successfully turned our unicode string into UTF-8 encoded stream of bytes (those pesky little buggers that, for example, the read() syscall returns). Python distinguished the byte object from the regular string object when printing to console by surrounding it with b’…‘.

Now if we do that same thing for a string that contains just ASCII characters, we still get a bytes object:

>>> 'piekna dziewczyna'.encode('utf-8')
b'piekna dziewczyna'

Back in the Python 2 days, you could pass strings to where bytes were required and vice versa. Something would always happen, just maybe not the right thing.

That Stupid Little b’…‘

The Showmax CMS ecosystem is comprised of several micro-services and a large Django based content manager. The content management part allows content editors to set various details for all content in the system (movies, series, images, subtitles, etc.) called “assets”. One of the fields is the license window - our time frame for being able to host and stream a certain piece of content.

When an asset gets near its license window expiration, we automatically add a little corner overlay that says “Last Chance XYZ!” to its cover image. The idea is that this might entice the customer to quickly finish off the series or the movie before it disappears into the void of digital license hell.

Last Chance

Source: Showmax

None of our API endpoints actually talk SQL to obtain data - they all talk Elasticsearch. Every 30 minutes, the content management server takes the whole SQL database and de-normalizes it into huge JSON chunks that get uploaded into Elasticsearch. Through the sheer magic of Elasticsearch our APIs are crazy fast.

Here, Python 3 and its type system enter the scene. Before, we would just push strings and bytes around without worrying too much. Now, however, Python thinks it knows better, as you can see here:

Python 2:

>>> base64.urlsafe_b64encode(json.dumps({'overlay': 'Python rocks!'}))
eyJvdmVybGF5IjogIlJlYWRpbmcgdGhpcz8gV29yayBmb3IgdXMhIGdlZWtzQHNob3dtYXguY29tIn0=

Python 3:

>>> base64.urlsafe_b64encode(json.dumps({'overlay': 'jednorožec'}))
Traceback (most recent call last):
..[shortened]..
    encoded = binascii.b2a_base64(s, newline=False)
TypeError: a bytes-like object is required, not 'str'

>>> base64.urlsafe_b64encode(json.dumps({'overlay': 'jednorožec'}).encode('utf-8'))
b'eyJvdmVybGF5IjogImplZG5vcm9cdTAxN2VlYyJ9'

>>> base64.urlsafe_b64encode(json.dumps({'overlay': 'jednorožec'}).encode('utf-7'))
b'eyJvdmVybGF5IjogImplZG5vcm8rQUZ3LXUwMTdlZWMifQ=='

In Python 2, base64.urlsafe_b64encode() produced a string. In Python 3, it produces bytes. In this new world, this makes sense since the Base64 scheme has no notion of Unicode nor how to encode it. It only knows how to map byte (char) values 0-255 to a smaller subset. Ultimately, it’s up to the developer to set the conversion. The decoding side also needs to know what the bytes represent after decoding Base64.

Here comes the interesting part. See the b’…‘ surrounding the result in Python 3? That was the culprit behind how we incorrectly stored the Base64 representation in the Elasticsearch index.

We stored the printable string representation of the bytes object instead of properly encoding it into a string. That meant we included the b’…‘ as well. This caused a decoding exception later.

>>> base64.urlsafe_b64decode("b'eyJvdmVybGF5IjogImplZG5vcm8rQUZ3LXUwMTdlZWMifQ=='")
b'm\xec\x89\xbd\xd9\x95\xc9\xb1\x85\xe4\x88\xe8\x80\x89\xa9\x95\x91\xb9\xbd\xc9\xbc\xad\x05\x19\xdc\xb5\xd4\xc0\xc4\xdd\x95\x95\x8c\x89\xf4'
>>> json.loads(_)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
..[shortened]..
    s = s.decode(detect_encoding(s), 'surrogatepass')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x85 in position 8: invalid start byte

When our various frontends need to obtain an image of an asset - a movie in this example - they first talk to our Catalogue API. This returns all of the information about a given asset, including associated image URLs served by our Image API. As described above, sometimes a special overlay needs to be stitched to the image in runtime based on certain business rules (“Last Chance XYZ!”, for example). The way this is implemented is by our Catalogue API adding a special GET parameter to the URL of the original image. Our Image API then decodes this parameter and produces the requested image. This time, unfortunately, decoding the parameter threw up HTTP 500 errors caused by the above exception instead.

We were quick to identify the problem, come up with a fix, and deploy it. So far so good - 5 minutes passed and only a handful of 500s were served out.

Let’s deploy one more extra change…

I’m hopelessly optimistic, so I decided to deploy one extra change. It had been in staging for 2 weeks, QA’ed, and it would have saved one extra deploy (great argument right? :-). It was SAFE! You know what I mean :-)

For various reasons, we decided to merge one specific Elasticsearch index with another one - this one extra change was to stop producing the now-unused index. All changes but this one were in the initial production deployment and didn’t cause any issues.

In the last section I mentioned how we pump our SQL data into Elasticsearch, and Elasticsearch then lexes and indexes them. This is the expensive part. To prevent overloading the clusters every 30 minutes with indexing the data, we have a special dedicated init cluster. We pump the data into this cluster, have it index the data, and then we use the Elasticsearch Reindex API to quickly distribute the new data across the clusters in the form of a timestamped index.

We have a special tool to automate this process. Once the data is pumped into the init cluster, the tool looks at what index versions each cluster holds and then syncs the new content. When the sync is done, it moves the latest alias to the synced indexes. The micro-services then reference specific indexes by the latest alias to get the latest data. Moving the alias is atomic and the whole state of the index gets updated from the point of view of the micro-service. If you try to sync the indexes again, you get an error saying that the index already exists and the tool throws up an exception. The intent of this check is to make sure that you know that no new indexes were synced because no new indexes were produced.

Now here comes the fun part. Since we stopped generating the old index with the new CMS deployment, all other indexes had new versions generated but the unused index did not.

This is where you want to remember that the original issue was that little pesky b’…‘. We really really needed to quickly sync our fixed overlays into Elasticsearch.

Feel the time pressure now?

Time Pressure

Source: https://www.pexels.com/photo/man-holding-pocket-watch-in-grayscale-160785/

Well, too bad! The pumping finished, sync tool started, and…it failed. The unused index was already synced and there was no new version - the tool threw up an exception!

Feel your stomach cringing now?

Cool! We need to delete the unused indexes from our production Elasticsearch clusters. Some minutes later, both clusters were cleaned up, we ran the sync tool again, and….boom!

Feel the sweat beads running down your spine?

Wait! What?! Again?! Oh! We forgot to delete the unused index from the init cluster as well! It re-indexed it back to the other clusters. This time we cooked up a quick script that deleted the indexes from all clusters and reindexed again.

YES!

What a relief, all fixed!

Another one bites the dust

After this, the number of 500s returned to our clients started to drop dramatically. We approached our kind Ops team for some help restarting the Varnishes to quickly evict the cached versions of catalogue responses still containing b’…‘. However, after almost an hour, we were still getting some - mostly for Android and tvOS clients. Are they caching the responses longer than Max-Age? What is going on? We started investigating the requests using our venerable Call Tracer, and came to the conclusion that the requests to the evicted versions were in fact artifacts of how the TV clients cache search results.

When a user searches for a movie, the search result and images are downloaded. These can contain overlays, and in this case some were broken. Nevertheless, the user selected a certain movie to play, and, once the playback finished, the result screen was brought back to the main view. Since the images were garbage-collected while off screen, they needed to be re-fetched. The thing is, our apps do not perform the search again, they just reuse the old data. This is why we got requests for the now-evicted versions containing the broken link.

Anyways, we just modified the Image API to handle both the correct and broken Elasticsearch replies. It wasn’t glamorous or perfect, but it did let us rid of the 500s.

Five minutes later the fix was done, QA’ed, and the modified file was hot-patched to production. Application servers restarted….yay no more 500s! Kibana was clear!

Clear Kibana

Source: Showmax

It’s not over yet…

With a feeling bordering on accomplishment, we started investigating where the broken requests were coming from. That’s when we saw a 500 for a perfectly regular image URL. This was a bit odd, given that those should not have been impacted by the overlay bug.

We quickly tried it on a local Docker instance and everything worked fine. Why was this simple request bombing in production?

We logged in to one of the production containers and found that the instance is dead. uWSGI reported that no Python application was found. What? C.R.A.P.! Icinga reports that ALL THE IMAGE services are reporting failed health checks.

WHAT THE F…?!

Scrambling to understand what was going on, I realized that I had made a HUGE mistake. In short, I had offlined all of the image services without us noticing…for 15 minutes. When modifying the Image API to handle both cases, I edited a file called image.py under the resources directory.

Here’s how I hotfixed it to production:

$ vim resources/image.py
$ ls
image.py .... resources/ docker/ ....
$ cat image.py | ssh image-service 'cat > /prefix/image/resources/image.py'
....

This particular micro-service has an entrypoint in a file called image.py and implementation in resources/image.py. I successfully managed to push the wrong file.

Luckily, the fix was quick and easy. I just had to push the correct file this time. Perfect time to call it a day and go get a beer(s)!

Outro

If you’ve made it this far, you’ve seen that sometimes it’s not always the super-complicated problems that cause production failures. Often, it’s about the little things - bad decisions, human error - the simple, day-to-day stuff.

Some teachable moments and relializations:

The stupid little b’…‘: In CMS, we have tons of unit tests, but NOT ONE SINGLE end-to-end test. After this specific incident, we built a roadmap to implement end-to-end tests to make sure that the entire chain of our services works correctly.

Lets deploy one more extra change: Be very conservative during production deploys. Don’t mash changes together, it is hard to untangle the mess later.

Another one bites the dust: Directly modifying code in production (hotfixing) is fast and tempting. But, heed my words: hotfixing will eventually trip you up. Have at least one more person looking over your shoulder when hotfixing as it may very well prevent the same problem we ran into.

If you enjoyed reading about what we do, do not hesitate and send us an email at geeks@showmax.com.

Cover image by Gratisography.

Please check the original version of this article at