In the first part of this series, we explained how Showmax got into the live streaming business and the problems we faced while developing our redundant setup. Now, the fun stuff - here’s a little dive into why our live encoding checker was sometimes forced to kill FFmpeg with a SIGKILL signal.
Use the force
As we mentioned, our encoders have a component that checks the state of a live stream, identifying the transcoding process as broken in these cases:
- Network issues (high latency, network connection interruption, etc.)
- Input stream corrupted (e.g. remote source overheats)
- Encoder resource issues (encoder exhausts its RAM, CPU load gets too high, etc.)
All of these issues have one thing in common: encoding speed falling below 1.0x. This is why we use the speed of live encoding as the main decision metric. Interestingly, to fix a broken live stream, the live checker relies on a simple rule of thumb: try turning it off and on again.
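The speed-based decision can be sketched roughly as follows. This is a minimal illustration, not the actual checker: it assumes the speed is read from FFmpeg's status output, which carries a field such as speed=0.98x. The function names and the averaging rule are hypothetical.

```python
import re

# FFmpeg's status line contains a field like "speed=0.98x"
# (or "speed=N/A" before any speed can be reported).
SPEED_RE = re.compile(r"speed=\s*([0-9]*\.?[0-9]+)x")


def parse_speed(status_line):
    """Return the encoding speed from one status line, or None if absent."""
    match = SPEED_RE.search(status_line)
    return float(match.group(1)) if match else None


def is_stream_broken(status_lines, threshold=1.0):
    """Hypothetical rule: broken if the average recent speed is below threshold."""
    speeds = [s for s in map(parse_speed, status_lines) if s is not None]
    if not speeds:
        return False  # no samples yet; do not restart on missing data
    return sum(speeds) / len(speeds) < threshold
```

Averaging over a few recent samples, rather than reacting to a single reading, is one way to avoid restarting the encoder on a momentary dip.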
This mechanism looked promising! We simulated a failure scenario and expected the players to recover. Unfortunately, this wasn’t always the case. Sometimes the players stopped and never restarted, mainly because the USP received an End-Of-Stream (EOS) sequence from FFmpeg when it stopped.
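The EOS sequence consists of the following 8 bytes, called an empty MFRA (Movie Fragment Random Access) Box, signaling that there is no further content. The first four bytes are the box size (8, so the header only), and the last four spell "mfra" in ASCII: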
00 00 00 08 6d 66 72 61
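The byte layout can be checked with a short sketch. The helper name ends_with_eos is hypothetical. The box structure itself (a 4-byte big-endian size followed by a 4-byte type) is standard for ISO base media files.

```python
import struct

# A box starts with a 4-byte big-endian size followed by a 4-byte type.
# A size of 8 means the box is a header with no payload.
EMPTY_MFRA = struct.pack(">I4s", 8, b"mfra")


def ends_with_eos(data):
    """Return True if the byte string ends with the empty MFRA box."""
    return data.endswith(EMPTY_MFRA)
```

Here EMPTY_MFRA equals the eight bytes shown above, so ends_with_eos can be applied to the tail of any captured encoder output.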
In our self-healing scenario, we did not want the stream to be marked as “finished.” We only wanted to restart the encoder and keep the players waiting until it came back up. Even though this would create a discontinuity in the stream, the stream would eventually resume.
As it turned out, this required a somewhat heavy-handed approach. Instead of gently quitting the encoder when it deemed the stream broken, the live checker sent a SIGKILL signal. Because SIGKILL cannot be caught or handled, FFmpeg never got the chance to send the EOS sequence. Although the USP was not getting any data, it did not think the live stream was finished, and everything worked as expected. Sometimes, the easiest solution is the best solution.
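A minimal sketch of the forced kill, assuming the encoder runs as a child process on a POSIX system. The function name force_kill is hypothetical, and a sleeping Python process stands in for FFmpeg.

```python
import signal
import subprocess
import sys


def force_kill(proc, timeout=5.0):
    """Send SIGKILL to the process and return its exit status.

    SIGKILL cannot be caught, so the process gets no chance to run
    shutdown code (for FFmpeg, that would include writing the trailer).
    """
    proc.send_signal(signal.SIGKILL)  # POSIX only
    return proc.wait(timeout=timeout)


if __name__ == "__main__":
    # Stand-in for the encoder: a child that would otherwise keep running.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    status = force_kill(child)
    # On POSIX, a negative return code means "terminated by that signal".
    print(status == -signal.SIGKILL)
```

A gentler signal such as SIGTERM or SIGINT would let the process shut down cleanly, which is exactly what has to be avoided here.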
Why so hacky?
We are pretty sure that you think that’s quite hacky. First, you’re right. But…
…When we held a meeting about writing this post, we went back through the FFmpeg documentation to gather some source materials. Then we searched the source code on GitHub and found something familiar: an undocumented flag called skip_trailer ("Skip writing the mfra/tfra/mfro trailer for fragmented files"). A quick proof of concept gave us a glimmer of hope.
$ ffmpeg -v 0 -i Sintel.2010.1080p.mkv -movflags isml+frag_keyframe -f ismv - | tail -c 8 | hexdump
00 00 00 08 6d 66 72 61
$ ffmpeg -v 0 -i Sintel.2010.1080p.mkv -movflags isml+frag_keyframe+skip_trailer -f ismv - | tail -c 8 | hexdump
00 00 00 00 00 00 de a2
We thought we had found a way to replace SIGKILL. We prepared changes for the encoder, adding +skip_trailer to the FFmpeg command line so that it no longer sent an EOS signal. We also added code to the live-encoding pipeline that posts an empty MFRA box in FFmpeg's place once we know the stream is really done.
The first tests showed that some of the streams did not stop even though the USP returned the HTTP response 200 OK on ingesting the EOS bytes. We tried checking the states of the streams via /statistics at the USP’s endpoints and retried posting EOS if needed. But some of the streams still magically turned back from the stopped state to started. At the moment we are still not able to say why, but we’re eager to change that.
We have some unconfirmed theories: it may have something to do with the USP’s dual ingest (the EOS is sent from a new connection), or with the expiration time of the last-ingested stream data (found in the SQLite database where the USP stores stream metadata). Sending the EOS bytes after some delay seems to work as desired.
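The delayed send and the retry on state check could be combined along these lines. This is a sketch under assumptions rather than the pipeline's real code: post_eos and get_state are hypothetical callables standing in for the HTTP POST to the USP and the /statistics lookup, and the delay and retry count are placeholders.

```python
import time

# The empty MFRA box that signals End-Of-Stream.
EMPTY_MFRA = bytes.fromhex("000000086d667261")


def stop_stream(post_eos, get_state, delay=0.0, retries=3, sleep=time.sleep):
    """Post the empty MFRA box after a delay, retrying until the stream stops.

    post_eos(payload) and get_state() are hypothetical callables wrapping
    the HTTP POST and the state lookup. Returns the number of attempts
    used, or None if the stream never reached the "stopped" state.
    """
    sleep(delay)  # wait before sending the first EOS
    for attempt in range(1, retries + 1):
        post_eos(EMPTY_MFRA)
        if get_state() == "stopped":
            return attempt
        sleep(delay)  # back off before the next attempt
    return None
```

Passing the two operations in as callables keeps the retry logic testable without a running USP. It does not explain why a stream flips back to started. It only retries when that happens.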
We know we didn’t find everything we needed to know in the documentation. We need to spend more time browsing the source code and reverse engineering to understand every nuance of this complicated process.
We have applied the same try-and-see approach for our next technical challenge - trimming the live stream. You can read about it in our next piece in this series.