This is the first in a series of articles about how Showmax made a radical departure from its video-on-demand roots and took a live streaming platform from scratch to production in less than three months. In particular, we’ll share some of the key issues we struggled with. In this first article, we describe how we had to downgrade our initial plans for full redundancy because things didn’t work out as expected.
Let’s start with some brief background. Showmax started life in mid-2015 as a subscription video on demand (SVOD) service. The reason for the SVOD-only approach was to capitalise on the major shift in viewing habits to binge-watching multiple episodes (and even multiple seasons) in a single sitting. Live streaming was never envisaged as part of the offering.
Fast-forward to 2017, and our business in Poland had won over consumers by offering original Polish content. In fact, this local content was more popular than heavyweight Hollywood shows and a major differentiator for the service. Based on this, the Poland team had the idea to bring a localised version of the world-famous Saturday Night Live to Poland and, after negotiations with owner NBCUniversal, SNL Polska was born.
Well, with one slight wrinkle - as the live part of the SNL name suggests, we suddenly had to shift gears from SVOD-only to having live streaming capability. This was a major technical challenge, and one where failure would be played out in real time to the general public. The task ahead was both exhilarating and terrifying.
The initial idea of a fully redundant live streaming setup that we drew on a whiteboard looked promising. It would bolt on to the existing SVOD architecture, which uses FFmpeg for transcoding video to multiple bitrates and Unified Streaming Platform (USP) for streaming the content across any platform and device (including content protection, e.g. DRM).
We had two single bitrate stream inputs provided by ATM SYSTEM, with the master in Warsaw and the slave in Wroclaw. Each of the streams would be transcoded by two encoders into multiple bitrate streams and ingested to USP origins as per the diagram. This was exactly as recommended by USP’s documentation on redundancy and dual ingest (failover).
This setup would produce two fully redundant output streams: a primary and a backup.
    #EXTM3U
    #EXT-X-STREAM-INF:BANDWIDTH=592000,AVERAGE-BANDWIDTH=538000,CODECS="mp4a.40.2,avc1.4D401F",RESOLUTION=640x360,AUDIO="audio-aacl-98",CLOSED-CAPTIONS=NONE
    https://usp-live01.showmax.com/l/4545a5ba-8c7c-457d-a136-2010140f0442/4545a5ba-8c7c-457d-a136-2010140f0442-audio=98304-video=409000.m3u8
    #EXT-X-STREAM-INF:BANDWIDTH=592000,AVERAGE-BANDWIDTH=538000,CODECS="mp4a.40.2,avc1.4D401F",RESOLUTION=640x360,AUDIO="audio-aacl-98",CLOSED-CAPTIONS=NONE
    https://usp-live02.showmax.com/l/f3f39fe5-a51e-4719-81e0-2bd05414b2fa/f3f39fe5-a51e-4719-81e0-2bd05414b2fa-audio=98304-video=409000.m3u8
    #EXT-X-STREAM-INF:BANDWIDTH=3735000,AVERAGE-BANDWIDTH=3396000,CODECS="mp4a.40.2,avc1.4D401F",RESOLUTION=1280x720,AUDIO="audio-aacl-131",CLOSED-CAPTIONS=NONE
    https://usp-live01.showmax.com/l/4545a5ba-8c7c-457d-a136-2010140f0442/4545a5ba-8c7c-457d-a136-2010140f0442-audio=131072-video=3072000.m3u8
    #EXT-X-STREAM-INF:BANDWIDTH=3735000,AVERAGE-BANDWIDTH=3396000,CODECS="mp4a.40.2,avc1.4D401F",RESOLUTION=1280x720,AUDIO="audio-aacl-131",CLOSED-CAPTIONS=NONE
    https://usp-live02.showmax.com/l/f3f39fe5-a51e-4719-81e0-2bd05414b2fa/f3f39fe5-a51e-4719-81e0-2bd05414b2fa-audio=131072-video=3072000.m3u8
Issue #1: Redundancy and dual ingest setup
The proposed setup did not work as expected. When the camera was aimed at our small improvised stage featuring a desk clock, the clock produced a broken live stream. The second hand jumped backwards and forwards, and playlists contained discontinuities after every chunk. Even though the documentation describes this situation thoroughly, it took us a while to actually understand its implications. It turned out that ingested streams must be time-aligned very precisely because USP origin keeps the chunk that arrives first while the second one to arrive is discarded. In our case, it was ingesting chunks alternately from both encoders, which was the root cause of the trouble. Unfortunately, we were not able to synchronize the FFmpeg encoders correctly. Here is where our fully redundant setup started to fall apart. Each USP now only had a single encoder feeding it data.
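To make the failure mode concrete, here is a toy simulation of the "first chunk to arrive wins" behaviour described above. It is only a sketch under stated assumptions: the chunk "slot" model, the 0.7 s start offset between encoders and the upload delays are all invented for illustration, and none of this is USP's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    encoder: str       # which encoder produced the chunk
    slot: int          # hypothetical chunk slot on the origin's timeline
    media_time: float  # position of the chunk's first frame, in seconds
    arrival: float     # when the chunk reached the origin, in seconds


def encoder_chunks(name, start_offset, upload_delays, chunk_len=2.0):
    """Chunks from one encoder whose content is shifted by start_offset seconds."""
    return [
        Chunk(name, slot, slot * chunk_len + start_offset,
              (slot + 1) * chunk_len + delay)
        for slot, delay in enumerate(upload_delays)
    ]


def first_arrival_wins(chunks):
    """Keep the first chunk to arrive for each slot and discard the later one."""
    kept = {}
    for chunk in sorted(chunks, key=lambda c: c.arrival):
        kept.setdefault(chunk.slot, chunk)
    return [kept[slot] for slot in sorted(kept)]


# Assumed numbers: encoder B started 0.7 s after A, and upload delays vary.
a = encoder_chunks("A", 0.0, [0.10, 0.30, 0.10, 0.30])
b = encoder_chunks("B", 0.7, [0.20, 0.20, 0.20, 0.20])
timeline = first_arrival_wins(a + b)

print([c.encoder for c in timeline])               # ['A', 'B', 'A', 'B']
print([round(c.media_time, 1) for c in timeline])  # [0.0, 2.7, 4.0, 6.7]
```

With aligned encoders the media times would advance in even 2-second steps. Here they advance by 2.7 s, then 1.3 s, then 2.7 s, which is the kind of skipping and repeating we saw on the clock's second hand.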
Issue #2: HLS failover
We still had a chance for failover in the client-side video player by using HLS Redundant Streams. We configured our infrastructure to take the playlists generated by each USP origin and inject the URLs of the streams served by the other USP origins as redundant streams. In essence, we were creating meshed playlists.
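A minimal sketch of such playlist meshing might look like the following. It assumes that both origins emit identical #EXT-X-STREAM-INF lines for matching variants, that variant URIs are absolute, and that no other tags (such as #EXT-X-MEDIA) need to be carried over. The function names and example.com hosts are hypothetical and this is not our production code.

```python
def parse_variants(master_playlist):
    """Return (stream_inf_line, uri) pairs from an HLS master playlist."""
    lines = [ln.strip() for ln in master_playlist.splitlines() if ln.strip()]
    variants = []
    for i, line in enumerate(lines):
        if line.startswith("#EXT-X-STREAM-INF:") and i + 1 < len(lines):
            variants.append((line, lines[i + 1]))
    return variants


def mesh_playlists(primary, backup):
    """Follow each primary variant with the backup variant of the same profile."""
    backup_by_inf = {}
    for inf, uri in parse_variants(backup):
        backup_by_inf.setdefault(inf, uri)

    out = ["#EXTM3U"]
    for inf, uri in parse_variants(primary):
        out += [inf, uri]
        if inf in backup_by_inf:
            # Same attributes, different origin: the redundant entry.
            out += [inf, backup_by_inf[inf]]
    return "\n".join(out) + "\n"


# Hypothetical playlists from two origins.
primary = """#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=592000,RESOLUTION=640x360
https://origin-a.example.com/live/video=409000.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=3735000,RESOLUTION=1280x720
https://origin-a.example.com/live/video=3072000.m3u8
"""
backup = primary.replace("origin-a", "origin-b")
meshed = mesh_playlists(primary, backup)
print(meshed)
```

The result lists each bitrate twice, once per origin, in the same shape as the playlist shown earlier.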
Despite the big expectations, still no luck - the video player on the client side crashed when the primary stream was manually killed. In practice, none of the players we tested (HLS.js, ExoPlayer and AVPlayer) could play without interruption. What we learned is that HLS failover is meant to fall back between CDNs serving completely identical streams. The only consolation was that we found we’re not alone in misunderstanding alternate streams in HLS playlists.
Issue #3: Dry-runs vs. production
In the buildup to the big day of the SNL Polska debut we performed countless dry-runs. Our list of testing scenarios covered the whole pipeline from ingesting on encoders to playback on the client side. We tried to simulate production broadcasting as much as possible to be prepared properly for D-Day.
Fast-forward to the big day: with only a few minutes to go, we got a call from ATM telling us to have only one encoder consume the stream from each source. Apparently, their Teradek Cube 155 was overheating - two connected encoders were simply too much.
During dry-runs, everything always worked like a charm. But there was one key difference - in tests we were using a much newer Teradek Cube 655, which had no issues with multiple simultaneous connections. Which leads to…
Lesson learned: Always test exactly the same setup you’re going to use in production.
After shaving off all non-working parts we ended up with two redundant pipelines of Teradek, FFmpeg and USP origin (as outlined in the diagram below). Although USP and HLS offer some redundancy and failover mechanisms, in practice we were forced to build a custom solution.
We implemented the custom failover in our Playback API, where we added an asynchronous health check of all the redundant live streams. For lack of a better name, we called it the live watcher. A stream that is not updated regularly is considered dead. For that purpose, USP exposes publishing point endpoints such as /statistics and /state (queried in the example below) to retrieve additional information about live streams.
    $ curl -v https://usp-live01.showmax.com/l/4545a5ba-8c7c-457d-a136-2010140f0442/4545a5ba-8c7c-457d-a136-2010140f0442.isml/state
    <?xml version="1.0" encoding="utf-8"?>
    <!-- Created with Unified Streaming Platform(version=1.7.10) -->
    <smil xmlns="http://www.w3.org/2001/SMIL20/Language">
      <head>
        <meta name="updated" content="2015-05-06T01:57:12.313929Z">
        </meta>
        <meta name="state" content="started">
        </meta>
      </head>
    </smil>
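As a rough illustration of how a watcher could interpret this response, the sketch below parses the XML and treats a stream as alive only if its state is "started" and its "updated" timestamp is recent. The 10-second threshold and the function names are assumptions made for the example, not values taken from our production live watcher.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone

SMIL_NS = "{http://www.w3.org/2001/SMIL20/Language}"


def parse_state(xml_text):
    """Extract the meta name/content pairs from a /state response."""
    root = ET.fromstring(xml_text)
    return {m.get("name"): m.get("content") for m in root.iter(SMIL_NS + "meta")}


def is_alive(xml_text, now, max_age=timedelta(seconds=10)):
    """True if the stream is started and was updated within max_age (assumed)."""
    meta = parse_state(xml_text)
    if meta.get("state") != "started" or "updated" not in meta:
        return False
    updated = datetime.strptime(meta["updated"], "%Y-%m-%dT%H:%M:%S.%fZ")
    updated = updated.replace(tzinfo=timezone.utc)
    return now - updated <= max_age


SAMPLE = """<?xml version="1.0" encoding="utf-8"?>
<smil xmlns="http://www.w3.org/2001/SMIL20/Language">
  <head>
    <meta name="updated" content="2015-05-06T01:57:12.313929Z"></meta>
    <meta name="state" content="started"></meta>
  </head>
</smil>"""

shortly_after = datetime(2015, 5, 6, 1, 57, 15, tzinfo=timezone.utc)
print(is_alive(SAMPLE, shortly_after))                         # True
print(is_alive(SAMPLE, shortly_after + timedelta(minutes=1)))  # False
```

In a real watcher, `now` would be the current UTC time and the XML would come from a periodic HTTP request to each origin.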
To increase the precision and reaction speed of Playback’s live watcher, we added another type of live streaming health check to the encoders. To that end, we added FFmpeg’s -progress output.txt parameter to the transcoding command line; it supplies program-friendly progress information.
    frame=1887368
    fps=25.0
    stream_0_1_q=32.0
    stream_1_1_q=30.0
    stream_2_1_q=28.0
    stream_3_1_q=35.0
    bitrate=913.2kbits/s
    total_size=8618218800
    out_time_ms=75496746667
    out_time=20:58:16.746667
    dup_frames=0
    drop_frames=0
    speed=1.0x
    progress=continue
The decision to stop/restart encoding is made by observing the FFmpeg progress output and parsing out the encoding speed, which must not drop below 1.0 (otherwise the encoder falls behind the live input and the stream might start to stutter). For debugging purposes we also logged round trip times between our encoders and the Teradeks.
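A simplified version of that check could look like the sketch below. It parses the key=value reports written by -progress and flags a restart only after several consecutive slow reports. The 0.98 tolerance, the three-report window and the function names are assumptions chosen for illustration, not our production values.

```python
def parse_progress(text):
    """Split -progress output into one dict per report.

    Each report is a run of key=value lines that ends with a progress= line.
    """
    blocks, current = [], {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        if not sep:
            continue
        current[key] = value
        if key == "progress":
            blocks.append(current)
            current = {}
    return blocks


def speed_of(block):
    """Return the encoding speed as a float, or None when it is not usable."""
    raw = block.get("speed", "").strip().rstrip("x")
    try:
        return float(raw)
    except ValueError:  # e.g. "N/A" or a missing key
        return None


def should_restart(blocks, min_speed=0.98, window=3):
    """True when the last `window` reports are all slower than min_speed."""
    recent = [speed_of(b) for b in blocks[-window:]]
    if len(recent) < window:
        return False
    return all(s is not None and s < min_speed for s in recent)


healthy = "frame=100\nspeed=1.0x\nprogress=continue\n" * 3
lagging = "frame=100\nspeed=0.91x\nprogress=continue\n" * 3
print(should_restart(parse_progress(healthy)))  # False
print(should_restart(parse_progress(lagging)))  # True
```

Requiring several slow reports in a row avoids restarting the encoder on a single momentary dip.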
A player starting playback always gets a healthy live stream playlist from the Playback API. However, if the stream dies during playback, the player’s retry logic asks the customer to reload the video. The reload fetches a working playlist and playback resumes.
The dead stream is still being monitored and if it springs back to life, it will be put back into rotation. This in essence forms a simple self-healing logic.
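The selection and self-healing logic can be sketched in a few lines. The class below is a hypothetical illustration, not the Playback API's real code: it assumes streams are kept in preference order, that a separate health check reports each stream's status, and that the first healthy stream is the one handed out.

```python
class LiveWatcher:
    """Tracks stream health and hands out the first healthy stream."""

    def __init__(self, stream_urls):
        # Insertion order doubles as preference order (primary first).
        self._healthy = {url: True for url in stream_urls}

    def report(self, url, healthy):
        """Record the result of the latest health check for one stream."""
        if url not in self._healthy:
            raise KeyError(f"unknown stream: {url}")
        self._healthy[url] = healthy

    def pick(self):
        """Return the preferred healthy stream, or None if all are dead."""
        for url, healthy in self._healthy.items():
            if healthy:
                return url
        return None


# Hypothetical stream URLs.
PRIMARY = "https://origin-a.example.com/live.m3u8"
BACKUP = "https://origin-b.example.com/live.m3u8"

watcher = LiveWatcher([PRIMARY, BACKUP])
watcher.report(PRIMARY, False)
print(watcher.pick())  # backup is served while the primary is dead
watcher.report(PRIMARY, True)
print(watcher.pick())  # primary is back in rotation
```

Because dead streams keep being checked, no manual step is needed to bring a recovered stream back.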
Next we will talk about why we sometimes need to kill FFmpeg with SIGKILL.