Divide & Encode: How to Encode Videos Blazingly Fast

Part I - Need For Speed

Here at Showmax, we’re addicted to speed and performance. In this post, I’ll go through a few techniques we used to improve the speed of our encoding pipeline, maximize the computing power of our encoders, and shorten our encoding time.

When all was said and done, we had reduced encoding times from several hours to several minutes. That means a drastically reduced delay between receiving content from providers and publishing it on our platform - so our customers can rejoice in all of this great content sooner.

I’ll begin by introducing our legacy encoding pipeline with all of its pros and cons. Then, I’ll get into our motivation for changing it: our idea to use a parallel encoding pipeline.

Once Upon a Time…

In 2015, when Showmax was still in its very early stages, our encoding pipeline was built as simply as possible, following the KISS principle, so that it would process content reliably. The Showmax encoder supports several pipelines dedicated to different purposes, like preparing mezzanine files, encoding video, and more.

Mezzanine assets - digital assets created in an intermediate step, especially in the video and broadcast industry. –Wikipedia

In fact, these encoding pipelines are just a set of Python scripts serializing the individual processing steps shown in Image 1. Technically, those steps either send HTTP requests to various internal and external API endpoints, or call UNIX utilities like FFmpeg and rsync.
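To make the serialization concrete, here is a minimal sketch of such a script. The step names and return values are purely illustrative (the real steps issue HTTP requests and call FFmpeg/rsync via subprocess), but the structure - each step running only after the previous one finishes - is the point.

```python
# Illustrative serial pipeline; step names and payloads are made up.
def prepare_mezzanine(asset):
    # In the real pipeline: HTTP calls to internal APIs plus
    # UNIX utilities like ffmpeg/rsync invoked via subprocess.
    return f"{asset}.mezzanine"

def encode(mezzanine):
    return f"{mezzanine}.mp4"

def upload(mp4):
    return f"uploaded:{mp4}"

STEPS = [prepare_mezzanine, encode, upload]

def run_pipeline(asset):
    """Serialize the steps: each one starts only after the previous finishes."""
    result = asset
    for step in STEPS:
        result = step(result)
    return result
```

This serialization is exactly what makes the legacy pipeline simple and reliable - and also what makes it slow.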


Image 1. Steps of encoding pipeline.

NOTE: The actual encoding is done by FFmpeg - the Swiss army knife of encoding tools. It has many (many) options - more than most users/developers would ever need to use. As my colleague Jiri Brunclik puts it: “FFmpeg can do anything, even make coffee. But there are only two people in the world who know the right parameters. :-) “

Between 2015 and 2017, our entire pipeline was running on a bunch of physical machines, with one instance per machine. Each instance was only capable of processing one video at a time. We used to have 140 EX40 machines hosted at Hetzner; later, we moved to using 50 PX61 machines.

In order to save storage space, the encoders produce only MP4 files, which are later encrypted (Widevine Modular, PlayReady, and HLS FairPlay) and packaged into streams (DASH, SmoothStreaming, or HLS) with Unified Streaming Platform. This is done on-the-fly, and the packaged files are then cached on the CDN edge servers.

Simplicity is always great, but it has its downsides. Serial processing is slow … sometimes very slow. Encoding time depends on the length of the video and the complexity of image compression (animation is usually easier to compress than live-action video). In our case, the encoding process usually took 3-4x the length of the video.

Generally, the slow encoding pipeline was not really a big deal. You know, if it ain’t broke, don’t fix it. But there are situations when encoding time really does matter:

  • Publishing high priority content:
    Sometimes we acquire content shortly after its premiere, and it needs to be published as soon as possible. The time gap between ingesting the video into our system and the expected publish time is limited. Tasks in the queue can be reordered by prioritization, but even a prioritized task must still wait for a free machine if the queue is full. For extra-urgent content, we had dedicated queues with more powerful machines. However, both solutions are suboptimal.

  • Maximizing utilization of the encoder farm:
    When the encoding queue is full (its size is greater than the number of encoders in the farm), every encoder is processing one video and computing power is being used to the max. But when the encoding queue is not full, some of the encoders just sit idle. Our encoding farm of 140 machines used to have only 12% CPU utilization on average.

  • Inexpensive scaling:
    More expensive machines could improve encoding speed, but the price can be too high in comparison to the relatively small gain (e.g., in our encoding performance tests, the PX61 was comparable to the twice-as-expensive PX121). This approach is called vertical scaling, and it just means adding more power (CPU, RAM) to an existing machine.

  • Incorporating next-generation codecs:
    Codecs like H.265/HEVC, VP9, and AV1 have a great impact on quality, as they can compress video at a lower bitrate while keeping the same quality as H.264. But their encoding speed is terribly slow (though it’s improving). VP9 encoding is still ~20x slower than H.264, which would increase our encoding time from hours to days.

Divide & Encode

We concluded that vertical scaling was not the way we wanted to go, mainly because of the small improvement relative to cost. We experimented with encoding video to all variants at once with one FFmpeg instance, and it yielded a performance boost of about 22% with minimal effort in pipeline rewriting (since the input video is decoded only once). That’s a good start.
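The all-variants-at-once trick works because FFmpeg accepts one input and many outputs, each with its own mapping, filter, and codec options. The sketch below builds such a command line in Python; the resolutions and bitrates are illustrative, not our actual encoding ladder.

```python
def build_multi_variant_cmd(src, variants):
    """Build one ffmpeg command that decodes the input once and
    encodes every variant in a single run (one -i, many outputs)."""
    cmd = ["ffmpeg", "-y", "-i", src]
    for height, bitrate in variants:
        # Per-output options: each output gets its own map/filter/codec.
        cmd += [
            "-map", "0:v", "-map", "0:a?",
            "-vf", f"scale=-2:{height}",
            "-c:v", "libx264", "-b:v", bitrate,
            "-c:a", "aac",
            f"out_{height}p.mp4",
        ]
    return cmd

# Illustrative ladder only; the real one has 16 variants.
VARIANTS = [(1080, "5000k"), (720, "3000k"), (360, "800k")]
```

Because the (expensive) decode happens once and feeds all the scalers, the total work is less than running one FFmpeg process per variant.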

We decided to focus more on horizontal scaling, engaging more machines to process one pipeline. If every variant were processed by one encoder, it would consume 16 machines (the number of variants we produce). But that’s not sufficient for us; our farm currently contains 50 PX61 machines. By splitting the video into smaller parts (segments), we can use more (or all) of the machines for encoding and utilize our encoding farm to its full potential.
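Splitting and rejoining can both be done with stream copy, so neither step re-encodes anything. A sketch of the two command builders follows; the 60-second chunk length and file names are assumptions for illustration, not our production values.

```python
SEGMENT_SECONDS = 60  # illustrative chunk length

def build_split_cmd(src, seconds=SEGMENT_SECONDS):
    """Stream-copy the source into fixed-length chunks with ffmpeg's
    segment muxer; no re-encoding, so the split itself is cheap."""
    return ["ffmpeg", "-i", src, "-c", "copy",
            "-f", "segment", "-segment_time", str(seconds),
            "-reset_timestamps", "1", "chunk_%05d.mp4"]

def build_merge_cmd(list_file, dst):
    """Concatenate the independently encoded chunks back together with
    ffmpeg's concat demuxer; list_file holds one `file 'chunk.mp4'` per line."""
    return ["ffmpeg", "-f", "concat", "-safe", "0",
            "-i", list_file, "-c", "copy", dst]
```

One practical note: with `-c copy`, the segment muxer can only cut at keyframes, so actual chunk lengths vary slightly around the requested duration.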

Caveat: this is not originally our idea; we simply took an existing approach and applied it to our ecosystem. Still, we’d like to share some of our design details and the struggles we had.

The proposed parallel pipeline can be displayed as the directed acyclic graph (DAG) shown in Image 2 below. The nodes represent parts of the pipeline (AKA “jobs”) that can be processed independently on separate machines, and the edges define dependencies between nodes (i.e., the order in which jobs are processed).

From the top of the image: the video is split into chunks, and the processing of the chunks is distributed across the encoding farm (note that the number of chunks can vary based on the length of the video). After that, the chunks are merged into the final MP4 file and uploaded to the storage servers. Then, the content is registered with our DRM providers. The last/bottom node of the DAG is just the internal garbage collector, since the results of individual jobs are stored in a shared file system.

Moreover, every job can define requested features specifying what capabilities a machine must have to acquire the job. Based on that, we are able to have machines with different hardware and power, and control which jobs they should process.
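The dependency structure above is easy to model with a topological sorter: a job becomes ready only once all of its predecessors are done, and all encode jobs become ready at once as soon as the split finishes. A minimal sketch (with a made-up 3-chunk graph, using Python 3.9+’s standard `graphlib`, not our actual scheduler):

```python
from graphlib import TopologicalSorter

# Simplified job graph for one video with 3 chunks; names are illustrative.
JOBS = {
    "split": set(),
    "encode_chunk_0": {"split"},
    "encode_chunk_1": {"split"},
    "encode_chunk_2": {"split"},
    "merge": {"encode_chunk_0", "encode_chunk_1", "encode_chunk_2"},
    "upload": {"merge"},
    "register_drm": {"upload"},
    "garbage_collect": {"register_drm"},
}

ts = TopologicalSorter(JOBS)
ts.prepare()
ready = list(ts.get_ready())   # only "split" has no pending dependencies

ts.done("split")
parallel = sorted(ts.get_ready())  # all encode jobs unlock at once
```

The burst of simultaneously ready encode jobs is exactly what lets the farm chew through one video on many machines at the same time.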


Image 2. Graph representation of parallel pipeline.

All of the issues I mentioned before were solved by using the parallel approach:

  • Publishing high priority content:
    Prioritization is still valid, but much improved in this case, as the individual jobs are shorter from start to finish. Jobs submitted later with higher priority can start being processed sooner, even in a full queue. As a consequence, we don’t need the dedicated machines anymore.

  • Maximizing utilization of the encoder farm:
    The parallel encoding pipeline is designed to minimize idle compute, which improves processing speed. By splitting work among more machines, we always (try to) use the maximum number of free machines.

  • Inexpensive scaling:
    As we designed a fully-scalable parallel encoding pipeline, it can run in the cloud or directly on a developer’s notebook. We can easily adjust the number and performance of machines to minimize cost and maximize power, and fine-tune further by defining requested features.

  • Incorporating next-generation codecs:
    VP9 encoding is still ~20x slower than H.264. However, with parallelization, it takes hours, not days, and can be improved by scaling out (adding more machines to the farm). In addition, we can distribute the encoding of chunk variants to more machines, or change the chunk size, since our parallel pipeline is fully scalable horizontally.

Speed

The naive expectation was that the speed gain would be a linear function of the number of machines: two machines twice as fast, three machines three times faster, and so on. But parallel processing requires sharing files between machines, as well as splitting the video into chunks and merging them back. All of those parts are necessary overhead, and each stage of the pipeline can utilize a different number of encoders. So what was the real speed gain?


Image 3. Encoding time of testing video by different pipeline configurations (lower is better).

The graph in Image 3 shows the performance statistics of our different configurations. Values are expressed relative to the original video length (as multiples of it). Encoding all variants at once helps minimize overhead, which is why the parallel pipeline running on a single encoder can outperform the legacy serial pipeline (1st and 2nd rows in the graph).

Increasing the encoder count does boost performance, almost linearly, up to a point. But it depends on the number of jobs generated for one video to process in parallel (2nd-4th rows), as well as on the overall number of jobs generated for more videos in the queue (5th row). For a queue of 5 videos, the relative average encoding time of one video is 0.46x / 5 = 0.09x. All of these are test cases under ideal conditions.
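This "almost linear, up to a point" behavior is what an Amdahl-style model predicts: the split/merge/transfer overhead stays roughly constant while only the encoding work divides across machines. The parameters below are illustrative, not our measured numbers.

```python
def encode_time(n_encoders, serial=0.5, parallel=3.5):
    """Amdahl-style model of chunked encoding, in multiples of video
    length: the serial part (split, merge, file transfer) is constant,
    the parallel part divides across encoders. Illustrative values only."""
    return serial + parallel / n_encoders
```

With these numbers, one encoder takes 4.0x the video length and seven encoders take 1.0x, but doubling from 7 to 14 only gets you to 0.75x: each added machine buys less, and the serial overhead sets the floor.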

After a month in production, the running numbers are as follows: pipelines of the “encode” type represent approx. 31% of all encoding pipelines, but consume more than 71% of overall computing time on average. After parallelizing those pipelines in November 2018, we managed to lower their share of overall computing time to 33%, and the relative average encoding time from ~3.9x to 1x the video length. The parallelization numbers also include time spent waiting for a free worker after a part of the pipeline was already done, so they are actually even lower than presented here.


Image 4. In November 2018, we switched to parallel processing of the “encode” pipeline.

Next Up

In the next piece in this series, we’ll give you a look under the hood, share some practical tips, and of course go through some issues we’ve had so far.

Please check the original version of this article at