Divide & Encode: How to Encode Videos Blazingly Fast

Part II - Under The Hood

This is the second (and final) part of our blog mini-series about boosting encoding speed. In the first part we wrote about how we managed to maximize the power of our encoding farm via parallelization of the encoding pipeline. Here, we will give you a look under the hood of our implementation, and the issues we encountered along the way.

The implementation has two main parts. First, we took the original pipeline in the encoder service and split it into small, logical units. This keeps the processing pipeline simple and reliable, and you will appreciate it when things go wrong: bugs strike, and you must debug a lot very quickly. The second part, the heavy lifting of correctly defining the parallel pipelines and scheduling the respective jobs, is left to the API. It takes dependencies between jobs, priorities, and the features of both jobs and workers into account when assigning jobs to the individual encoder machines. To share files between workers, we use GlusterFS.

Image 1. Architecture of parallel encoder.

Our implementation allows us to run on a physical machine (e.g. a server or developer’s laptop), or in the cloud (e.g. a Docker container). It is designed generically so that even our Analytics Team can use it for their deep learning computations on movies and series. The scene-detection algorithm uses multiple neural networks to detect what is happening in the video, and runs on dedicated GPU-equipped machines, but, from the global perspective, it behaves like any other encoder task.

Under the hood

Initially, we were looking for some existing scheduling and enqueueing tools, like Celery or Apache Airflow. tl;dr, we probably suffer from not-invented-here (NIH) syndrome a little bit. We considered fitting our demands to existing schedulers harder than building a custom solution, so we implemented our own simple scheduler which is incorporated into our Content Management System (CMS). Before the parallel encoding pipeline, our CMS already provided our Content Team with basic tools for video editing, cropping, trimming, and setting up external audio tracks.

We designed the scheduler to support the following features:

  • Simple notation of the pipeline: The pipeline is defined programmatically and can change dynamically, based on video properties, for each individual encoding task. See the code snippet below. All classes ending with Recipe hold the configuration of jobs, and the methods root_node, spread_children_node, and collect_children_node generate the nodes and the edges between them.

    def generate(self, **kwargs):
        r"""Generate parallel encoding pipeline.
    
        Brief diagram of pipeline:
             split_segment
            /   || \   \
        images [*chunks] [*audios]
         |    \|/     //
         |  [*merge_store]
         |      \|/
         | [*drm_store]
         |  |//
         clean
        """
        # Setup split segment job.
        ...
        split_segment_recipe = SplitSegmentRecipe(**split_segment_opts)
        split_segment_node = self.root_node(recipe=split_segment_recipe)
        # Setup audio encode jobs.
        ...
        # Setup video chunk encode jobs.
        ...
        for video_stream_opts_group in encode_video_chunk_groups:
           for i in range(chunk_range_start, chunk_range_end + 1):
               ...
               encode_chunk_recipe = EncodeVideoChunkRecipe(
                   **encode_video_opts
               )
               encode_chunk_recipes.append(encode_chunk_recipe)
    
        encode_chunk_nodes = self.spread_children_node(
           parent=split_segment_node, recipes=encode_chunk_recipes)
    
        # Setup image encode job.
        ...
        # Setup merge encode jobs.
        for streams_opts in merge_mux_stream_groups:
           ...
           merge_mux_recipe = MergeMuxStoreRecipe(
               **merge_mux_opts_group
           )
           merge_mux_node = self.collect_children_node(
               parents=merge_mux_recipe.filter_nodes(
                   encode_audio_nodes + encode_chunk_nodes),
               recipe=merge_mux_recipe)
           merge_mux_nodes.append(merge_mux_node)
    
        # Setup DRM jobs.
        ...
    
        # Setup clean job.
        ...
        clean_recipe = CleanRecipe(
           **clean_opts
        )
        self.collect_children_node(
           parents=drm_store_nodes + [image_node], recipe=clean_recipe)
    
    
  • Advanced scheduling: The scheduler respects queue priorities and other job-specific requirements (e.g. the job requires a GPU) when a job is acquired by a concrete machine. In other words, we can target a specific job at a specific set of workers.

  • Simple management control: The management tool should be able to disable and enable workers, monitor the progress of job processing (via heartbeats), cancel processing prematurely if some job of the pipeline fails (again via heartbeats), or recover/restart from the job where the failure occurred. Of course, this expects that the worker knows how to accept and process the responses from the API (e.g. heartbeats) properly.
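To make the scheduling requirements above more concrete, here is a minimal sketch of how a worker could acquire a job. It is not our production scheduler: the `Job` class, the `acquire_job` function, and the "higher number means higher priority" convention are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical job record; the field names are illustrative assumptions."""
    name: str
    priority: int = 0                  # assumption: higher value is served first
    requires: frozenset = frozenset()  # features a worker must offer, e.g. {"gpu"}
    depends_on: tuple = ()             # names of jobs that must be done first
    state: str = "pending"             # pending -> running -> done / failed


def acquire_job(jobs, worker_features):
    """Return the best runnable job for a worker, or None if nothing fits."""
    by_name = {job.name: job for job in jobs}
    runnable = [
        job for job in jobs
        if job.state == "pending"
        # The worker must offer every feature the job requires.
        and job.requires <= worker_features
        # Every parent job in the pipeline must already be finished.
        and all(by_name[dep].state == "done" for dep in job.depends_on)
    ]
    if not runnable:
        return None
    best = max(runnable, key=lambda job: job.priority)
    best.state = "running"
    return best
```

With this shape, a GPU-only job such as scene detection is simply a job whose `requires` set contains `"gpu"`, so a worker without that feature never receives it, while the dependency check keeps chunk encoding from starting before the split job is done.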

Performance debugging

Our CMS is built on Django, so it was easy to customize the admin views. We created two interactive widgets. The first one is for displaying relationships between different jobs of the encoding task in the form of a directed acyclic graph (DAG), shown in Image 2. It is very useful when checking the pipeline definition.

Image 2. Interactive display of encoding pipeline.

The second one, shown in Image 3, is a Gantt chart (again interactive). This helps us debug and improve the performance of the encoding pipeline.

Image 3. Processing of the jobs of one encoding task, with a gap when all encoders were busy.

Logging

At Showmax, we log everything. We have a fully-fledged logging infrastructure built on the ELK stack. Your logging process should be defined so you can monitor your software as easily as possible. It’s easy for us to monitor the processing of an encoding task across multiple encoders - and in real time - in Kibana (see Image 4).

Image 4. Logs in Kibana.

Watch for state

In order to observe the actual state of encoding remotely, or by some external utility, FFmpeg provides an option: -progress.

$ ffmpeg -progress progress.log -i ...
-progress url (global)
	Send program-friendly progress information to url.

Progress information is written approximately every second, and at the end of the encoding process. It is made of “key=value” lines made up of alphanumeric characters. The last key of a sequence of progress information is always “progress”. (FFmpeg)

Then, without affecting the encoder application, you can easily watch encoding speed and progress like this:

$ tail -f progress.log | egrep "(speed|out_time)="
out_time=00:00:01.706667
speed= 3.4x
out_time=00:00:03.221333
speed=3.18x
out_time=00:00:04.458667
speed=2.95x
out_time=00:00:05.162667
speed=2.54x

I personally use it often during both development and testing, and when investigating issues in production.
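If you want to consume the progress file from a script instead of a terminal, the “key=value” format is easy to parse. The following is a minimal sketch, not part of our encoder; the function name `parse_progress` is an assumption. It relies only on the documented fact that every report ends with the `progress` key.

```python
def parse_progress(lines):
    """Yield one dict per progress report from `ffmpeg -progress` output."""
    report = {}
    for line in lines:
        key, sep, value = line.strip().partition("=")
        if not sep:
            continue  # skip blank or malformed lines
        report[key] = value.strip()  # some values are padded, e.g. "speed= 3.4x"
        if key == "progress":  # the last key of every report
            yield report
            report = {}
```

Feeding it the lines of progress.log yields one dictionary per report, so a monitoring tool can read `report["speed"]` or `report["out_time"]` without touching the encoder process.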

New software, new bugs, and new fails

Unfortunately, not everything went perfectly. We hit two major issues, both related to processing segmented video. Let’s briefly recap: in the first step, the video is split into chunks whose length is a multiple of the group of pictures (GoP) length of the mezzanine file; each chunk is encoded by one worker; and, after processing, all of the chunks are merged into the final video.

Cutting issue

As a basic example, assume that the GoP of a mezzanine file is one second long, and that the chunks encoded by one encoder are 10 seconds long. In Image 5, you can see the encoding without trimming, which produces the video correctly with a regular GoP.

Image 5. Without trimming.

In Image 6, trimming is set up. Only the first and the last chunks are encoded as trimmed; the middle chunks are encoded in full length as before. The video has an irregular GoP because the start cut does not fall on a key frame.

Image 6. Trimmed video with irregular GoP.

frame|key_frame=1|pkt_pts_time=5.000000
frame|key_frame=1|pkt_pts_time=6.000000
frame|key_frame=1|pkt_pts_time=7.600000     <---- irregularity
frame|key_frame=1|pkt_pts_time=8.600000
frame|key_frame=1|pkt_pts_time=9.600000
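A check like this is easy to automate. The sketch below assumes lines in the format shown above (without the arrow annotation) and reports every pair of neighbouring key frames whose distance differs from the expected GoP length. The function names and the tolerance value are assumptions for illustration.

```python
from fractions import Fraction


def key_frame_times(lines):
    """Extract key-frame timestamps from 'frame|key=value|...' lines."""
    times = []
    for line in lines:
        fields = dict(
            part.split("=", 1) for part in line.strip().split("|") if "=" in part
        )
        if fields.get("key_frame") == "1" and "pkt_pts_time" in fields:
            times.append(Fraction(fields["pkt_pts_time"]))
    return times


def irregular_gaps(times, gop_seconds, tolerance=Fraction(1, 1000)):
    """Return (previous, current) key-frame pairs that break the GoP length."""
    gop = Fraction(gop_seconds)
    return [
        (prev, cur)
        for prev, cur in zip(times, times[1:])
        if abs(cur - prev - gop) > tolerance  # tolerance is an assumed value
    ]
```

For the five lines above and a one-second GoP, the only pair reported is the jump from 6.0 to 7.6 seconds, which is exactly the irregular GoP.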

To prevent that, all encoders must produce 10-second chunks (except the last one, of course), as shown in Image 7. Therefore, to encode one chunk, the encoder has to fetch more segments and cut exactly the 10-second clip it needs out of them. In other words, the encoders process overlapping segments and make a precise cut while encoding. To reduce the overlap and the amount of video transferred, we split the video into smaller parts (one second, to keep the example brief) at the splitting step.

In practice, we split into 10-second segments, which are encoded into chunks of at least one minute, with a maximum of 50 chunks per video, as we currently have 50 workers.

Image 7. Trimmed with regular GoP using overlapped segments.
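The arithmetic behind the overlapping segments can be sketched as follows. This is an illustration under simplifying assumptions, not our actual code: segments are numbered from zero, all have the same length, and `plan_chunks` with its returned keys is a hypothetical helper.

```python
import math
from fractions import Fraction


def plan_chunks(trim_start, trim_end, chunk_len, segment_len):
    """Map each output chunk to the source segments it has to read."""
    trim_start, trim_end = Fraction(trim_start), Fraction(trim_end)
    chunk_len, segment_len = Fraction(chunk_len), Fraction(segment_len)
    if chunk_len <= 0 or segment_len <= 0:
        raise ValueError("chunk_len and segment_len must be positive")
    plans = []
    start = trim_start
    while start < trim_end:
        end = min(start + chunk_len, trim_end)  # the last chunk may be shorter
        first = math.floor(start / segment_len)  # segment containing the start
        last = math.ceil(end / segment_len) - 1  # segment containing the end
        plans.append({
            "segments": range(first, last + 1),
            "seek": start - first * segment_len,  # offset into the first segment
            "duration": end - start,
        })
        start = end
    return plans
```

For a video trimmed to 1.6 to 25 seconds, with 10-second chunks and one-second segments, the first chunk reads segments 1 to 11 and the second reads segments 11 to 21. Segment 11 is fetched by both encoders, which is the overlap shown in Image 7, and each encoder seeks 0.6 seconds into its first segment before cutting.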

But wait, there’s more. Start/end cuts must be aligned on frames, otherwise FFmpeg aligns them auto-magically in order to keep the desired length of the chunk, and the frame count then does not have to correspond to the duration. After concatenating to the final MP4 file, the duration can actually be shorter.
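One way to keep cuts on frame boundaries is to round every cut point to a whole frame before handing it to the encoder. The sketch below shows the idea with exact rational arithmetic; `snap_to_frame` is a hypothetical helper, and it assumes a constant frame rate.

```python
from fractions import Fraction


def snap_to_frame(seconds, fps):
    """Round a cut point to the nearest frame boundary.

    Returns the frame index and the exact timestamp of that frame.
    Assumes a constant frame rate; fps may be an int or a string
    such as "30000/1001".
    """
    fps = Fraction(fps)
    frame_index = round(Fraction(seconds) * fps)
    return frame_index, frame_index / fps
```

For example, at 25 frames per second a requested cut at 7.61 seconds snaps to frame 190, which is exactly 7.6 seconds. Using fractions instead of floats avoids accumulating rounding errors with rates such as 30000/1001.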

A short disclaimer about our solution and Showmax mezzanine files: We could start creating segments from the nearest I-frame to the left of the start cut, which would reduce the overlap of segments to one second (the frequency of I-frames in our mezzanine file). However, we plan to move the segmentation step to the phase where the mezzanine file is created (before our parallel encoding pipeline). We must also admit that there are other ways to specify a mezzanine file, such as using the DNxHR codec or encoding with I-frames only, that would mitigate (or completely avoid) the cutting issue. However, years ago, we made a compromise and tried to find the balance between the size and flexibility of a mezzanine file. The best option for us is H.264 (high profile and level 5.2), with a bitrate of 16 Mbps and a regular closed GoP. It gives us higher transfer speed and more storage capacity for mezzanine files, with minimal quality loss.

Merging issue

This issue was a bit of a mystery. During the QA process, we found that some Nexus TV devices, and older Android devices with Widevine DRM, failed to play…sometimes. The Widevine DRM error message said:

W/MDRMOemCrypto: duplicate SPS NALU (skipping)
E/MDRMOemCrypto: ERROR: multiple PPS NALUs (not supported by video decoder)
E/MDRMOemCrypto: Failed to process subsamples
E/WVCdm: Decrypt error result in session sid43 during encrypted block: 178

And the corresponding player error:

Player error (videoId: 290ebde0-e14e-46e2-bc16-7a56ec844fdb; currentPosition: 5344000; error: error (UNKNOWN, Type: 260, Extra: -1004))

Further investigation revealed that the issue occurs only when DRM is used and streams are switched. The multi-bitrate/resolution stream without DRM plays just as well as the single-bitrate/resolution stream without DRM. After more Googling, we found a very similar issue reported for ExoPlayer.

We started experimenting with various FFmpeg flags when splitting and merging videos, but what eventually helped was replacing FFmpeg with the mkvmerge command to merge the chunks into one video properly. It is also worth mentioning that we use the Matroska container format for all intermediate steps in the pipeline, such as splitting audio and video tracks and transcoding. We struggled with a tricky A/V sync issue years ago, and the MKV file format helped us keep A/V in sync.

Conclusion

Let’s recap. Horizontal scaling is a very powerful way to boost the performance of video encoding software, even if you don’t have high-end hardware. We managed to lower encoding time significantly by using a simple, custom-built scheduler. That allows us to run encoding workers in different environments, such as the cloud or a farm of physical machines. Moreover, it’s applicable to any costly computation that can be parallelized and distributed.

There are, however, some constraints in this approach to parallel video encoding. The input/mezzanine file must have a regular and closed GoP so that the video can be split properly into smaller segments, which can then be encoded by multiple workers. But, if you do it correctly, you will rejoice.
