The Showmax Engineering story, Part I.

Here’s a rundown of the main terms we use at Showmax Engineering

Showmax delivers a top-shelf VOD service across the African continent. On top of Hollywood hits, TV series, and local telenovelas, we offer news, music and live sports.

Here on the Showmax tech blog, we write about the many challenges we encounter: media engineering, improving UX, managing peaks during live events, dealing with infrastructure, fending off hackers, and much (much) more. Our posts are typically pretty in-depth, and we often use terms that we see as basic but that may need a proper explanation.

So, we’ve put together a list of terms and definitions for the core part of our business: media engineering. Here, we start with the basics, but we will get into more complex terms and topics in part II of this series.

The basic terms of Showmax Engineering

Over the top (OTT) - When the content is transferred to viewers over the internet.

Video on demand (VOD) - A service that allows users to access video they want to watch directly without sticking to the TV guide schedule as known from regular TV broadcasting.

VOD business models:

Advertising-based (AVOD) - Access is usually free for customers and the service is supported by advertisers.
Subscription (SVOD) - Access is behind a paywall and customers pay for monthly, weekly, or daily access. Subscribers can consume as much content as they want. This is what we focus on at Showmax.
Transactional (TVOD) - Access is behind a paywall, but it’s pay-per-view: customers pay for access to a particular piece of content or for a set amount of time.
Premium (PVOD) - Premium video on demand is a version of TVOD, but the price is much higher and the rules are stricter. You can watch the movie while it’s still being shown in US cinemas, but you can watch it only once.

Encoding & Decoding

Accessing content and making content available for users is the core of media engineering. What do we do with the media to get it ready for transfer and playback on any player? We encode it and prepare it to be decoded.

Video & Audio codecs/formats are algorithms that encode or decode a digital data stream or signal. Encoders compress video data (e.g. for transfer), while decoders do the opposite and decompress it (e.g. for display). That’s why you have an encoder in your camera and a decoder in your TV. The most popular video codec is H.264/AVC, which is being replaced by its successor, H.265/HEVC, while the royalty-free VP9 is being replaced by its successor, AV1. Audio codecs include Opus, the AAC family (AAC-LC, HE-AAC), and the line of Dolby codecs such as Dolby Digital/AC-3, Dolby Digital Plus/E-AC-3, and Dolby AC-4 (Dolby Atmos is an immersive audio format carried by codecs such as E-AC-3 or AC-4).
File Formats/Containers are like shells that wrap encoded video, audio and other data into one file. That data is also called “tracks” within the container, and the process of creating/changing the container is called “transmuxing.” Popular containers are:

  • MPEG-TS, used in broadcasting and streaming
  • MPEG-4, popular for streaming video over the internet and storing video files (in all of its variants: MP4, fMP4, MOV)
  • MKV, a versatile format used mostly for distribution of whole video files, supporting a vast number of audio and video codecs while being royalty-free
  • WebM, which is quickly gaining popularity as a format for video content on websites, replacing GIF images.

Figure: An example of a video container with multiple audio and subtitle tracks.
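To make the idea of tracks and transmuxing a bit more concrete, here is a minimal sketch in Python. The `Track` and `Container` classes and the `transmux` function are hypothetical names invented for this illustration, and the toy model ignores the fact that real containers restrict which codecs they can carry.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Track:
    kind: str               # "video", "audio" or "subtitle"
    codec: str              # how the track's data is encoded
    language: str = "und"   # "und" = undetermined


@dataclass(frozen=True)
class Container:
    fmt: str                   # container format, e.g. "MKV" or "MP4"
    tracks: tuple[Track, ...]  # the wrapped tracks


def transmux(source: Container, target_fmt: str) -> Container:
    """Rewrap the same tracks in another container; nothing is re-encoded."""
    return replace(source, fmt=target_fmt)


# Hypothetical title with one video track, two audio tracks and one subtitle track.
movie = Container(
    fmt="MKV",
    tracks=(
        Track("video", "H.264"),
        Track("audio", "AAC", "en"),
        Track("audio", "AAC", "fr"),
        Track("subtitle", "WebVTT", "en"),
    ),
)

remuxed = transmux(movie, "MP4")
print(remuxed.fmt, len(remuxed.tracks))
```

The point of the sketch is that only the wrapper changes: the tracks in `remuxed` are the same objects as in `movie`, which is what separates transmuxing from re-encoding.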

Bit rate is the number of bits used to code one second of video/audio data, measured in bits per second (bps). It determines what throughput or bandwidth is required to stream the video at a particular quality over the network. More bits preserve more information from the signal; fewer bits mean that some of that information is lost. The quality of a codec is measured by how well it compresses, in other words, how much information it is able to store in a given number of bits. Better compression requires more computational power for encoding, and the same is often true for decoding.
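As a rough illustration of how bit rate relates to stream size and required bandwidth, here is a minimal sketch in Python. The function names and the example figures (a 5 Mbps video track plus a 128 kbps audio track) are assumptions chosen for illustration, not actual Showmax encoding settings, and container overhead is ignored.

```python
def stream_size_bytes(bitrate_bps: int, duration_s: float) -> float:
    """Approximate payload size of a stream encoded at a constant bit rate."""
    return bitrate_bps * duration_s / 8  # 8 bits per byte


def fits_bandwidth(track_bitrates_bps: list[int], bandwidth_bps: int) -> bool:
    """True if the combined bit rate of all tracks fits the available bandwidth."""
    return sum(track_bitrates_bps) <= bandwidth_bps


video_bps = 5_000_000  # hypothetical 5 Mbps video track
audio_bps = 128_000    # hypothetical 128 kbps audio track
one_hour_s = 60 * 60

size_gb = stream_size_bytes(video_bps + audio_bps, one_hour_s) / 1e9
print(f"{size_gb:.2f} GB per hour")  # 2.31 GB per hour
```

With these numbers, a viewer on a 6 Mbps connection could stream the title, while a viewer on a 5 Mbps connection would need a lower-bit-rate rendition.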

Resolution is the number of pixels in the X and Y dimensions used for coding frames in video. There are several standardized resolutions based on height:

  • Standard Definition = SD = 576p
  • High Definition = HD = 720p
  • FullHD = FHD = 1080p
  • 4K Ultra High Definition = 4K UHD = 2160p
  • 8K Ultra High Definition = 8K UHD = 4320p

Resolution and bit rate are often used when describing quality. The ratio between width and height is called the aspect ratio, commonly expressed as two numbers separated by a colon, most commonly 4:3 and 16:9. We recognize 3 kinds of aspect ratio:

  • Display aspect ratio (DAR)
  • Storage aspect ratio (SAR)
  • Pixel aspect ratio (PAR)

The relationship between them can be expressed as PAR = DAR/SAR.
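
The formula can be checked with a short worked example. This is a minimal sketch in Python: the function name `pixel_aspect_ratio` is made up for the illustration, and SAR is taken to be the stored width divided by the stored height.

```python
from fractions import Fraction


def pixel_aspect_ratio(width: int, height: int, dar: Fraction) -> Fraction:
    """PAR = DAR / SAR, where SAR is the stored width divided by the stored height."""
    sar = Fraction(width, height)
    return dar / sar


# A 720x576 frame that should be displayed as 16:9 needs non-square pixels.
par = pixel_aspect_ratio(720, 576, Fraction(16, 9))
print(par)  # 64/45

# A 1920x1080 frame displayed as 16:9 has square pixels.
print(pixel_aspect_ratio(1920, 1080, Fraction(16, 9)))  # 1
```

A PAR of 1 means the pixels are square, so the stored frame can be shown as-is. Any other value means the player has to stretch the picture horizontally or vertically to reach the intended display shape.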
The encoded video data actually consists of a series of compressed still images called frames. Codecs encode a frame either as a whole image (I-frames), with reference to previous images that must be decoded first (P-frames), or with reference to both previous and subsequent images (B-frames). I-frames require the most bits and B-frames the fewest. The number of consecutive I/B/P frames may impact things like overall visual quality and seeking speed. A lack of I-frames may cause corrupted output, like gray areas after seeking, but it usually heals itself once the decoder hits an I-frame and redraws the whole picture.

Maximising encoding power with scaling, and minimising encoding time with an in-house scheduler

The Showmax Media Engineering team maximises the performance of our encoding software by using horizontal scaling (read how). The system is powerful even if the encoding runs on hardware that isn’t necessarily high-end. Whether we use a server or a developer’s laptop, the output remains the same.

A parallel encoding pipeline goes hand-in-hand with splitting encoding into small, logical units, which is a nice thing to have when things go wrong. Encoding jobs in parallel pipelines are scheduled by our very own scheduler using the API. We considered using solutions like Celery or Apache Airflow, but concluded that fitting these existing schedulers to our needs would be harder than just building our own custom solution.

In the Content Management System, we incorporated our own simple scheduler that helps us reduce encoding time significantly. To share files between the encoders, we use GlusterFS.
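
The general idea of splitting an encode into small units and running them in parallel can be sketched as follows. This is a generic illustration in Python and not the Showmax scheduler: the function names, the chunk length, and the use of a thread pool are all assumptions, and `encode_chunk` is a stand-in that returns a label instead of calling a real encoder.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(duration_s: int, chunk_s: int) -> list[tuple[int, int]]:
    """Cut a title of `duration_s` seconds into (start, end) ranges."""
    return [
        (start, min(start + chunk_s, duration_s))
        for start in range(0, duration_s, chunk_s)
    ]


def encode_chunk(chunk: tuple[int, int]) -> str:
    """Stand-in for a real encoder call; returns a label instead of media."""
    start, end = chunk
    return f"encoded[{start}-{end}]"


def encode_parallel(duration_s: int, chunk_s: int, workers: int = 4) -> list[str]:
    """Encode all chunks concurrently and return the results in timeline order."""
    chunks = split_into_chunks(duration_s, chunk_s)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so chunks can be joined as-is.
        return list(pool.map(encode_chunk, chunks))


print(encode_parallel(duration_s=25, chunk_s=10))
# ['encoded[0-10]', 'encoded[10-20]', 'encoded[20-25]']
```

Because each chunk is an independent unit, a failed chunk can be retried on its own without restarting the whole encode.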

To learn more, and to get a wider perspective on media engineering at Showmax, wait for part II of this vocabulary exercise. It will focus on decoding, live streaming, data saving, bitrates and more.

More frequently used terms in media encoding

Frame rate (measured in frames per second, FPS) defines how many images are encoded for one second of video, and then displayed at that frequency. It is usually expressed as a number, but can also be expressed as a fraction, e.g. 25/1. Higher FPS means a smoother overall video, but also larger video files and higher bitrates. Common frame rates are 24, 25, and 30 FPS, along with the NTSC-derived 23.976 (24000/1001) and 29.97 (30000/1001). High values like 120 are ideal for slow motion: footage can be slowed down 4x to 30 FPS and playback is still smooth.
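
Expressing frame rates as fractions avoids rounding errors, which a small Python sketch can show. The constant and function names here are made up for the illustration.

```python
from fractions import Fraction

# NTSC-derived rates are exact fractions, not round decimals.
NTSC_FILM = Fraction(24000, 1001)   # commonly written as 23.976
NTSC_VIDEO = Fraction(30000, 1001)  # commonly written as 29.97


def frame_duration_ms(fps: Fraction) -> float:
    """How long a single frame stays on screen, in milliseconds."""
    return float(1000 / fps)


def slow_motion_factor(capture_fps: int, playback_fps: int) -> Fraction:
    """How many times slower footage plays when every captured frame is shown."""
    return Fraction(capture_fps, playback_fps)


print(frame_duration_ms(Fraction(25, 1)))  # 40.0
print(round(float(NTSC_VIDEO), 2))         # 29.97
print(slow_motion_factor(120, 30))         # 4
```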

Interlacing is a technique of visually doubling the frame rate by combining two consecutive half-height images (fields) into one frame, one on the odd lines and one on the even lines. This is usually applied when it’s beneficial to have a higher frame rate, but there are space and bandwidth limitations. However, decoded interlaced frames are visually very unpleasant when shown as-is and have to be deinterlaced. Most modern codecs are optimized for progressive scan rather than interlaced scan, since most video is now transmitted and stored digitally.

Group of Pictures (GoP) is a group of ordered inter-frames and intra-frames, and its size is expressed as a whole number. Each GoP starts with an intra-frame (I-frame), followed by bi-predictive and predictive frames. Even though GoP is usually expressed as one number, it can sometimes be described as two numbers: the distance between two intra-frames and the distance between the intra-frame and a predictive frame. A group of pictures can be either open or closed. Predictive and bi-predictive frames within an open GoP can reference frames from other GoPs, but predictive and bi-predictive frames within a closed GoP can only reference frames within the same GoP. This is important, especially for streaming video with adaptive bitrate, because streaming with open GoPs can lead to references to invalid frames, which distorts the video and hurts UX.

Figure: The difference between open and closed groups of pictures. Notice the P-Frame referencing the B-Frame from the previous GoP.
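
The difference between open and closed GoPs can also be expressed as a simple check. The following Python sketch uses a made-up `Frame` model in which every frame lists the indices of the frames it references, and the toy stream is listed in decode order. It illustrates the definition and is not how a real decoder or encoder represents frames.

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    index: int   # position in the (decode-ordered) stream
    kind: str    # "I", "P" or "B"
    refs: list[int] = field(default_factory=list)  # indices of referenced frames


def split_into_gops(frames: list[Frame]) -> list[list[Frame]]:
    """Start a new GoP at every I-frame."""
    gops: list[list[Frame]] = []
    for frame in frames:
        if frame.kind == "I" or not gops:
            gops.append([])
        gops[-1].append(frame)
    return gops


def is_closed(gop: list[Frame]) -> bool:
    """A GoP is closed if every reference points to a frame inside the same GoP."""
    members = {f.index for f in gop}
    return all(ref in members for f in gop for ref in f.refs)


# Hypothetical stream: the B-frame in the second GoP references frame 1,
# which belongs to the first GoP, so the second GoP is open.
stream = [
    Frame(0, "I"), Frame(1, "P", [0]), Frame(2, "B", [0, 1]),
    Frame(3, "I"), Frame(4, "B", [1, 3]), Frame(5, "P", [3]),
]

gops = split_into_gops(stream)
print([is_closed(g) for g in gops])  # [True, False]
```

If a player switches renditions at the start of the second GoP, frame 1 from the new rendition was never decoded, which is exactly the invalid reference that open GoPs can cause with adaptive bitrate streaming.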

Color space is an organization of colors and their combinations. Depending on the application, we have to choose between two models: subtractive and additive. In the subtractive model, colors are produced by removing (absorbing) parts of the light, as inks do on white paper. In the additive model, colors are produced by adding light of different colors together, as the pixels of a display do. Colors are analog values, so we have to use a discrete representation of combinations to store colors digitally. The most common color spaces used today are based on the RGB additive model and the CMYK subtractive model.

Lip syncing is the technique of matching the audio of people speaking or singing to the actual video of those people. The most common forms of lip syncing are dubbing and voice-overs. In video streaming, poor lip sync is a sign that audio and video are misaligned for various reasons, like slow decoding of the video stream, or frame drops that put the video behind the audio.

Decoding timestamp (DTS) tells us when we need to decode a frame. For example, say we have a sequence of bi-predictive frames referencing a predictive frame in the future. To be able to decode these B-frames, we have to decode the P-frame first.

Presentation timestamp (PTS) expresses the actual time when a decoded frame should be displayed. This is often different from DTS.
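
The relationship between the two timestamps can be shown with a toy example. In this Python sketch, the `Frame` class and the timestamp values are made up for illustration, with timestamps given as abstract ticks rather than real clock units.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    kind: str  # "I", "P" or "B"
    dts: int   # when the frame must be decoded
    pts: int   # when the frame must be displayed


# The P-frame is decoded before the two B-frames that reference it,
# but it is displayed after them.
stream = [
    Frame("I", dts=0, pts=1),
    Frame("P", dts=1, pts=4),
    Frame("B", dts=2, pts=2),
    Frame("B", dts=3, pts=3),
]

decode_order = [f.kind for f in sorted(stream, key=lambda f: f.dts)]
display_order = [f.kind for f in sorted(stream, key=lambda f: f.pts)]

print(decode_order)   # ['I', 'P', 'B', 'B']
print(display_order)  # ['I', 'B', 'B', 'P']
```

Note that no frame in the example has a PTS earlier than its DTS, since a frame cannot be shown before it has been decoded.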

Interested in Media Engineering? Join us, we are hiring! Check out the open position in our team.

Please check the original version of this article at