Every day at Showmax, we deliver hundreds of thousands of hours of streaming movies and series to users in more than 70 countries.
To make that experience as good as it can possibly be, we use a variety of machine learning tools to mine useful information. Our deep neural networks process sound and video separately, working to detect and describe specific scenes in the movie or show. For now, the focus is on detecting objects, actions, and locations and then using that metadata for categorization and our recommendation engine.
Basically, we’re building the ability to “understand” the content of each movie algorithmically. This is always challenging: it is not just about recognizing which objects are important in images, it’s also about detecting actions and combining them in higher level stories that render the context, scene, and action in a useful way.
In the last few years, the progress in machine learning and artificial intelligence methods for image and sound processing has been phenomenal. Traditional pattern recognition methods developed over the last 40 years have been rendered largely obsolete by deep learning neural networks. However, the design and implementation of deep learning is part science and part art, and a process of continuous trial and error.
To improve the design and training of convolutional neural networks, one needs to understand how they function. We find it effective to visualize the content of each layer in the network to support its design and tuning. For instance, we have been using this framework extensively to guide the design of new 3D convolutional networks for better processing of Showmax video content.
Thanks to the visualizations, we were able to eliminate networks that were overfitted to the training data and to design new architectures with diverse filters that are more robust and better suited for both transfer and meta learning. Another benefit is that feature visualizations reveal biases in datasets. For example, we trained a network to recognize various human activities and found that it focused on objects in the background instead of the people themselves.
Now, we are open-sourcing a visualization library called Conveiro (from the Greek όνειρο, “dream”) that allows users to analyze convolutional neural networks of various architectures using several techniques.
Visualization of 2D convolutional neural networks
2D convolutional neural networks typically process video frames downscaled to 224 pixels (or smaller). RGB frames are processed independently through a sequence of layers, where pooling layers reduce dimensionality and convolutional layers detect features.
The advantage of convolutional neural networks over classical filters and detectors is that convolutional filters are trainable. Processing full-size images would require too many weights; therefore, each filter (neuron) has just a small receptive field, and the receptive fields of filters in one layer overlap slightly.
To further reduce the number of parameters, filter weights are shared across spatial positions (and, in some architectures, even across layers). Note that, whereas filters in the first layers process small regions of the image and detect mostly edges and corners, neurons further down the layers use already-preprocessed information to detect textures or simple parts of objects. Deep in the network, the information can be post-processed by fully-connected layers trained with error backpropagation.
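As a minimal sketch of what a single convolutional filter computes, here is a hand-crafted vertical-edge filter in numpy. The filter and image are made up for illustration, but trained first-layer filters often end up resembling exactly this kind of edge detector:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2D cross-correlation, as computed inside a CNN layer."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Each output value depends only on a small receptive field.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A hand-crafted vertical-edge filter (Sobel-like); in a CNN these
# weights would be learned rather than fixed.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

# A toy image: dark on the left half, bright on the right half.
image = np.zeros((8, 8))
image[:, 4:] = 1.0

response = conv2d(image, sobel_x)
# The response peaks at the vertical edge and is zero in flat regions.
print(response.max())  # → 4.0
```

The same filter slides over every position of the image, which is exactly why the weights are shared and the parameter count stays small.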
The error here is the difference between the network's output and the required output. One can, for example, try to separate images containing a car from images without cars. Every time the network outputs a wrong prediction, the error is back-propagated to update the weights and improve the prediction. This works surprisingly well, especially when massive datasets of positive and negative examples are used for training. Take a breath, though: the generalization capability of such networks can be quite limited.
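The error-driven training loop can be sketched with a toy one-layer model on a made-up, linearly separable "car vs. no car" task. This is a stand-in for the real network (a single logistic unit instead of a deep CNN), but the mechanics are the same: forward pass, error, gradient update:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy classification task: 2-feature inputs, labels from a linear rule.
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w = np.zeros(2)
b = 0.0
lr = 0.5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(200):
    p = sigmoid(X @ w + b)           # forward pass: predicted probability
    error = p - y                    # difference from the required output
    w -= lr * X.T @ error / len(X)   # gradient of the loss updates the weights
    b -= lr * error.mean()

accuracy = ((sigmoid(X @ w + b) > 0.5) == y).mean()
```

In a deep network the same gradient is propagated backward through every layer by the chain rule; here there is only one layer, so the update is written directly.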
When analyzing a neural network visually, you first have to choose a particular filter within the network. Then, you can inspect which images from the dataset activate your filter the most (left figure). Or you can generate a so-called deep dream image. Deep dream images are generated from random Gaussian noise using gradient ascent. This works by modifying pixels in the source image in order to increase the output activation of the selected filter.
However, it is necessary to bias the optimization process to favor structure (low-frequency features) instead of details (high-frequency features).
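The gradient-ascent idea can be sketched with a toy stand-in for a filter: a fixed pattern whose activation is a dot product, so the gradient is analytic. A real implementation would compute the gradient of an actual filter's activation by automatic differentiation (e.g. in TensorFlow), but the update rule is the same:

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in "filter": its activation is the dot product of the image with
# a fixed low-frequency pattern (a smooth 2D bump).
pattern = np.outer(np.sin(np.linspace(0, np.pi, 32)),
                   np.sin(np.linspace(0, np.pi, 32)))

def activation(img):
    return np.sum(img * pattern)

def gradient(img):
    # d(activation)/d(img) of a dot product is the pattern itself; a real
    # network would obtain this gradient via backpropagation.
    return pattern

# Start from random Gaussian noise, as Deep Dream does.
img = rng.normal(scale=0.1, size=(32, 32))
before = activation(img)

for _ in range(100):
    g = gradient(img)
    g = g / (np.abs(g).mean() + 1e-8)  # normalize the step size
    img += 0.05 * g                    # gradient *ascent* on the activation

after = activation(img)
# The chosen "filter" activates more and more strongly as pixels are updated.
```

Without the low-frequency bias discussed above, this kind of pixel-space ascent tends to produce high-frequency noise that the network loves but humans cannot interpret.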
Color-Decorrelated Fourier Space Visualization
The Conveiro library currently contains Deep Dream and our custom implementation of feature visualization called CDFS (Color-Decorrelated Fourier Space).
You can use our tools to visualize several common ConvNet architectures, such as VGG, ResNet, and Inception out of the box. If you want to explore your own custom architecture, you can easily plug it into our framework and visualize individual convolutional filters.
The algorithm for generating CDFS images consists of the following steps:
- Apply data augmentation to the image. This step prevents various unwanted artifacts that the convolutional networks have a tendency to generate;
- Parameterize the image by Fourier coefficients. Optimizing the image in Fourier space allows easy control of the frequencies of the sinusoids the image is decomposed into;
- Scale the coefficients by their frequencies. Convolutional networks favor high frequencies, which results in images that are hard for a human to understand. Instead, we want to favor low-frequency sinusoids, which lead to more perceptible structure in the generated image;
- Convert coefficients back into an image using inverse Fourier transformation;
- Decorrelate colors in the image. The Fourier space messes up the colors in the image, but Google researchers found that decorrelating the RGB channels for each pixel mitigates this problem;
- Optimize the Fourier coefficients by gradient ascent. Finally, we pass the image as input to the convolutional network and optimize the Fourier coefficients with respect to the activation of the filter we want to visualize.
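The image-generation part of the pipeline (parameterization, frequency scaling, inverse transform, color decorrelation) can be sketched in numpy. The coefficients here are random rather than optimized, and the color-mixing matrix is illustrative only (Lucid, for instance, derives its matrix from ImageNet color statistics); in the real algorithm the coefficients would be updated by an autodiff framework:

```python
import numpy as np

rng = np.random.default_rng(0)
h, w = 64, 64

# Parameterize the image by complex Fourier coefficients, one set per channel.
coeffs = (rng.normal(size=(3, h, w // 2 + 1))
          + 1j * rng.normal(size=(3, h, w // 2 + 1)))

# Scale coefficients down by their spatial frequency to favor low frequencies.
fy = np.fft.fftfreq(h)[:, None]
fx = np.fft.rfftfreq(w)[None, :]
freqs = np.sqrt(fy ** 2 + fx ** 2)
freqs[0, 0] = 1.0 / max(h, w)   # avoid division by zero at the DC term
scaled = coeffs / freqs

# Convert coefficients back into pixel space with the inverse FFT.
img = np.stack([np.fft.irfft2(scaled[c], s=(h, w)) for c in range(3)], axis=-1)

# Decorrelate colors with a fixed channel-mixing matrix (illustrative values).
color_matrix = np.array([[0.56, 0.58, 0.58],
                         [0.19, 0.00, -0.20],
                         [0.04, -0.11, 0.08]])
img = img @ color_matrix.T

# Squash into the valid [0, 1] pixel range before feeding the network.
img = 1.0 / (1.0 + np.exp(-img))
```

Because every step is differentiable, the gradient of the filter activation flows all the way back to the Fourier coefficients, which is what makes the final optimization step possible.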
The steps are illustrated in the following diagram:
Observing the CDFS visualization of two specific filters during training shows the network's vision sharpening and becoming more sensitive to detail over time (as seen below).
Example: Inspecting ResNet-50
ResNet-50 is a 50-layer residual network. Residual networks allow the construction of very deep networks without hitting the typical problems of network depth (vanishing/exploding gradients). The trick is to carry over the input signal to deeper layers of the network and to combine it with the features of that layer as inputs to the next layer. This “trick” allows the networks to reach high image-recognition accuracy.
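The residual "trick" can be sketched in a few lines, with dense layers standing in for the three convolutions of a real ResNet-50 bottleneck block:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Simplified residual block: relu(F(x) + x).

    In ResNet-50, F is a stack of convolutions; two dense layers
    stand in for it here.
    """
    f = relu(x @ w1) @ w2      # the residual branch F(x)
    return relu(f + x)         # input carried over and added to the features

rng = np.random.default_rng(1)
x = rng.normal(size=(4, 16))
w1 = rng.normal(scale=0.1, size=(16, 16))
w2 = rng.normal(scale=0.1, size=(16, 16))

out = residual_block(x, w1, w2)

# With zero weights the block reduces to relu(x): the input passes straight
# through, which is what keeps gradients flowing in very deep networks.
identity_like = residual_block(x, np.zeros((16, 16)), np.zeros((16, 16)))
```

Because the skip connection makes the identity mapping trivially available, each block only has to learn a small correction on top of its input, and stacking 50 or more such blocks remains trainable.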
Equipped with the visualization tool, we can investigate the behavior of ResNet-50 - a rather complex convolutional neural network with the following structure:
A deep convolutional network detects abstract concepts (such as “a person” or “a house”) incrementally. First, the input image is decomposed into corners, edges, and other low-level features (scale 2 and 3). The low-level features are then composed into shapes and parts of objects (scale 3 and 4). Finally, the object primitives are connected together to detect a whole object, which usually happens at the end of the network (scale 5).
In the examples below, we generated a CDFS image and displayed the top 9 images from the training set that activate the filter most strongly. We selected four filters from each segment of the network and displayed visualizations obtained using our visualization library.
For the beginning of the network (scale 2), it is apparent that CDFS visualizations are mostly images with heavy edges and corners. Also, the top images that activate filters are fences and regular structures with contrasting, repetitive patterns.
For scale 3, the visualizations show more natural patterns, and the top images are quite consistent across all selected filters.
Scale 4 visualizations are a bit more complex, but still very natural and consistent.
For scale 5, visualization filters get surprisingly cluttered.
Thanks to our library, we were able to inspect the behavior of pre-trained deep convolutional neural networks like ResNet and Inception Net.
Here is a visual summary of the ResNet inspection using our library:
Our research into new convolutional network architectures involves training these networks on large datasets. These visualizations help us ensure that filters generalize well at all scales of image processing.
Below are some examples of poorly trained filters from our experiments:
From our experiments, it looks like the overfitted network contains a lot of ‘empty’ filters (one appears in the examples above). These filters are not actually empty; they are sensitive to random noise in the training data, which our method cannot visualize because of the explicit penalization of high frequencies in our visualization tool.
Here are some takeaways, clearly observable using our tool:
- Convolutional layers build an understanding of complex patterns like “hairiness” or “curviness” by combining simpler patterns. The lowest-level features - corners and edges - facilitate the detection of textures and primitive shapes, which lead to the detection of whole objects.
- Filters in residual networks get progressively more detailed as the depth of the network increases. This partially explains why networks with hundreds of layers achieve higher performance.
- We observed semantic grouping in the last few convolutional layers (e.g. we found a “spaghetti detector” filter). The final convolutional filters are tuned for specific objects and their detections are combined by the final fully-connected layer to form the prediction.
We started implementing the library before the Lucid library was released, and it took us quite a bit of time to replicate the papers and fine-tune the implementation. Our immediate goal is to extend the library to cover 3D convolutional neural networks that are capable of processing sequences of images (video) at once. In addition, we want to be able to visualize the attention mechanism to further improve our understanding of how the neural network functions. Once the algorithms are developed, we plan to contribute them to the Lucid library as well.
Note: The following people contributed to this article and to the conveiro library: Ondřej Bíža, Pavel Kordík, Antonín Říha, Adam Činčura and Jan Pipek. This work is part of ShowmaxLab.