State of Deep Learning in Computer Vision

Part I - ConvNet Architectures

This article is an extension of a talk I gave at the Czech Technical University (CTU) in July 2017. We surveyed around a hundred deep learning papers and selected the most interesting and important outcomes that will help you to understand the impact of deep learning in computer vision.

Our talk touched on three topics: novel architectures of Convolutional Neural Networks (ConvNets), the Attention Mechanism, and Video Classification. These are the most critical components of our deep learning system. We use these at the ShowmaxLab at CTU to better understand movies.

This is the first of two parts and focuses on ConvNets; the second one will cover the Attention Mechanism.

ConvNet Architectures

Convolutional neural networks (ConvNets) can be used to extract informative features from images, eliminating the need for traditional manual image processing methods. Research on ConvNet architectures has come a long way since Alex Krizhevsky et al. demonstrated their power in the ImageNet 2012 challenge. The theme unifying a large portion of this research is an increase in depth: deeper ConvNets are able to learn more complex mappings, which in turn achieve better accuracy on standard benchmarks such as ImageNet 2012.

Let’s start with the VGG architecture introduced in 2014 by researchers at the University of Oxford. Not only is it the most commonly used ConvNet, it has also greatly influenced the other architectures we will discuss.

schema of VGG-16, source

VGG’s design follows three simple rules:

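  1. Use small 3x3 convolutional filters stacked in sequence: the stack covers the same receptive field as a single larger filter, but with more non-linearities and no more parameters, which increases the representational capacity of the ConvNet.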
  2. Double the number of convolutional filters every time we halve the spatial resolution of the input to prevent information bottlenecks.
  3. Use Rectified Linear Units (ReLU) as an activation function to avoid the problem of saturation.
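
The first two rules can be checked with a little arithmetic. The sketch below is only an illustration: the helper names (`conv_params`, `receptive_field`, `vgg_widths`) and the choice of 64 channels are assumptions made for this example, and biases are ignored. It compares a stack of three 3x3 convolutions with a single 7x7 convolution, then lists the filter counts obtained by doubling from 64 up to a cap of 512.

```python
def conv_params(kernel_size, in_channels, out_channels):
    """Number of weights in one convolutional layer (biases ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels


def receptive_field(num_layers, kernel_size=3):
    """Receptive field of a stack of stride-1 convolutions."""
    return 1 + num_layers * (kernel_size - 1)


def vgg_widths(base=64, stages=5, cap=512):
    """Rule 2: double the filter count at each stage, up to a cap."""
    return [min(base * 2 ** i, cap) for i in range(stages)]


channels = 64  # assumed channel count, chosen only for illustration

# Rule 1: three stacked 3x3 layers see a 7x7 window of the input.
stack_field = receptive_field(num_layers=3)
stack_3x3 = 3 * conv_params(3, channels, channels)
single_7x7 = conv_params(7, channels, channels)

print("receptive field of the stack:", stack_field)
print("weights, three 3x3 layers:", stack_3x3)
print("weights, one 7x7 layer:", single_7x7)
print("filters per stage:", vgg_widths())
```

Under these assumptions the stack needs roughly half the weights of the single large filter while inserting two extra non-linearities between the layers.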

The Oxford researchers have demonstrated that they could train VGG with up to 19 convolutional layers, achieving 25.5% top-1 validation error on ImageNet 2012. Can we go deeper?

VGG (left) vs. ResNet (right), source

It turns out that we cannot go much deeper. Soon after the 19-layer mark we run into the problem of degradation: the added depth starts to hinder the network's learning. Fortunately, the problem can be fixed by introducing simple residual connections.

residual connections, source

Residual connections carry signals from lower layers to higher ones without changing them. This allows errors to be back-propagated through deeper networks. As you can see in the image above, we can design simple residual blocks and build a network from them.
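
To make the idea concrete, here is a minimal sketch of a residual block in plain Python. It is a toy, not the ResNet implementation: the `linear` function is a stand-in for a convolution, batch normalization is left out, and all names are assumptions made for this example.

```python
def relu(values):
    """Element-wise Rectified Linear Unit."""
    return [max(0.0, v) for v in values]


def linear(values, weights):
    """Stand-in for a convolution: a plain matrix-vector product."""
    return [sum(w * v for w, v in zip(row, values)) for row in weights]


def residual_block(x, w1, w2):
    """Compute relu(F(x) + x), where F is linear -> ReLU -> linear."""
    residual = linear(relu(linear(x, w1)), w2)
    # The skip connection adds the unchanged input to the block output.
    return relu([r + xi for r, xi in zip(residual, x)])


x = [1.0, 2.0, 3.0]
zero_weights = [[0.0] * 3 for _ in range(3)]

# With all-zero weights F(x) is zero, so the block passes x through.
y = residual_block(x, zero_weights, zero_weights)
print(y)
```

The zero-weight case shows why such blocks are easy to stack: a block that has learned nothing still forwards its (non-negative) input, instead of destroying it.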

An immensely successful architecture called the Residual Network (ResNet) was introduced by Microsoft Research in 2015. The authors trained ResNets with up to 152 layers and achieved a top-1 error of 21.3% on ImageNet.

BN = Batch Normalization. (b) proposed = pre-activation version, source

The original ResNet applies a ReLU activation after adding the residual connection to the output. This ReLU inhibits learning in very deep ResNets, so Kaiming He et al. proposed a simple fix. The network is restructured so that, instead of applying the activation after the convolutional layers, we apply it as a pre-activation (batch normalization followed by ReLU) before the convolutions, as illustrated in the image above.
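
The difference between the two orderings can be sketched as follows. This is a simplified illustration under stated assumptions: batch normalization is omitted, `linear` stands in for a convolution, and the function names are made up for this example.

```python
def relu(values):
    """Element-wise Rectified Linear Unit."""
    return [max(0.0, v) for v in values]


def linear(values, weights):
    """Stand-in for a convolution: a plain matrix-vector product."""
    return [sum(w * v for w, v in zip(row, values)) for row in weights]


def original_block(x, w1, w2):
    """Original ordering: conv -> ReLU -> conv -> add -> ReLU."""
    residual = linear(relu(linear(x, w1)), w2)
    return relu([r + xi for r, xi in zip(residual, x)])


def preact_block(x, w1, w2):
    """Pre-activation ordering: ReLU -> conv -> ReLU -> conv -> add."""
    residual = linear(relu(linear(relu(x), w1)), w2)
    # Nothing is applied after the addition, so the skip path is untouched.
    return [r + xi for r, xi in zip(residual, x)]


x = [-1.0, 2.0]
zero_weights = [[0.0, 0.0], [0.0, 0.0]]

original_out = original_block(x, zero_weights, zero_weights)
preact_out = preact_block(x, zero_weights, zero_weights)
print("original:", original_out)
print("pre-activation:", preact_out)
```

In this toy setting the original block clips the negative input on its way through, while the pre-activation block returns the input unchanged, which hints at why the signal travels more cleanly through very deep stacks of such blocks.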

The improved residual connections bring the ImageNet top-1 validation error down to 20.7% in ResNet-200 and allow us to successfully train Residual Networks with as many as 1000 convolutional layers.

So far, we have shown that residual connections greatly improve the training of ConvNets. However, ResNets use residual connections quite conservatively, bridging only every two or three convolutional layers. What if we push it to the extreme?

That is the idea behind Densely Connected Convolutional Networks (DenseNets). Similarly to ResNets, DenseNets are built from simple Dense Blocks that are repeated many times in the network.

In the Dense Block, the output of each layer is provided as an input to all subsequent layers. For instance, the output of the green layer in the image below is available to the purple, yellow and orange layers. The benefit of the dense connectivity is that each layer can focus on detecting very specific features of the image and doesn’t have to preserve all the information.
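
The connectivity pattern can be sketched in a few lines. The code below is a toy model, not the DenseNet implementation: feature maps are represented as flat lists of numbers, each "layer" is a hypothetical function that emits a fixed number of new channels (the growth rate, in the DenseNet paper's terminology), and all names are assumptions made for this example.

```python
def dense_block(x, layers):
    """Feed each layer the concatenation of the input and all earlier outputs."""
    features = [x]
    for layer in layers:
        concatenated = [v for feature in features for v in feature]
        features.append(layer(concatenated))
    # The block output is the concatenation of everything produced so far.
    return [v for feature in features for v in feature]


def make_layer(growth_rate, seen_widths):
    """Build a toy layer that records its input width and emits new channels."""
    def layer(inputs):
        seen_widths.append(len(inputs))
        return [sum(inputs)] * growth_rate
    return layer


seen_widths = []
layers = [make_layer(growth_rate=2, seen_widths=seen_widths) for _ in range(3)]

output = dense_block([1.0, 1.0, 1.0, 1.0], layers)
print("input width seen by each layer:", seen_widths)
print("output width:", len(output))
```

Each layer only has to add a small number of new channels, because everything computed earlier in the block remains directly available to it.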

Dense Block, source

While DenseNets show a lot of promise, they are not yet widely used because naive implementations are expensive in terms of memory. However, this will very likely change, and we will see more of them in the future.
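
One way to see where the memory goes, offered here as an assumption about what "naive" means rather than something stated in the talk: if every layer keeps its own copy of its concatenated input, the number of stored channels grows quadratically with depth, whereas storing each feature map once grows linearly. The function names and the channel numbers below are illustrative only.

```python
def naive_stored_channels(in_channels, growth_rate, num_layers):
    """Channels kept when every layer stores its own concatenated input."""
    return sum(in_channels + i * growth_rate for i in range(num_layers))


def shared_stored_channels(in_channels, growth_rate, num_layers):
    """Channels kept when each feature map is stored exactly once."""
    return in_channels + num_layers * growth_rate


# Assumed block sizes, chosen only to show the trend.
for depth in (6, 12):
    naive = naive_stored_channels(64, 32, depth)
    shared = shared_stored_channels(64, 32, depth)
    print(f"{depth} layers: naive={naive}, shared={shared}")
```

Doubling the depth of the block less than doubles the shared count but more than triples the naive one in this example.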

Dense Network, source

There are many more fascinating papers that propose various improvements to the ConvNet design. We find these particularly interesting:

The accuracy of modern ConvNets is matching, if not surpassing, human performance on difficult classification tasks that require understanding abstract concepts in images. There is still room for improvement, but ResNets have satisfactory performance for our purposes.

At ShowmaxLab, we research the use of deep learning for understanding movies. To understand the content of a video, a large number of frames needs to be processed. These frames contain a lot of redundant information, which poses a challenge for neural networks.

Next, we present a component that can highlight important information in images and videos.
