State of Deep Learning in Computer Vision

Part II - The Attention Mechanism

This article is an extension of a talk I gave at the Czech Technical University (CTU) in July 2017. We surveyed around a hundred deep learning papers and selected the most interesting and important outcomes that will help you to understand the impact of deep learning in computer vision.

Our talk touched on three topics: novel architectures of Convolutional Neural Networks (ConvNets), the Attention Mechanism, and Video Classification. These are the most critical components of our deep learning system. We use these at the ShowmaxLab at CTU to better understand movies.

This is the second of two parts, and it focuses on Attention. If you have not already, please check the first article about ConvNet Architectures.

The Attention Mechanism

The attention mechanism plays a vital role in human cognition. Consequently, there have been many attempts to replicate it, first in Computational Neuroscience and later in Machine Learning.

More recently, attention-based models have seen huge success in Natural Language Processing followed by several successful applications in Deep Learning models for Computer Vision.

In Show, Attend and Tell (K. Xu et al.) an attention-based neural network is used to generate captions for images. The images are first processed by a standard ConvNet and a Recurrent Neural Network (RNN) with attention is then run for several steps to generate the text. The RNN is enhanced with the ability to decide on which part of an image to focus next.
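To make the idea of "deciding where to focus" concrete, here is a minimal sketch of a single soft attention step in plain Python. It assumes the ConvNet has already turned the image into one feature vector per region, and that the RNN exposes its current hidden state. The names `soft_attention`, `region_features` and `hidden_state` are ours, and the dot-product scoring is a simplification: the paper scores each region with a small learned network instead.

```python
import math

def softmax(scores):
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def soft_attention(region_features, hidden_state):
    """One attention step: weight image regions by relevance to the RNN state."""
    # Assumption: relevance is a plain dot product. The original model
    # learns this scoring function instead of fixing it.
    scores = [dot(f, hidden_state) for f in region_features]
    weights = softmax(scores)  # non-negative, sums to 1 over regions
    dim = len(region_features[0])
    # The context vector is the attention-weighted average of region features.
    context = [
        sum(w * f[i] for w, f in zip(weights, region_features))
        for i in range(dim)
    ]
    return weights, context

# Toy example: three image regions with 2-dimensional features.
regions = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
hidden = [2.0, 0.0]
weights, context = soft_attention(regions, hidden)
```

In this toy example the first region receives the largest weight because its features align best with the hidden state. The context vector would then be fed to the RNN to help generate the next word, and the weights are what visualizations of the model's focus are drawn from.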

Pictured below, you can see how the neural network generated captions. The white halo suggests which parts of the image it attends to.

An example of Sequential Attention, source

The attention mechanism brings two important benefits to the system. First and foremost, allowing the network to manage its focus improves the quality of the generated captions. For instance, the network can first focus on the subject of the picture and then shift its attention to the background. Additionally, it adds a layer of interpretability much needed in modern Deep Learning systems.

Show, Attend and Tell employs attention in the form of a sequence of decisions made by a Recurrent Neural Network. However, for the majority of Computer Vision tasks, ConvNets are preferred. As a consequence, it is beneficial to explore ConvNets enhanced with attention.

Residual Attention Networks (ResNetAttention) are a successful iteration of a ConvNet architecture involving an attention mechanism. Based on the ResNet we introduced in the first article, ResNetAttention is extended with an Attention Module. Through the bottleneck design, the attention module can reason about large sections of an image and decide which region to attend to. Furthermore, instead of having a single centralized attention, the module is inserted at various depths in the network. Consequently, the ConvNet can use low-level as well as high-level information to highlight important regions of an image.

Inner workings of the Attention Module, source

Above, the inner workings of the Attention Module are visualized. The module outputs soft attention masks that can assign varying importance to different parts of the image.
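As a rough illustration of how such a soft mask acts on features, the sketch below applies a mask to a tiny 2D feature map. It follows the attention residual formulation from the Residual Attention Network paper, where the output is (1 + M(x)) * T(x), with T(x) the regular (trunk) features and M(x) a mask squashed into the range 0 to 1. The function name and the hand-picked mask values are ours; in the real network the mask is produced by a learned branch that downsamples and then upsamples the input.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def attention_residual(trunk, mask_logits):
    """Scale trunk features by (1 + soft mask), element by element."""
    out = []
    for t_row, m_row in zip(trunk, mask_logits):
        # sigmoid keeps each mask value in (0, 1); adding 1 means a mask
        # close to zero leaves the feature unchanged instead of erasing it.
        out.append([(1.0 + sigmoid(m)) * t for t, m in zip(t_row, m_row)])
    return out

# Toy 2x2 feature map and hypothetical mask logits (assumed, not learned).
trunk = [[1.0, 2.0], [3.0, 4.0]]
mask_logits = [[-10.0, 0.0], [0.0, 10.0]]
out = attention_residual(trunk, mask_logits)
```

With these values, the top-left feature is left almost untouched (mask near 0), while the bottom-right feature is nearly doubled (mask near 1). The "1 +" term is the design choice worth noting: stacking many plain multiplicative masks would shrink the features layer after layer, whereas this form can only emphasize.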

Finally, we include a comparison of vanilla ResNet and a ResNet enhanced with attention in the table below. ResNetAttention with 56 convolutional layers (Attention-56) beats ResNet with 152 convolutional layers (ResNet-152) while requiring roughly half the parameters and computations. The Residual Attention Module was included in the winning submission for the ImageNet 2017 object detection challenge.

Comparison of ResNet and ResNetAttention, source

In summary, the basic ConvNet architecture has seen many improvements in the past few years, including Residual Connections and the Attention Module. They allow you to train powerful classifiers while keeping the number of parameters and computations low.

Ultimately, advancements in image classification are worth pursuing as they (back)propagate through the whole field of Computer Vision and foster many new exciting applications.

At ShowmaxLab at CTU, we are currently developing state-of-the-art solutions based on the methods described in this article to improve video understanding. If you’d like to work with us, or learn more, contact us at

Please check the original version of this article at