
FastAI Deep Learning Part 14.2: Resnets Deep Dive


In the previous post, I covered convolutional neural networks in detail and the importance of batchnorm for a stable training process and faster learning. In this post, I will dive into the following issue:

 "In some convolutional net implementations, deeper models have lead to higher training and validation error, despite of the theoretical possibility of being at least as good as a shalower model" 

To overcome this, the authors of this paper propose a solution that has been widely used in convolutional neural networks since 2015: residual networks (resnets) and their skip connections. Let's get started!


Resnets

Resnets appeared after the observation that some deeper models were giving worse training and test error than their shallower counterparts, which is not what we expect. The authors of the Resnet paper show that even when the two networks share the same weights on the same layers, adding extra layers and updating them through SGD leads to lower performance. This should not happen: if the new layers added no value to the task, the model should be able to learn an identity mapping for them and match the shallower model.

The solution the authors propose is to let those additional layers learn only the residual, i.e. the difference between the desired activation and the activation coming from the earlier layers. SGD finds it much easier to push those residual weights towards zero, which turns the block into an identity mapping, than to learn an exact identity mapping from scratch.

For the training of Resnets we do not need to train two separate neural networks; instead we add skip connections that carry the learned activations directly to later points in the network.


That arrow on the right is just the x part of x + conv2(conv1(x)), and is known as the identity branch or skip connection. The path on the left is the conv2(conv1(x)) part. You can think of the identity path as providing a direct route from the input to the output.


There is one challenge that needs to be overcome: when a block changes the grid size or the number of channels, the output of the convolutional path no longer matches the input, so the two branches cannot simply be added. To match the grid size we can put an average pooling layer on the identity branch, and to match the number of channels we can add a convolution there as well. We want this skip connection to stay as close to an identity map as possible, which means making that convolution as simple as possible: a kernel size of 1, which only does a dot product over the channels of each input pixel. Note that the ReLU activation is applied after the two branches are added together. By doing this we manage to train deeper networks without facing the higher error rates shown in the experiments of the Resnet paper.
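
Here is a minimal sketch of such a residual block in PyTorch. The class and argument names (ResBlock, ni, nf) are mine, and I assume the convention described above: average pooling plus a 1x1 convolution on the identity branch, and ReLU applied after the addition.

import torch.nn as nn
import torch.nn.functional as F

class ResBlock(nn.Module):
    """Sketch of a residual block: y = relu(identity(x) + convs(x))."""
    def __init__(self, ni, nf, stride=1):
        super().__init__()
        # main path: two 3x3 convolutions with batchnorm
        self.convs = nn.Sequential(
            nn.Conv2d(ni, nf, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(nf), nn.ReLU(inplace=True),
            nn.Conv2d(nf, nf, 3, padding=1, bias=False),
            nn.BatchNorm2d(nf))
        # identity path: kept as close to the identity as possible;
        # pool only when striding (stride=2 assumed), 1x1 conv only when
        # the number of channels changes
        self.pool = nn.AvgPool2d(2, ceil_mode=True) if stride != 1 else nn.Identity()
        self.idconv = nn.Conv2d(ni, nf, 1, bias=False) if ni != nf else nn.Identity()

    def forward(self, x):
        # ReLU is applied after the two branches are added
        return F.relu(self.convs(x) + self.idconv(self.pool(x)))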


How skip connections support the training process

In the following paper, several visualization approaches are tested to better understand which architectures lead to better-behaved loss functions. This helps explain why skip connections let us successfully train deeper networks. The image below shows how stunningly smooth the loss function becomes with the use of skip connections.



State of the art Resnets

To finish off, we will briefly explain how the state of the art in Resnets has been achieved. First, based on the insight that most of the computation happens in the early layers while most of the parameters sit in the later layers, we use plain convolutions in the early layers (the stem) and add skip connections only later. This keeps the first layers as fast as possible with no compromise in performance.
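
A sketch of what such a stem might look like, using plain conv + batchnorm + ReLU layers without any skip connections (the helper names below are illustrative):

import torch.nn as nn

def conv_bn_relu(ni, nf, stride=1):
    # a plain convolution + batchnorm + ReLU, with no skip connection
    return nn.Sequential(
        nn.Conv2d(ni, nf, 3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(nf),
        nn.ReLU(inplace=True))

def resnet_stem(*sizes):
    # e.g. resnet_stem(3, 32, 32, 64): three plain conv layers, the first
    # one strided, followed by max pooling to quickly reduce the grid size
    layers = [conv_bn_relu(sizes[i], sizes[i + 1], stride=2 if i == 0 else 1)
              for i in range(len(sizes) - 1)]
    layers.append(nn.MaxPool2d(3, stride=2, padding=1))
    return nn.Sequential(*layers)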

The second trick is the use of bottleneck layers, where 1x1 convolutions are added at the beginning and the end of the block: the first shrinks the number of channels so that the 3x3 convolution in the middle runs on a smaller tensor, and the last expands the channels back out. This speeds up computation and lets us use more filters for the same cost.
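
Here is a hedged sketch of a bottleneck block; the reduction factor of 4 follows the usual ResNet-50 style layout, and the names are mine:

import torch.nn as nn
import torch.nn.functional as F

class BottleneckBlock(nn.Module):
    """Sketch of a bottleneck residual block (reduction factor 4)."""
    def __init__(self, ni, nf):
        super().__init__()
        nh = nf // 4   # reduced number of channels for the 3x3 convolution
        self.convs = nn.Sequential(
            # 1x1 conv shrinks the channels ...
            nn.Conv2d(ni, nh, 1, bias=False), nn.BatchNorm2d(nh), nn.ReLU(inplace=True),
            # ... the 3x3 conv runs on the smaller tensor ...
            nn.Conv2d(nh, nh, 3, padding=1, bias=False), nn.BatchNorm2d(nh), nn.ReLU(inplace=True),
            # ... and a final 1x1 conv expands the channels back out
            nn.Conv2d(nh, nf, 1, bias=False), nn.BatchNorm2d(nf))
        self.idconv = nn.Conv2d(ni, nf, 1, bias=False) if ni != nf else nn.Identity()

    def forward(self, x):
        return F.relu(self.convs(x) + self.idconv(x))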

For a deep dive into both tricks, please refer to the bag of tricks paper.

Link to the repo: notebook

Appendix: Fully Convolutional Networks

Let's clarify an important component of modern convolutional networks. In order to have a neural network that works for different input sizes but still produces a feature map of a fixed size, we need a way to consolidate the final grid of activations. To do that, fully convolutional networks use an adaptive pooling layer, which simply applies an aggregating operation, such as the mean or the maximum, over the whole grid of activations.

In that way we can ensure that, regardless of the input size, we get the desired output size. One thing to keep in mind is that this operation is not desirable for problems where the exact placement of a part of an object is essential to understanding it, such as digit detection. For the majority of natural images, a fully convolutional network with adaptive average pooling is the common practice.
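
For example, PyTorch's nn.AdaptiveAvgPool2d collapses any spatial grid down to the requested output size; the tensor shapes below are only illustrative:

import torch
import torch.nn as nn

pool = nn.AdaptiveAvgPool2d(1)          # average over the whole spatial grid

small = torch.randn(1, 256, 4, 4)       # activations from a small input image
large = torch.randn(1, 256, 12, 12)     # activations from a larger input image

print(pool(small).shape)                # torch.Size([1, 256, 1, 1])
print(pool(large).shape)                # torch.Size([1, 256, 1, 1]) - same size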

This is what a simple model with adaptive average pooling followed by a fully connected layer looks like in PyTorch:

from fastai.vision.all import *   # notebook-style import; provides nn, ConvLayer, Flatten

# `block` and `dls` are defined earlier in the notebook; a plain strided
# ConvLayer (convolution + batchnorm + ReLU) is assumed here for `block`.
def block(ni, nf): return ConvLayer(ni, nf, stride=2)

def get_model():
    return nn.Sequential(
        block(3, 16),
        block(16, 32),
        block(32, 64),
        block(64, 128),
        block(128, 256),
        nn.AdaptiveAvgPool2d(1),   # collapse the final grid to 1x1
        Flatten(),
        nn.Linear(256, dls.c))     # dls.c = number of target classes

With this in mind, we can train a model on different image sizes, and we also do not need to worry about exactly how many layers and how much striding are needed to reach a certain output size.
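
As a quick sanity check (purely illustrative: it assumes the notebook's dls is available and that dls.c comes out to 10 classes), we can feed the model two different image sizes and confirm that the output shape does not change:

import torch

model = get_model()
for size in (64, 128):
    x = torch.randn(2, 3, size, size)   # a batch of 2 RGB images, size x size
    print(size, model(x).shape)         # torch.Size([2, 10]) for both sizes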




