Convolutional Neural Networks (Part 2)

In this section, we cover practical implementations of different convolutional deep neural networks. Rather than going through each one in detail, I find it particularly useful to point out some patterns shared by most of the architectures:

  • Architectures tend to increase the number of layers, and hence the number of parameters required. This has been possible thanks to the widespread use of multiple GPUs in both research and industry. 
  • The basic layer principles remain: most architectures start with a series of convolution -> activation -> pooling blocks and end with dense layers and, for multiclass problems, a final softmax (see the sketch after this list). 
  • Many architectures obtain remarkably good results with very small and simple convolution and pooling layers while keeping a large number of channels. 
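
As a rough illustration, here is a minimal sketch of that pattern in TensorFlow/Keras (the input size, filter counts, and the ten output classes are arbitrary choices for the example, not taken from any particular paper):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(224, 224, 3)),
    # convolution -> activation -> pooling, repeated with growing channel counts
    tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(128, 3, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    # dense layers and a final softmax for the multiclass output
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),  # 10 classes, chosen arbitrarily
])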


ResNets, or residual nets, and why do they work so well?

ResNets are built from blocks called residual blocks. These blocks let an intermediate input skip ahead to layers deeper in the network through a shortcut connection, instead of having to pass through every transformation in between.


The benefit of these networks is that, as they get deeper, the training error keeps going down and the gradients neither explode nor vanish. In a way, the shortcut makes sure that something already learned can pass by untouched instead of going through a lot of initially random transformations. With sufficient training, the network could in principle learn that the weights in between should implement an identity mapping, but the residual connection simply makes sure this is easy to do.
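
A minimal sketch of a residual block, assuming TensorFlow/Keras and an input that already has the same number of channels as the block's filters, could look like this:

import tensorflow as tf

def residual_block(x, filters=64):
    # shortcut: the input passes by untouched
    shortcut = x
    y = tf.keras.layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = tf.keras.layers.Conv2D(filters, 3, padding="same")(y)
    # add the shortcut back before the final activation; if the weights end up
    # near zero, the block simply behaves like the identity
    y = tf.keras.layers.Add()([y, shortcut])
    return tf.keras.layers.Activation("relu")(y)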

Why would a 1x1 convolution help my NNs?!

A 1x1 convolution takes the input volume and applies a linear/nonlinear transformation that keeps the spatial size, with the number of output channels equal to the number of filters. This is very useful to shrink the number of channels, aggregating them properly. Again, this is more engineering than statistics, but it is very handy to model more complex functions or to reduce the number of channels as desired.
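
As a tiny illustration (TensorFlow/Keras assumed, shapes chosen arbitrarily), a 1x1 convolution with 32 filters maps a 28x28x192 volume to 28x28x32:

import tensorflow as tf

inputs = tf.keras.Input(shape=(28, 28, 192))                # wide 192-channel volume
reduced = tf.keras.layers.Conv2D(32, 1, activation="relu")(inputs)
# reduced has shape (28, 28, 32): same spatial size, far fewer channels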

Inception networks, yet another buzzword? Rather, think of the bottleneck layer

For layers with many channels, for example 192, applying a 5x5 convolution with 32 filters directly leads to over 100 million multiplications. If we instead first apply a 1x1 convolution with 16 filters, we can reduce the cost dramatically, to roughly 12.4 million multiplications, while still producing a fairly large 28x28x32 output volume. This intermediate layer is called the bottleneck layer; it is what lets inception modules concatenate several branches while saving around 90% of the computation. A trick worth knowing!
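
A quick back-of-the-envelope check of those numbers, assuming a 28x28x192 input, a 28x28x32 output with "same" padding, and a 16-filter bottleneck:

direct = 28 * 28 * 32 * 5 * 5 * 192        # 5x5 conv applied directly to 192 channels
print(direct)                               # 120,422,400 multiplications

bottleneck = 28 * 28 * 16 * 1 * 1 * 192     # 1x1 conv down to 16 channels
bottleneck += 28 * 28 * 32 * 5 * 5 * 16     # 5x5 conv on the reduced volume
print(bottleneck)                           # 12,443,648 multiplications, ~90% saved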



Very deep neural networks, which are normally the ones hitting the top rankings in competitions and benchmarks, stack blocks of fairly similar convolutional modules (activation -> conv 1x1 -> conv 3x3 -> conv 5x5 -> channel concatenation). To avoid overfitting, auxiliary "side outputs" are attached at intermediate layers to check the classification power at different depths and to decide whether going deeper is worth it.
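
For intuition, a rough sketch of an Inception-style module in TensorFlow/Keras might look like the following (filter counts are illustrative, loosely in the spirit of GoogLeNet, not an exact reproduction):

import tensorflow as tf

def inception_module(x):
    # 1x1 branch
    b1 = tf.keras.layers.Conv2D(64, 1, padding="same", activation="relu")(x)
    # 1x1 bottleneck followed by 3x3
    b2 = tf.keras.layers.Conv2D(96, 1, padding="same", activation="relu")(x)
    b2 = tf.keras.layers.Conv2D(128, 3, padding="same", activation="relu")(b2)
    # 1x1 bottleneck followed by 5x5
    b3 = tf.keras.layers.Conv2D(16, 1, padding="same", activation="relu")(x)
    b3 = tf.keras.layers.Conv2D(32, 5, padding="same", activation="relu")(b3)
    # pooling branch with a 1x1 projection
    b4 = tf.keras.layers.MaxPooling2D(3, strides=1, padding="same")(x)
    b4 = tf.keras.layers.Conv2D(32, 1, padding="same", activation="relu")(b4)
    # concatenate all branches along the channel axis
    return tf.keras.layers.Concatenate()([b1, b2, b3, b4])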




MobileNet, another foundational convolutional NN? Making them work on phones

Due to the limited computing power of phones, new architectures were required to run these models more efficiently. Everywhere you would use an expensive standard convolution, you can instead use a depthwise convolution followed by a pointwise (1x1) convolution, repeated over several layers, to end up with a much less parameter-intensive architecture. To keep it short, think of this as a factored matrix multiplication that reduces the size of the volume from which to keep learning, while retaining most of the information. Note that residual connections were also added to speed up backward propagation on top of the whole enchilada!
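
A minimal sketch of such a depthwise separable block, assuming TensorFlow/Keras:

import tensorflow as tf

def depthwise_separable_block(x, filters=64):
    # depthwise 3x3: each input channel is filtered independently
    y = tf.keras.layers.DepthwiseConv2D(3, padding="same", activation="relu")(x)
    # pointwise 1x1: mixes the channels, at a fraction of the cost of a full conv
    return tf.keras.layers.Conv2D(filters, 1, padding="same", activation="relu")(y)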




EfficientNet, can we further reduce parameters with little performance tradeoff?

How can we scale things up or down, and what are the tradeoffs?
  • change the resolution of the input image (r)
  • change the depth of the network (d)
  • change the width of the layers (w)
EfficientNet was able to find a strong tradeoff between r, d, and w while achieving state-of-the-art performance. You can download the variant that satisfies your memory constraints and accuracy goals. 
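
For example, assuming TensorFlow/Keras, the pre-trained variants range from the small B0 to the large B7, and picking one is a single call (parameter counts in the comments are approximate):

import tensorflow as tf

small = tf.keras.applications.EfficientNetB0(weights="imagenet")  # ~5M parameters
large = tf.keras.applications.EfficientNetB7(weights="imagenet")  # ~66M parameters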



https://ai.googleblog.com/2019/05/efficientnet-improving-accuracy-and.html


In order to minimize compute and achieve state-of-the-art results, leveraging what other networks have already learned is very important. In the next section we explain the concept of transfer learning.


What is Transfer Learning? You should know it, you should do it!

Transfer learning means leveraging what other, already trained neural networks have learned and applying it to your own problem. This is helpful because you can get good results much faster, either by directly using pre-trained models on data similar to yours or by retraining, starting from the weights of the original model. The latter is actually what leads to state of the art in many cases, particularly when your problem is different from the one the model was pre-trained on.

So, in order to get faster, cheaper, and better results, do not start from scratch; rather, leverage transfer learning to achieve the best for your problem with less effort and data.

How do you do it? Normally you freeze multiple layers and only train the softmax or last layer for the prediction of your classes or your regression target. That is the easiest option: you only have to compute your feature vectors with the frozen model, keeping in mind that your input has to be adapted to its expected format. Storing those feature vectors on disk is a good way to speed up the computation at scale. If more customization is required because performance is not as desired, unfreezing more layers increases flexibility and therefore performance, at the cost of more computation time. The more data you have, the more promising that approach becomes, especially when your data is different from the original dataset. It is probably wise to start from the last layer and work backwards based on performance, time, and the goals of the project.
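
A minimal sketch of that recipe, assuming TensorFlow/Keras, a MobileNetV2 backbone, and a hypothetical ten-class problem:

import tensorflow as tf

num_classes = 10  # hypothetical number of target classes

base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False  # freeze all pre-trained layers

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),                  # the feature vector
    tf.keras.layers.Dense(num_classes, activation="softmax"),  # the only layer trained
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# If performance is not good enough, set base.trainable = True (or unfreeze
# only the last few layers) and retrain with a low learning rate.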


Data Augmentation, or how to augment your training data without more samples

Applying crops, rotations, noise, and color augmentations can increase your training data set and help the model generalize better. The latest self-supervised learning models, such as SimCLR and MoCo, are good examples of the power of data augmentations to actually create a target variable (positive and negative samples). I will cover MoCo in the coming weeks.
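
A minimal sketch of such augmentations as Keras preprocessing layers (assuming a recent TensorFlow version, 2.6 or later, where these layers live under tf.keras.layers):

import tensorflow as tf

augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),   # random horizontal flips
    tf.keras.layers.RandomRotation(0.1),        # small random rotations
    tf.keras.layers.RandomZoom(0.1),            # random zoom (crop-like effect)
    tf.keras.layers.RandomContrast(0.2),        # simple color/contrast jitter
])
# applied on the fly during training, so no extra samples need to be stored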


Final Thoughts

One may ask why so much effort has been put into the architectures. The main reason is that getting more data, particularly labeled data, is very costly. With the rise and open-sourcing of pre-trained models, not having much data is no longer a barrier to reaching state-of-the-art performance for your application with less data than the benchmarks. Note also that techniques such as augmentations and self-supervised frameworks can represent whatever we want from an image, with zero labelling effort.





