
Convolutional Neural Networks (Part 1)

A great deal of our perception of the world comes from vision, and it is fair to say that a great deal of communication depends on visual language too. As businesses digitalize, product and service images become one of the key touchpoints with consumers, and hence a very important data point for managing a business and for making decisions as consumers. Computer vision is the science of making computers process vision very much in the same way we do.

In this blog post, I am going to summarize the main learnings I got from the deep learning specialization, and more concretely from the convolutional neural networks course at https://www.deeplearning.ai.





Understanding convolutions

Convolutions are at the core of many computer vision algorithms. The idea is simple: in order to identify key features such as vertical or horizontal edges, we slide a small filter (the convolution kernel) across the image, multiplying each pixel in a patch by the corresponding filter value and summing the products. The following image shows how we identify edges with the 3x3 convolution for vertical edge detection. Note that each cell in the green box (the 3x3 submatrix at the top left of the 6x6 matrix) is multiplied by the corresponding cell of the filter. Note that the resulting matrix has 0 for flat areas and 30 for the edges. It's like finding the ridges and valleys in the Alps : )
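To make this concrete, here is a minimal NumPy sketch of the operation. I am assuming the 6x6 image from the course example, with a bright (10) left half and a dark (0) right half; the specific values are an illustration, not taken from the figure above.

import numpy as np

# 6x6 grayscale image: bright (10) left half, dark (0) right half
image = np.array([[10, 10, 10, 0, 0, 0]] * 6)

# 3x3 vertical edge detection filter
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])

def convolve2d(img, k):
    kh, kw = k.shape
    out_h = img.shape[0] - kh + 1
    out_w = img.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=img.dtype)
    for i in range(out_h):
        for j in range(out_w):
            # element-wise multiply the patch with the filter, then sum
            out[i, j] = np.sum(img[i:i+kh, j:j+kw] * k)
    return out

print(convolve2d(image, kernel))
# Each output row is [0, 30, 30, 0]: 0 on flat areas, 30 on the edge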




Depending on the task or goal, we would use different convolutions or filters, some of which are relatively old but still very effective, such as the Canny edge detector: https://medium.com/codex/sobel-vs-canny-edge-detection-techniques-step-by-step-implementation-11ae6103a56a
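As a quick illustration, Canny is one function call away in OpenCV. This is just a sketch: the file name "photo.jpg" and the thresholds 100/200 are placeholder assumptions.

import cv2

# Load the image in grayscale, since edge detection works on intensity
gray = cv2.imread("photo.jpg", cv2.IMREAD_GRAYSCALE)

# Canny chains Gaussian smoothing, Sobel gradients,
# non-maximum suppression and hysteresis thresholding
edges = cv2.Canny(gray, 100, 200)

cv2.imwrite("edges.jpg", edges)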




Note that the resulting image has different dimensions than the original. The reason is that the filter cannot be fully applied at the corners and borders of the image. To avoid that, and the resulting under-representation of border pixels, we can use padding, which keeps the output size and represents edges more uniformly. Padding adds an artificial border so that input and output stay consistent in size; by convention it is filled with 0's. Most ML/DL frameworks provide padding functionality: we need to specify whether we want the "same" output size, or "valid", meaning no padding.
If you want to look further than the next pixel, use striding. The only thing to keep in mind is that striding affects the resulting output matrix size, making it smaller the bigger the stride. The helper below shows how padding and stride determine the output size.
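A small sketch of the standard bookkeeping, where n is the input size, f the filter size, p the padding and s the stride:

def conv_output_size(n, f, p=0, s=1):
    """Output size of a convolution: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1

# "valid" convolution: no padding, the output shrinks
print(conv_output_size(n=6, f=3))            # 4

# "same" convolution: pad with p = (f - 1) / 2 zeros to keep the size
print(conv_output_size(n=6, f=3, p=1))       # 6

# striding shrinks the output further
print(conv_output_size(n=7, f=3, p=0, s=2))  # 3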

Images are 3D arrays, with one channel per color. A convolution filter is 3D as well, spanning all color channels, so it can detect edges in one channel, in another, or in an aggregation of all of them. Applying several filters stacks the results into as many output channels as different convolutions applied:
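Shape-wise, each filter multiplies a patch across all input channels and produces one output channel. A sketch of that bookkeeping in NumPy, using random data just to show the shapes:

import numpy as np

image = np.random.rand(6, 6, 3)        # height x width x RGB channels
filters = np.random.rand(2, 3, 3, 3)   # 2 filters, each 3x3x3

out = np.zeros((4, 4, 2))              # (6-3+1) x (6-3+1) x num_filters
for f in range(filters.shape[0]):
    for i in range(4):
        for j in range(4):
            # each filter spans a 3x3 patch across ALL three channels
            out[i, j, f] = np.sum(image[i:i+3, j:j+3, :] * filters[f])

print(out.shape)  # (4, 4, 2): one output channel per filter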




Understanding Layers

There are layers, such as pooling layers, that instead of taking every single pixel, pick the max or the average of a neighborhood of pixels. Despite lacking a clear theoretical basis, pooling provides more robust representations and has worked fairly well in practice. And since pooling is a sort of aggregator, it reduces the input size and hence the computational cost of the next layers.
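A minimal sketch of 2x2 max pooling with stride 2, the most common setting. Note that pooling has no parameters to learn:

import numpy as np

def max_pool(img, f=2, s=2):
    out_h = (img.shape[0] - f) // s + 1
    out_w = (img.shape[1] - f) // s + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # keep only the strongest activation in each f x f window
            out[i, j] = np.max(img[i*s:i*s+f, j*s:j*s+f])
    return out

x = np.array([[1, 3, 2, 1],
              [4, 6, 5, 2],
              [7, 2, 1, 0],
              [3, 4, 8, 9]])
print(max_pool(x))
# [[6. 5.]
#  [7. 9.]]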

An example: LeNet-5

The image below shows one of the pioneering architectures, applied to a 32x32x3 image. We start with a 5x5 convolution (6 filters) that results in a 28x28x6 volume. We then perform max pooling with f = 2 and s = 2, which results in a 14x14x6 volume. We apply another 5x5 convolution (16 filters) to get 10x10x16, and another max pooling layer to finally arrive at a 5x5x16 volume. We flatten this volume into a 400-length vector.
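The shapes above are easy to verify with a few lines of Keras. This is a sketch of the convolutional part only; the original LeNet-5 used tanh/sigmoid activations and average pooling, so the ReLU/max-pooling choices here follow the course variant rather than the 1998 paper.

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(32, 32, 3)),
    layers.Conv2D(6, kernel_size=5, activation="relu"),   # -> 28x28x6
    layers.MaxPooling2D(pool_size=2, strides=2),          # -> 14x14x6
    layers.Conv2D(16, kernel_size=5, activation="relu"),  # -> 10x10x16
    layers.MaxPooling2D(pool_size=2, strides=2),          # -> 5x5x16
    layers.Flatten(),                                     # -> 400
])
model.summary()  # prints the 28x28x6 -> ... -> 400 shapes listed above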




The flattened vector can be used to classify images, to assess image similarity, and in many other applications, such as product recommender systems or as features for ML engines. Note that we can reduce the size of that vector by adding fully connected layers with fewer hidden units, compressing that info down to 120-, 84-, or even 16-length vectors.
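Continuing the sketch above, the fully connected head compresses the 400-vector step by step. The final 10-way softmax is my assumption for a digit-style classifier; swap in whatever output the task needs.

# Fully connected head on top of the 400-length vector
model.add(layers.Dense(120, activation="relu"))    # 400 -> 120
model.add(layers.Dense(84, activation="relu"))     # 120 -> 84
model.add(layers.Dense(10, activation="softmax"))  # 84 -> class scores

# The 84-length activations can also be reused as a compact image
# embedding for similarity search or recommender features.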

How do we put this whole thing together? This is more an engineering than a statistical method, and therefore experimentation is key. The experts recommend studying successful architectures and leveraging them on problems similar to ours. Here is where art becomes a critical piece of deep learning!

Why use convolutional layers instead of fully connected layers?


Note that the number of parameters to train with fully connected layers is orders of magnitude bigger than with convolutional layers. This is probably the first and main reason to prefer convolutions: a fully connected layer mapping a 32x32x3 image to a 28x28x6 output needs roughly 14M parameters (and normally we work with much bigger images, e.g. 255x255x3), versus a few hundred for the equivalent convolutional layer. Convolutions create features such as edges and face parts whose parameters are shared across multiple locations of the image, and each output depends only on a small patch of the input, which is why so few parameters suffice.
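The arithmetic is easy to check with a quick back-of-the-envelope computation:

# Fully connected: every input unit connects to every output unit
fc_params = (32 * 32 * 3) * (28 * 28 * 6)   # 3072 x 4704
print(fc_params)                            # 14450688, ~14M weights

# Convolutional: 6 filters of 5x5x3 weights plus one bias each,
# shared across every position of the image
conv_params = (5 * 5 * 3 + 1) * 6
print(conv_params)                          # 456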

Please check the following course: https://www.deeplearning.ai/program/deep-learning-specialization/
