Additions to convolutional ANN
From the notebook I had (part of this is already in the existing page, you need to check what isn't):
Convolutional neural networks (CNNs) are networks containing some layers (convolutional layers) which are not fully connected, that is, where not every neuron is connected to every neuron in the previous layer, and which apply a convolution operation to the input data. A typical CNN architecture is made of a base of convolutional layers plus some other types of layers.
These types of networks are particularly suited for working on image data, as you can map regions of the image to specific neurons. Their architecture is specifically designed to deal with image input, and in fact neurons are arranged to follow the geometry of image data: there are the width and height dimensions, plus one for the colour channels.
The reason to use convolutional networks (whose specifics follow here) to deal with images is that "regular" fully connected nets would not replicate the geometry of the input data and, more importantly, would need too many parameters to be trained successfully in a typical case. Also, we know that artificial neural networks have been inspired by biology; CNNs in particular have been conceived out of an inspiration from the visual cortex, where different neurons respond to different regions of the input space (see receptive fields below) and visual neurons are organised in a matrix format; see the [experiments] carried out by Hubel and Wiesel in the '50s and '60s, which demonstrated this.
CNNs learn to recognise successive hierarchical levels of shapes in the image, eventually managing to distinguish between, say, the image of a dog and that of a cat.
CNNs' architectures are built as sequences of a convolutional layer followed by a pooling layer (see below), plus a fully connected layer at the end. There are variations as to what goes in between, that is, how many convolutional layers are stacked and how they are alternated with other types of layers. The final fully connected layer is the one responsible for the final result.
Snippets of history
For a bit of history, have a read of the seminal [paper] by LeCun et al. and watch this video about the first CNN (LeNet, by LeCun) trained to recognise handwritten digits back in 1993, it's quite funny. This network was used by the USA postal system to automatically read ZIP codes, in the 1990s!
Also, this page collects a chronology of all the networks built to classify images on various standard datasets, such as MNIST or CIFAR. The reality is, since deep learning became a thing at scale starting from the middle of the 2000s, we are witnessing a breakthrough in the history of science and technology in general, because these days CNNs are allowing for the realisation of tasks making for a new "summer of AI", which many people more qualified than me believe is this time here to stay.
How do CNNs work
Convolution and local receptive fields
The convolution operation that these types of networks apply to the input data consists in the fact that each neuron has the task of dealing with a specific region of the input. In a typical convolutional structure, input neurons take the pixels of the image and hidden neurons are such that each of them only communicates with a region of the input image. These regions shift when you move from one neuron to the next; this is the concept of local receptive fields, brilliantly illustrated in chapter 6 of Nielsen's [book], where the following images are taken from. Specifically, these images illustrate the concept via the example of a $28 \times 28$ input image (the classic example of handwritten digit recognition from the MNIST dataset) and a $24 \times 24$ first hidden layer where each neuron deals with a $5 \times 5$ region (its receptive field).


Passing over the input image in shifting local regions means building the combination of input pixels and weights via a filter, or kernel, which is a matrix of weights the size of the receptive field. The filter is applied to the receptive field of image pixels to build the linear combination, which will then be the argument of the activation function as per usual. The linear combination is built element-wise: each pixel is multiplied by its corresponding filter value and then all the products are summed up. This whole procedure is the essence of convolution; the name refers to the shifting of the filter one pixel at a time. The output of a convolutional layer is called a feature map.
This means that for an image of shape $n \times n$ (imagine this is a matrix, so the image is black and white; in the case of a coloured one you'd have a third dimension for the colour space), and using a kernel of shape $m \times m$, with $m < n$, the output of the application of the filter has shape $(n - m + 1) \times (n - m + 1)$, so dimensionality is reduced.
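To make the multiply-and-sum over shifting receptive fields concrete, here is a minimal NumPy sketch (assuming a square greyscale image and a square kernel, stride 1, no padding; the function name `conv2d_valid` is just illustrative). Note that, like most deep learning libraries, it actually computes a cross-correlation, i.e., the kernel is not flipped.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image one pixel at a time and take the
    element-wise product-and-sum at each position (stride 1, no padding)."""
    n, m = image.shape[0], kernel.shape[0]
    out_size = n - m + 1
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            patch = image[i:i + m, j:j + m]     # local receptive field
            out[i, j] = np.sum(patch * kernel)  # element-wise multiply, then sum
    return out

image = np.random.rand(28, 28)   # an MNIST-sized greyscale image
kernel = np.random.rand(5, 5)    # a 5x5 filter, as in the example above
print(conv2d_valid(image, kernel).shape)   # (24, 24) = (28 - 5 + 1, 28 - 5 + 1)
```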
We have illustrated the concept here with the assumption that the kernel shifts by one pixel every time. That's the typical situation, but there can be situations where you want the kernel to shift by a higher number of pixels, a quantity that is called the stride. The other parameter is the padding. We've seen that the application of a convolutional layer reduces the dimensionality of the data; if this is best avoided you can set some padding of a chosen size, that is, add some 0 pixels around the input matrix borders so that effectively you're passing a bigger image. See these excellent blogs [6] on this.
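Putting stride and padding together, the output size follows the standard formula $\lfloor (n - m + 2p)/s \rfloor + 1$; a quick sketch to check a few cases:

```python
def conv_output_size(n, m, stride=1, padding=0):
    """Spatial size of the feature map for an n x n input, m x m kernel,
    a given stride and symmetric zero-padding."""
    return (n - m + 2 * padding) // stride + 1

print(conv_output_size(28, 5))              # 24: the stride-1, no-padding case above
print(conv_output_size(28, 5, stride=2))    # 12: shifting the kernel two pixels at a time
print(conv_output_size(28, 5, padding=2))   # 28: padding keeps the input size
```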
What is the effect of convolution on the input image: shape detection
With the application of convolution, the network learns features of the input image. Features are the shapes in it. If you have a shape in an image, say a sequence of pixels forming the shape of a curve, by applying the filter to it and via the sum of the multiplications it performs on the image, you get large numbers when the weights in the filter sit in the same locations as the curve beneath: the filter activates. If on the other hand the applied filter does not find a match between its weights and the pixels beneath, the value of the sum of multiplications will be small. This is how the network recognises shapes.
The feature map in output will contain these values and will then tell you which areas correspond to the passed filter: the filter is acting as a shape detector.
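A tiny numerical illustration of the filter-as-shape-detector idea (the filter and patch values here are hand-picked for the example, not learned):

```python
import numpy as np

# A simple vertical-edge filter: positive weights on the left, negative on the right.
vertical_edge = np.array([[1., 0., -1.],
                          [1., 0., -1.],
                          [1., 0., -1.]])

# A patch that does contain a vertical edge (bright left side, dark right side) ...
edge_patch = np.array([[1., 1., 0.],
                       [1., 1., 0.],
                       [1., 1., 0.]])

# ... and a flat patch with no structure at all.
flat_patch = np.full((3, 3), 0.5)

print(np.sum(edge_patch * vertical_edge))   # 3.0 -> the filter "activates"
print(np.sum(flat_patch * vertical_edge))   # 0.0 -> no match, small response
```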
In a typical case, you'd apply multiple filters to the input image, so as to recognise different kinds of shapes: one for straight vertical lines, one for curves of a certain curvature, one for horizontal lines, etc. The feature map in output is a composite object with a dimension (depth) for the filters applied, so that each slice of it is the result of the application of one of the filters. The number of filters applied, i.e., the feature map's depth, constitutes its channels.
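A sketch of what applying a bank of filters looks like, using SciPy's `correlate2d` for the sliding multiply-and-sum (the filter values are random here, just to show the shapes involved):

```python
import numpy as np
from scipy.signal import correlate2d   # the sliding multiply-and-sum used above

image = np.random.rand(28, 28)

# A small bank of 5x5 filters, each meant to detect a different kind of shape.
filters = [np.random.rand(5, 5) for _ in range(8)]

# One feature-map slice per filter, stacked along the channel (depth) axis.
feature_map = np.stack([correlate2d(image, f, mode='valid') for f in filters], axis=-1)
print(feature_map.shape)   # (24, 24, 8): width, height, and one channel per filter
```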
Learning hierarchically
Once you have the feature map from a convolutional layer, whose shape includes the channels for the filters you have applied, the network will have learned the shapes determined by those filters. In a usual structure you'd have a sequence of convolutional layers because you want to make the network learn more and more complicated shapes. This is the reason behind building deep convolutional networks.
The first convolutional layer learns simple features, like straight lines and curves. The feature map that comes out of it gets passed as input to the second convolutional layer, which then takes these learned simple features (simple shapes) and does the same procedure, hence learning more complex shapes as a result. From little straight and curved lines to broader shapes. The third layer builds on top of these learned shapes, and so on. Effectively, at each convolutional step (layer) you are training the network to recognise things in the image that are hierarchically more complicated, till the end when it will have learned to recognise a dog, say, as the result of its many many shapes, from small to broad.
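As a toy illustration of this stacking (single-channel and with random kernels, so nothing is actually being learned here), the feature map of the first layer simply becomes the input of the second:

```python
import numpy as np
from scipy.signal import correlate2d

image = np.random.rand(28, 28)
kernel_1 = np.random.rand(5, 5)   # filter of the first convolutional layer
kernel_2 = np.random.rand(5, 5)   # filter of the second convolutional layer

relu = lambda x: np.maximum(x, 0)   # the usual activation applied in between

layer_1 = relu(correlate2d(image, kernel_1, mode='valid'))     # (24, 24): simple shapes
layer_2 = relu(correlate2d(layer_1, kernel_2, mode='valid'))   # (20, 20): shapes built out of shapes
print(layer_1.shape, layer_2.shape)
```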

Shared weights (and biases)
In a CNN, the parameters in a filter (weights/biases) are shared by all neurons in the same hidden layer, meaning that the kernel is the same for all neurons in the layer. So each depth slice of the feature map will have been built using the same weights.
This is a way to prevent the number of learned parameters from exploding. Also, it makes it so that shapes are learned independently of where they are located in the image.
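A quick back-of-the-envelope comparison, using the $28 \times 28$ input and $24 \times 24$ hidden layer from the example above:

```python
# Parameters needed to connect a 28x28 input to a 24x24 hidden layer.

# Fully connected: every hidden neuron has its own weight for every input pixel.
fully_connected = 28 * 28 * 24 * 24 + 24 * 24   # 452,160 weights and biases

# Convolutional with shared weights: one 5x5 kernel and one bias for the whole layer.
convolutional = 5 * 5 + 1                       # 26 parameters per filter

print(fully_connected, convolutional)
```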
The typical structure of a CNN
The typical structure of a CNN encompasses some initial convolutional layers, plus some other layers depending on the problem it's set to solve. Besides, in a typical setup convolutional layers are alternated with pooling layers to reduce dimensionality.
Pooling layers
Pooling layers have the function of reducing the information/dimensionality that the preceding convolutional layer outputs, and they can be of different sorts. In a max-pooling layer with a $2 \times 2$ receptive field for instance (a usual suspect), the output of the convolutional layer is shrunk into a smaller matrix where each $2 \times 2$ region of pixels is reduced to one pixel with their maximum.
Pooling is applied to the feature map. This procedure greatly reduces dimensionality, hence the number of parameters, keeping only the relevant information and discarding precision.
Max-pooling is a very common type of pooling but there are other ones, like average-pooling, where the average gets spit out instead of the maximum, or $L_2$-pooling, where the square root of the sum of squares is spit out.
Through a pooling layer, an input $n \times n$ with a pooling kernel $2 \times 2$ will lead to an output $n/2 \times n/2$.
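A minimal NumPy sketch of $2 \times 2$ max-pooling (assuming an even-sided input; the function name is illustrative):

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Keep the maximum of each non-overlapping 2x2 block (assumes even-sided input)."""
    h, w = feature_map.shape
    return feature_map.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.random.rand(24, 24)     # e.g. the output of the convolutional layer above
print(max_pool_2x2(fmap).shape)   # (12, 12)
```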
Dropout layers
Dropout layers have the function of controlling for overfitting by silencing a chosen percentage of the neurons, choosing which ones at random. They're typically placed as the very last layers in the network. The technique is employed to ensure more robustness in the learning, as dropping out some neurons destroys potential correlations in what the neurons learn, effectively controlling for the network adapting to noise.
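A sketch of what a dropout layer does at training time (this is the "inverted" dropout variant, where the surviving activations are rescaled so the expected value stays the same; the rate of 0.5 is just an example):

```python
import numpy as np

def dropout(activations, rate=0.5):
    """Silence a random fraction `rate` of the neurons; scale up the survivors
    so the expected activation is unchanged. Applied only during training."""
    mask = np.random.rand(*activations.shape) >= rate
    return activations * mask / (1.0 - rate)

layer_output = np.random.rand(10)
print(dropout(layer_output))   # roughly half the entries are zeroed out
```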
Other layers
Other layers you attach at the top of the network (after the convolutional base) would be those customised for the problem you want to solve. For instance, in the case of a classification with 3 classes, you'd put a fully connected layer that takes the output of the convolutional base and spits out a 3-dimensional output representing the probabilities of classification in each of the classes.
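Putting the pieces together, here is an illustrative Keras sketch (assuming TensorFlow/Keras is available; the layer sizes and filter counts are arbitrary choices for the example) of such a typical structure for a 3-class problem:

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (5, 5), activation='relu', input_shape=(28, 28, 1)),  # convolutional layer, 32 filters
    layers.MaxPooling2D((2, 2)),                                            # pooling layer
    layers.Conv2D(64, (5, 5), activation='relu'),                           # deeper convolutional layer
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dropout(0.5),                                                    # dropout to control overfitting
    layers.Dense(3, activation='softmax'),                                  # 3-class probabilities
])
model.summary()
```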
References
- M Nielsen, Neural Networks and Deep Learning, chapter 6
- Convolutional Neural Networks for Visual Recognition, Stanford CS class by A Karpathy and F F Li
- Y LeCun, L Bottou, Y Bengio, P Haffner, Gradient-based learning applied to document recognition, Proc. of the IEEE, 1998
- A page on the experiments by Hubel and Wiesel
- A beginner's guide to understanding convolutional neural networks, part 1 and part 2, blogs by A Deshpande
- An interactive demo of convolution