CNNs: An Overview
Source knowledge for this subsection can be found here.
For image classification, a computer takes in an n x n x 3 image and decides what the image represents. For background, CNNs are modelled after neurons in the visual cortex, where certain cells are activated by seeing certain objects or orientations, and these neurons are ultimately organized in a columnar architecture -- this idea of specialized components inside of a system having specific tasks is the basis behind CNNs.
Take the input (e.g. an image), pass it through a series of convolutional, nonlinear, pooling (downsampling), and fully connected layers, and get an output (either a single class or a probability over classes that best describes the image).
The first layer is always a convolutional layer. Imagine a flashlight sliding across all the areas of the input image (e.g. 32 x 32 x 3). In machine learning terms, this flashlight is called a filter (sometimes referred to as a neuron or a kernel) and the region it is shining over is called the receptive field. The filter is also an array of numbers (the numbers are called weights or parameters). A very important note is that the depth of this filter has to be the same as the depth of the input, so a 5 x 5 filter on this image is really 5 x 5 x 3. As the filter slides, or convolves, around the input image, it multiplies the values in the filter with the original pixel values of the image (aka computes element-wise multiplications), which are then summed to get a single number. We repeat this process for every location on the input volume, and every unique location produces a number. What we are left with is a 28 x 28 x 1 array of numbers (32 - 5 + 1 = 28 positions in each direction), which we call an activation map or feature map.
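To make the sliding-filter picture concrete, here is a minimal NumPy sketch of one filter convolving over one input volume. The shapes follow the 32 x 32 x 3 example above; the random values are stand-ins for real pixels and learned weights:

```python
import numpy as np

image = np.random.rand(32, 32, 3)  # input volume
filt = np.random.rand(5, 5, 3)     # filter depth matches input depth

out = np.zeros((28, 28))           # 32 - 5 + 1 = 28
for i in range(28):
    for j in range(28):
        # receptive field: the region the "flashlight" is shining over
        patch = image[i:i+5, j:j+5, :]
        # element-wise multiply, then sum to a single number
        out[i, j] = np.sum(patch * filt)

print(out.shape)  # (28, 28) -- one activation (feature) map
```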
Let’s say now we use two 5 x 5 x 3 filters instead of one. Then our output volume would be 28 x 28 x 2. By using more filters, we are able to preserve the spatial dimensions better.
Each of these filters can be thought of as feature identifiers (e.g. straight edges, simple colors, curves).
The more filters, the greater the depth of the activation map, and the more information we have about the input volume. As you go deeper into the network through more conv layers, you get activation maps that represent more and more complex features.
A classic CNN architecture stacks these layers: input → conv → ReLU → pool, repeated a few times, followed by one or more fully connected layers at the end.
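As a sketch, the same stack can be written down in a few lines of PyTorch; the filter counts and sizes below are illustrative assumptions, not values from the text:

```python
import torch
import torch.nn as nn

# conv -> ReLU -> pool, twice, then a fully connected layer
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=5),   # 32 x 32 x 3 -> 28 x 28 x 16
    nn.ReLU(),
    nn.MaxPool2d(2),                   # -> 14 x 14 x 16
    nn.Conv2d(16, 32, kernel_size=5),  # -> 10 x 10 x 32
    nn.ReLU(),
    nn.MaxPool2d(2),                   # -> 5 x 5 x 32
    nn.Flatten(),
    nn.Linear(32 * 5 * 5, 10),         # fully connected: 10 class scores
)
print(model(torch.randn(1, 3, 32, 32)).shape)  # torch.Size([1, 10])
```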
A fully connected layer at the end of the network outputs an N-dimensional vector, where N is the number of classes that the program has to choose from.
E.g. if you wanted a digit classification program, N would be 10 since there are 10 digits. Each number in this N-dimensional vector represents the probability of a certain class. For example, if the resulting vector for a digit classification program is [0 .1 .1 .75 0 0 0 0 0 .05], there is a 75% chance the image is a 3 (along with a 10% chance it is a 1, 10% a 2, and 5% a 9).
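Reading the prediction off such a vector is just a matter of taking the largest entry:

```python
import numpy as np

probs = np.array([0, .1, .1, .75, 0, 0, 0, 0, 0, .05])
print(np.argmax(probs))  # 3 -- index of the largest probability
print(probs.max())       # 0.75 -> 75% chance the image is a 3
```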
The process of forward pass, loss function, backward pass, and parameter update is one training iteration. The program repeats this process for a fixed number of iterations on each set of training images (commonly called a batch). Once you finish the parameter update on the last training example, hopefully the network is trained well enough that the weights of the layers are tuned correctly.
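In a framework like PyTorch, one training iteration maps directly onto these four steps. This is a minimal sketch; the tiny model and random batch are stand-ins for illustration, not values from the text:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

images = torch.randn(8, 3, 32, 32)   # a batch of 8 images
labels = torch.randint(0, 10, (8,))  # ground-truth classes

optimizer.zero_grad()                # clear old gradients
outputs = model(images)              # forward pass
loss = criterion(outputs, labels)    # loss function
loss.backward()                      # backward pass
optimizer.step()                     # parameter update
print(loss.item())
```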
Finally, to see whether or not our CNN works, we take a different set of images and labels (can't double dip between training and test!), pass the images through the CNN, and compare the outputs to the ground truth.
Filter size: the height and width of the filter (e.g. 5 x 5). Along with stride and padding, this is one of the hyperparameters that determines the size of the output volume.
Stride: the amount by which the filter shifts, which controls how the filter convolves around the input volume. Stride is normally set so that the output volume is an integer and not a fraction.
Padding: adds a border of extra numbers (usually zeros) around the input, e.g. to keep the output volume the same size as the input volume.
The formula for calculating the output size for any given conv layer is

O = (W - K + 2P) / S + 1

where O is the output height/length, W is the input height/length, K is the filter size, P is the padding, and S is the stride. One way to think about how to choose the hyperparameters is to find the right combination that creates abstractions of the image at a proper scale.
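The formula is easy to sanity-check in code:

```python
def conv_output_size(W, K, P, S):
    """O = (W - K + 2P) / S + 1"""
    return (W - K + 2 * P) // S + 1

print(conv_output_size(32, 5, 0, 1))  # 28: the earlier 32x32 / 5x5 example
print(conv_output_size(32, 5, 2, 1))  # 32: padding of 2 preserves the size
```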
After some ReLU layers, programmers may choose to apply a pooling layer. The most popular at the moment is maxpooling, which takes a filter (normally of size 2 x 2) and a stride of the same length, applies it to the input volume, and outputs the maximum number in every subregion that the filter convolves around. Other options for pooling layers are average pooling and L2-norm pooling. The intuitive reasoning behind this layer is that once we know that a specific feature is in the original input volume (there will be a high activation value), its exact location is not as important as its relative location to the other features.
This serves two main purposes. The first is that the number of parameters or weights is reduced by 75%, which lessens the computation cost. The second is that it helps control overfitting.
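A minimal NumPy sketch of 2 x 2 maxpooling with stride 2 (the 28 x 28 input slice is an arbitrary assumption):

```python
import numpy as np

# Each 2 x 2 subregion keeps only its maximum: 4 values become 1,
# the 75% reduction mentioned above.
x = np.random.rand(28, 28)
pooled = x.reshape(14, 2, 14, 2).max(axis=(1, 3))
print(pooled.shape)  # (14, 14)
```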
A dropout layer "drops out" a random set of activations in that layer by setting them to zero. This forces the network to be redundant: the network should be able to provide the right classification or output for a specific example even if some of the activations are dropped out. It makes sure that the network isn't getting too "fitted" to the training data and thus helps alleviate the overfitting problem. An important note is that this layer is only used during training.
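A common way to implement this is "inverted" dropout, sketched below; the rescaling by 1/(1-p) is an implementation detail assumed here so that no change is needed at test time:

```python
import numpy as np

def dropout(activations, p=0.5, training=True):
    if not training:
        return activations  # only used during training
    # zero out a random set of activations, rescale the survivors
    mask = np.random.rand(*activations.shape) >= p
    return activations * mask / (1 - p)

print(dropout(np.ones((2, 4))))  # roughly half the entries become 0
```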
A network in network layer refers to a conv layer where a 1 x 1 filter is used. These 1 x 1 convolutions span a certain depth, so we can think of them as 1 x 1 x N convolutions where N is the number of filters applied in the layer. Effectively, this layer performs an N-D element-wise multiplication where N is the depth of the input volume into the layer.
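A small NumPy sketch of this view: at every spatial position, a 1 x 1 convolution mixes the input channels with a small weight matrix, a per-pixel linear map across depth (the depths 64 and 16 are illustrative assumptions):

```python
import numpy as np

x = np.random.rand(28, 28, 64)        # input volume, depth 64
w = np.random.rand(64, 16)            # 16 filters of size 1 x 1 x 64
out = np.einsum('hwc,cf->hwf', x, w)  # multiply across depth, then sum
print(out.shape)                      # (28, 28, 16)
```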
CNNs are applied to a family of related vision tasks:
image classification - taking an input image and outputting a class number out of a set of categories.
object localization - producing a class label as well as a bounding box that describes where the object is in the picture.
object detection - localization performed on all of the objects in the image.
object segmentation - outputting a class label as well as an outline of every object in the input image.