CNNs: An Overview

Source knowledge for this subsection can be found here.

Problem Space

For image classification, a computer takes in an n x n x 3 image and decides what the image represents. For background, CNNs are modelled on neurons in the visual cortex, where certain cells are activated only by seeing certain objects or orientations, and these neurons are ultimately organized in a columnar architecture. This idea of specialized components inside a system having specific tasks is the basis behind CNNs.

Structure

Take the input (e.g. an image) and pass it through a series of convolutional, nonlinear, pooling (downsampling), and fully connected layers to get an output. The output can be a single class or a probability distribution over classes that best describes the image.
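
The layer sequence above can be sketched end to end in a few lines of numpy. All shapes and layer sizes here are made up for illustration (one 3 x 3 x 3 filter on an 8 x 8 x 3 input), not taken from the text:

```python
import numpy as np

# Hypothetical minimal forward pass: conv -> ReLU -> 2x2 max pool
# -> fully connected -> softmax. Shapes are illustrative only.
rng = np.random.default_rng(0)

x = rng.standard_normal((8, 8, 3))          # input "image": 8 x 8 x 3
w = rng.standard_normal((3, 3, 3))          # one 3 x 3 x 3 conv filter

# Convolution (no padding, stride 1): output is 6 x 6
conv = np.array([[np.sum(x[i:i+3, j:j+3, :] * w)
                  for j in range(6)] for i in range(6)])

relu = np.maximum(conv, 0)                  # nonlinearity

# 2 x 2 max pooling with stride 2: output is 3 x 3
pool = relu.reshape(3, 2, 3, 2).max(axis=(1, 3))

# Fully connected layer mapping the flattened features to 10 class scores
w_fc = rng.standard_normal((pool.size, 10))
scores = pool.reshape(-1) @ w_fc

probs = np.exp(scores - scores.max())
probs /= probs.sum()                        # softmax: one probability per class
print(probs.shape)                          # (10,)
```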

First Layer

Let’s say now we use two 5 x 5 x 3 filters instead of one. Then our output volume would be 28 x 28 x 2. By using more filters, we are able to preserve the spatial dimensions better.

The more filters, the greater the depth of the activation map, and the more information we have about the input volume. As you go through the network and go through more conv layers, you get activation maps that represent more and more complex features.
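
The two-filter example above can be checked directly: assuming a 32 x 32 x 3 input (the sizes implied by the 28 x 28 x 2 output), each 5 x 5 x 3 filter produces its own 28 x 28 activation map, and the maps stack along the depth:

```python
import numpy as np

# Two 5 x 5 x 3 filters applied to a 32 x 32 x 3 input with stride 1
# and no padding yield a 28 x 28 x 2 output volume.
rng = np.random.default_rng(1)
image = rng.standard_normal((32, 32, 3))
filters = rng.standard_normal((2, 5, 5, 3))   # two 5 x 5 x 3 filters

out = np.zeros((28, 28, 2))
for f in range(2):
    for i in range(28):
        for j in range(28):
            # dot product of the filter with the receptive field it covers
            out[i, j, f] = np.sum(image[i:i+5, j:j+5, :] * filters[f])

print(out.shape)   # (28, 28, 2): one activation map per filter
```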

A fully connected layer at the end of the network outputs an N-dimensional vector, where N is the number of classes the program has to choose from.

  • E.g. for a digit classification program, N would be 10 since there are 10 digits. Each number in this N-dimensional vector represents the probability of a certain class. For example, if the resulting vector for a digit classification program is [0 .1 .1 .75 0 0 0 0 0 .05], then there is a 75% chance the image is a 3.
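
Reading off the prediction from that vector is just an argmax over the class probabilities:

```python
import numpy as np

# The 10-dimensional output vector from the digit example: index 3 holds
# the highest probability, so the network's prediction is the digit 3.
probs = np.array([0, .1, .1, .75, 0, 0, 0, 0, 0, .05])
prediction = int(np.argmax(probs))
print(prediction)        # 3, with probability 0.75
```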

Training

The process of forward pass, loss function, backward pass, and parameter update is one training iteration. The program will repeat this process for a fixed number of iterations for each set of training images (commonly called a batch). Once you finish the parameter update on the last training example, hopefully the network should be trained well enough so that the weights of the layers are tuned correctly.

Finally, to see whether or not our CNN works, we have a different set of images and labels (can’t double dip between training and test!) and pass the images through the CNN. We compare the outputs to the ground truth and see if our network works!
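
The four-step training iteration (forward pass, loss function, backward pass, parameter update) can be sketched on a toy linear softmax classifier instead of a full CNN; the structure of each iteration is the same, only the model is simpler. The data, sizes, and learning rate here are all made up:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((20, 5))            # a "batch" of 20 training examples
y = rng.integers(0, 3, size=20)             # ground-truth labels, 3 classes
W = np.zeros((5, 3))                        # the weights to be tuned
lr = 0.1                                    # learning rate

def forward_and_loss(W):
    scores = X @ W                          # forward pass
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    p = e / e.sum(axis=1, keepdims=True)    # softmax probabilities
    loss = -np.log(p[np.arange(20), y]).mean()   # cross-entropy loss
    return p, loss

losses = []
for _ in range(50):                          # fixed number of iterations
    p, loss = forward_and_loss(W)            # forward pass + loss function
    losses.append(loss)
    grad = p.copy()
    grad[np.arange(20), y] -= 1              # backward pass (softmax gradient)
    W -= lr * (X.T @ grad) / 20              # parameter update

print(losses[0] > losses[-1])    # loss goes down as the weights get tuned
```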

Hyperparameters

  • Filter size

  • Stride: the amount by which the filter shifts, controlling how the filter convolves around the input volume. Stride is normally set so that the output volume size is an integer and not a fraction.

  • Padding: add a border of extra values (usually zeros) around the input, e.g. to keep the output volume the same size as the input volume.

The output size is given by

O = (W − K + 2P) / S + 1

where O is the output height/length, W is the input height/length, K is the filter size, P is the padding, and S is the stride. One way to think about choosing the hyperparameters is to find the right combination that creates abstractions of the image at the proper scale.
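
The output-size formula O = (W − K + 2P) / S + 1 as a one-line function, checked against the earlier example (a 32-pixel input with a 5 x 5 filter, no padding, stride 1 gives a 28-pixel output):

```python
def output_size(W, K, P, S):
    """Spatial output size of a conv layer: (W - K + 2P) / S + 1."""
    return (W - K + 2 * P) // S + 1

print(output_size(32, 5, 0, 1))  # 28, matching the 28 x 28 x 2 example
print(output_size(32, 5, 2, 1))  # 32: a padding of 2 preserves the input size
```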

Pooling Layers

The most common choice is max pooling, which keeps only the largest activation in each window; other options are average pooling and L2-norm pooling. The intuitive reasoning behind this layer is that once we know a specific feature is in the original input volume (there will be a high activation value), its exact location is not as important as its relative location to the other features.

  • This serves two main purposes. The first is that the number of activations is reduced by 75% (for a 2 x 2 filter with stride 2), lessening the computation cost. The second is that it helps control overfitting.
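
A 2 x 2 max pool with stride 2 on a small activation map makes the 75% reduction concrete: 16 values become 4, one max per 2 x 2 block.

```python
import numpy as np

# 2 x 2 max pooling with stride 2 on a 4 x 4 activation map.
a = np.array([[1, 3, 2, 1],
              [4, 6, 5, 7],
              [0, 2, 9, 8],
              [1, 1, 3, 4]])
pooled = a.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)                 # [[6 7], [2 9]] -- the max of each 2 x 2 block
print(pooled.size / a.size)   # 0.25: a 75% reduction in values
```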

Dropout Layers

This layer “drops out” a random set of activations in that layer by setting them to zero, which forces the network to be redundant: it should be able to provide the right classification or output for a specific example even if some of the activations are dropped out. This makes sure the network isn’t getting too “fitted” to the training data and thus helps alleviate the overfitting problem. An important note is that this layer is only used during training, not at test time.
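
A minimal dropout sketch: a random mask zeroes roughly half the activations. This uses the common "inverted dropout" variant (an assumption, not stated in the text), which scales the survivors by 1/keep_prob so the expected activation stays the same and nothing needs to change at test time.

```python
import numpy as np

rng = np.random.default_rng(3)
activations = rng.standard_normal(1000)   # a layer's activations (made-up size)
keep_prob = 0.5                           # probability an activation survives

mask = rng.random(1000) < keep_prob
# Zero the dropped activations; rescale survivors ("inverted dropout")
dropped = np.where(mask, activations / keep_prob, 0.0)
print((dropped == 0).mean())   # roughly half the activations are zeroed
```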

Network in Network Layers

A network in network layer refers to a conv layer where a 1 x 1 filter is used. These 1 x 1 convolutions span a certain depth, so we can think of each as a 1 x 1 x N convolution, where N is the depth of the input volume into the layer. Effectively, this layer performs an N-dimensional element-wise multiplication (a dot product across the depth) at each spatial position.
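
Because each 1 x 1 filter is just a dot product across the depth, a whole bank of them reduces to a single matrix multiply. The sizes here (a 28 x 28 x 64 volume, 32 filters) are made up for illustration:

```python
import numpy as np

# A 1 x 1 convolution over an H x W x N volume: at every spatial position
# it takes a dot product across the depth, so M filters are just a matrix
# multiply mapping depth N to depth M.
rng = np.random.default_rng(4)
volume = rng.standard_normal((28, 28, 64))   # input volume, depth N = 64
filters = rng.standard_normal((64, 32))      # 32 filters of size 1 x 1 x 64

out = volume @ filters                       # per-position dot product
print(out.shape)   # (28, 28, 32): spatial size unchanged, depth remapped
```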

Output: Classification, Localization, Detection, Segmentation

  • image classification - taking an input image and outputting a class number out of a set of categories.

  • object localization - produce a class label + bounding box that describes where the object is in the picture.

  • object detection - where localization needs to be done on all of the objects in the image.

  • object segmentation - output a class label as well as an outline of every object in the input image.
