Deep Learning Fundamentals
The sources for this section are Commonlounge and one of Andrej's online lectures.
Deep learning is a subfield of machine learning. However, what I was talking about before was traditional machine learning (built on foundational linear relationships). The main difference between deep learning and traditional machine learning is that deep learning models have a notion of multiple layers, or multiple levels of hierarchy, which opens up the possibility of learning models for more complicated tasks.
Deep learning architectures are designed with multiple layers, with the intuition that lower layers will automatically learn to model lower levels of abstraction and higher layers higher levels (e.g. computer vision classification of a cat: starting from pixels and going all the way up to the animal). Another good example of this type of compositionality: books are made of chapters, chapters are made of paragraphs, paragraphs are made of sentences, sentences are made of words, and words are made of characters.
One of the biggest issues in traditional ML is feature extraction -- in DL, we can think of the lower layers as performing automatic feature extraction, requiring little guidance from the programmer.
Each neuron has a set of inputs, each of which is given a specific weight. The neuron computes some function on these weighted inputs. A linear neuron takes a linear combination of the weighted inputs. A sigmoidal neuron feeds the weighted sum of the inputs into the logistic function, which results in a value between 0 and 1.
When the weighted sum is very negative, the return value is very close to 0. When the weighted sum is very large and positive, the return value is very close to 1. The logistic function is important because it introduces a non-linearity, and this is important to enable the neural network to learn more complex models. In the absence of these non-linear functions (called activation functions), the entire neural network would be a linear function, and the layers would not be useful.
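To make this concrete, here is a minimal sketch of a single sigmoidal neuron in NumPy (the input and weight values are made up purely for illustration):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative values: a neuron with 3 inputs
x = np.array([0.5, -1.2, 3.0])   # inputs
w = np.array([0.8, 0.1, -0.4])   # one weight per input
b = 0.0                          # bias

z = np.dot(w, x) + b   # weighted sum of the inputs
a = sigmoid(z)         # activation, a value between 0 and 1
print(a)               # ~0.28 here; very negative z -> ~0, very positive z -> ~1
```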
Regardless of the activation function you choose, we begin building the network when we start connecting the input data to the neurons, neurons to each other, and neurons to the output layer. A really simple structure follows below!
The layers of neurons that lie sandwiched between the first layer of neurons (the input layer) and the last layer of neurons (the output layer) are called hidden layers. This is where most of the magic is happening when the neural net tries to solve problems. Taking a closer look at the activities of hidden layers can tell us a lot about the features the network has learned to extract from the data.
Also, note that a neuron is not required to have its output connected to the inputs of every neuron in the next layer. Different neural network architectures are obtained by selecting which neurons to connect to which other neurons in the next layer. The greater the number of layers, the more wiggle room in the model (the crazier the computations that can occur). So naturally, we should consider regularization. Don't use the size of the layers as your regularizer; use a stronger regularizer instead (one that pushes toward smoother, less complex functions).
Here are some additional important notes to keep in mind:
Not every layer needs to have the same number of neurons.
The inputs and outputs are vectorized representations.
For example, you might imagine a neural network where the inputs are the individual pixel RGB values in an image represented as a vector. The last layer might have 2 neurons which correspond to the answer to our problem: [0,1] if the image contains a dog, [1,0] if the image contains a cat, [0,0] if it contains neither, and [1,1] if it contains both.
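Here is a sketch of that dog/cat example as a single forward pass (the image size, layer sizes, and random weights are all hypothetical; a real network would be trained, not random):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical 4x4 RGB image, flattened into one input vector (4*4*3 = 48 values)
rng = np.random.default_rng(0)
image = rng.random((4, 4, 3))
x = image.reshape(-1)                  # shape: (48,)

# One hidden layer with 10 neurons, an output layer with 2 neurons
W1, b1 = rng.normal(size=(10, 48)), np.zeros(10)
W2, b2 = rng.normal(size=(2, 10)), np.zeros(2)

h = sigmoid(W1 @ x + b1)               # hidden activations
y = sigmoid(W2 @ h + b2)               # 2 outputs, each in (0, 1)
prediction = (y > 0.5).astype(int)     # e.g. [0, 1] = dog, [1, 0] = cat
print(prediction)
```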
We'll dive more into popular architectures in the next section!
You can use any optimization method you'd like, but a popular method to note is mini-batch SGD, an iterative loop like so (see the sketch after this list):
Sample a batch of data
Forward prop it through the graph, get the loss
Backprop to calculate the gradients
Update the parameters using the gradient
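A minimal sketch of that loop in NumPy -- the `forward` and `backward` functions are placeholders for whatever model you plug in, not a specific library's API:

```python
import numpy as np

def sgd_train(X, Y, params, forward, backward, lr=1e-2, batch_size=32, steps=1000):
    """Schematic mini-batch SGD. `forward` returns (loss, cache);
    `backward` returns a dict of gradients with the same keys as `params`."""
    n = X.shape[0]
    for step in range(steps):
        # 1. Sample a batch of data
        idx = np.random.choice(n, batch_size, replace=False)
        x_batch, y_batch = X[idx], Y[idx]
        # 2. Forward prop it through the graph, get the loss
        loss, cache = forward(params, x_batch, y_batch)
        # 3. Backprop to calculate the gradients
        grads = backward(params, cache)
        # 4. Update the parameters using the gradient
        for k in params:
            params[k] -= lr * grads[k]
        if step % 100 == 0:
            print(step, loss)
    return params
```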
It tends to be good practice to normalize your data (image data is often the exception, since pixel values already share a common scale). If you decide to do this, here are a few methods (sketched in code below):
zero-center data: subtract the mean
PCA: decorrelate the data (so it has a diagonal covariance matrix)
whitening: scale the decorrelated data so its covariance matrix is the identity matrix
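A quick sketch of all three in NumPy (the random data is just for illustration):

```python
import numpy as np

X = np.random.rand(100, 3)         # illustrative data: 100 samples, 3 features

# Zero-center: subtract the per-feature mean
X_centered = X - X.mean(axis=0)

# PCA: rotate the data onto the eigenvectors of its covariance matrix,
# so the transformed data has a diagonal covariance matrix
cov = np.cov(X_centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
X_decorrelated = X_centered @ eigvecs

# Whitening: additionally divide each dimension by sqrt(eigenvalue),
# so the covariance matrix becomes the identity
X_whitened = X_decorrelated / np.sqrt(eigvals + 1e-8)
```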
To train a neural network, we use the iterative gradient descent method. That is, we start with a random initialization of the weights, then repeatedly make predictions on some subset of the data (the forward pass), calculate the corresponding cost function C, and update each weight w by an amount proportional to dC/dw, i.e. the derivative of the cost function w.r.t. that weight. The proportionality constant is known as the learning rate.
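In symbols, this is the standard gradient descent update for each weight:

```latex
w \leftarrow w - \eta \, \frac{\partial C}{\partial w}
```

where \eta is the learning rate.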
The gradients can be calculated efficiently using the back-propagation algorithm. The key observation of backprop is that because of the chain rule of differentiation, the gradient at each neuron in the neural network can be calculated using the gradient at the neurons it has outgoing edges to. Hence, we calculate the gradients backwards, i.e. first calculate the gradients of the output layer.
A backprop code & walk-through can be found here.
Back-propagation tends to be visualized as a computational graph:
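For instance, here is the classic toy graph f(x, y, z) = (x + y) * z, evaluated forward and then differentiated backward with the chain rule (values chosen only for illustration):

```python
# Forward pass: compute the output node by node
x, y, z = -2.0, 5.0, -4.0
q = x + y            # q = 3
f = q * z            # f = -12

# Backward pass: apply the chain rule from the output back to the inputs
df_df = 1.0              # gradient of f w.r.t. itself
df_dz = q * df_df        # d(q*z)/dz = q           -> 3
df_dq = z * df_df        # d(q*z)/dq = z           -> -4
df_dx = 1.0 * df_dq      # d(x+y)/dx = 1, chained  -> -4
df_dy = 1.0 * df_dq      # d(x+y)/dy = 1, chained  -> -4
```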
In the context of deep learning, we're still utilizing the foundational machine learning heuristic (Learning = Representation + Evaluation + Optimization) we discussed in a previous section -- the representation in this case is automatically determined by the deep learning model, strongly dictated by the deep learning framework (also called the architecture) chosen. Evaluation occurs via the cost function, whose output is obtained by feeding inputs through the computational graph/circuit (a series of functions) until we get a number at the end. Finally, optimization in deep learning is achieved by performing back-propagation.
I really recommend learning more about back-propagation here, as it will help you gain an intuition for the deep learning frameworks you create and how to fix your model when learning doesn't go according to plan. The learning process will almost definitely not go according to plan. 👌
We want an activation function that optimizes backprop and avoids creating (too many) dead neurons.
So, we talked a bit about the sigmoid activation function, but in reality, no one really uses (or is recommended to use) it anymore; more efficient AND effective activation functions have since been discovered. Let's dive into the types and the classic pros/cons lists.
A dead neuron is a neuron that can't be activated -- this can occur when our learning rate is too high or we initialized with an unlucky set of weights. When we have a dead neuron, we can't back-prop through that neuron.

Sigmoid

Squashes numbers to range [0,1]
Historically popular since it has a nice interpretation as the saturating “firing rate” of a neuron
PROBLEMS:
Saturated neurons “kill” the gradients: during backprop, the “local gradient” is multiplied by the previous gradient. If the input value is very negative or very positive, the local gradient is basically 0 because the slope at those points is zero -- imagine a network of sigmoid neurons in a saturated regime (outputting either zero or one): gradients can’t back-propagate through the network.
Sigmoid outputs are not zero-centered: when you preprocess your data, you want to make sure it is zero-centered. However, if we stack layers of sigmoid neurons, each layer receives non-zero-centered data (in [0,1]), and we observe slower convergence because the gradients on w are always all positive or all negative. Take-home: you want zero-centered data at the input, and zero-centered activations throughout.
Performing exp() is a bit compute-expensive (minor compared to the dot products)
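To see the saturation problem concretely, here is a quick numeric check (a minimal sketch; the derivative of the sigmoid is s(z)(1 - s(z))):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_local_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)       # derivative of the sigmoid at z

for z in [-10.0, 0.0, 10.0]:
    print(z, sigmoid_local_grad(z))
# -10.0  ~4.5e-05   <- saturated: the gradient is effectively killed
#   0.0   0.25      <- the largest the local gradient can ever be
#  10.0  ~4.5e-05   <- saturated again
```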
Tanh

An attempt to fix the sigmoid -- it is zero-centered but otherwise has the same problems as above (it looks like two sigmoids put together)
Squashes numbers to range [-1, 1]
Zero-centered (nice)
Still kills gradient when saturated :(
ReLU

Computes f(x) = max(0,x)
During backprop, if the input was positive, the local slope/gradient is 1, letting the gradient through; otherwise it is 0, killing it
Pros:
Does not saturate (in the + region), so far fewer gradients get zeroed out during backprop
Very computationally efficient
Converges much faster than sigmoid/tanh in practice (e.g. 6x)
PROBLEMS:
Non-zero centered output
An annoyance: if a neuron is inactive (its output is 0), then during backprop it kills the gradient (which is technically undefined at exactly 0)
In practice this happens two ways: if you initialize the neurons in a very unlucky way, neurons that might have been useful are dead from the start and can never receive a gradient; or, if your learning rate is too high, then as the neurons jitter around during training, some can get knocked off the data manifold and never be activated again
Note: Potential solution is to initialize ReLU neurons with slightly positive biases (e.g. 0.01) ⭐
Leaky ReLU

f(x) = max(0.01x, x)
Pros:
Same pros as a normal ReLU, except “will not die”
ELU (Exponential Linear Unit)

f(x) = x if x > 0, else α(exp(x) − 1)

Pros:
all the benefits of ReLU
doesn't die
closer to zero mean outputs
Cons:
computation requires exp()
Maxout

Does not have the basic form of dot product followed by a nonlinearity; instead it takes the max of two linear functions: f(x) = max(w1·x + b1, w2·x + b2)
Pros:
generalizes ReLU and Leaky ReLU
Linear regime! Doesn't saturate! Doesn't die!
Cons:
doubles the number of parameters per neuron
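Minimal NumPy sketches of these activations, using the standard formulas given above (the maxout weight names are just illustrative):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)                     # f(x) = max(0, x)

def leaky_relu(x, alpha=0.01):
    return np.maximum(alpha * x, x)               # f(x) = max(0.01x, x)

def elu(x, alpha=1.0):
    # requires exp() on the negative side, hence the extra compute cost
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def maxout(x, W1, b1, W2, b2):
    # max of two separate linear functions -- twice the parameters per neuron
    return np.maximum(W1 @ x + b1, W2 @ x + b2)

# Local gradients during backprop:
#   relu:       1 if x > 0, else 0      (can die at x <= 0)
#   leaky_relu: 1 if x > 0, else alpha  (never exactly 0, so it "will not die")
```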
In practice:

Use ReLU. Be careful with your learning rates.
Try out Leaky ReLU / Maxout / ELU
Try out tanh but don't expect much
Don't use sigmoid
If you want to build a really simple neural network from scratch, this is a great tutorial.