Machine Learning Heuristics
Created by Lily Vittayarukskul for SVAI research community. Open to collaborators!
A large amount of the knowledge in this section comes from this source.
In this section, we'll introduce major concepts of machine learning:
cost functions
optimization
overfitting, cross-validation, regularization
Once you have a good intuition for these concepts, you'll be able to understand the bigger picture -- the machine learning process used to develop a good model:
The learning process: representation + evaluation + optimization
Cost functions measure, for a given set of weights, how close a model's predicted y values are to the true y values (the difference at each data point is called a "residual").
E.g., for linear regression, we have the mean-squared-error cost function (the average of the squared errors; written out below).
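For concreteness, here is the mean-squared-error cost for predictions $\hat{y}_i$ over $n$ training points (the notation here is ours, not from the original source):

```latex
J(w) = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2
```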
Given a machine learning model with parameters (weights and biases) and a cost function to evaluate how good a particular model is, our learning problem reduces to that of finding a good set of weights for our model which minimizes the cost function.
Thus, the process of finding the best model out of the many possible models is achieved by minimizing the cost function -- optimization. Gradient descent is one of the most common optimization algorithms. It is an iterative method, and a few variants of this concept in action include:
Batch gradient descent: we calculate the gradient over the whole training dataset to perform one parameter update; because every update touches every training point, batch gradient descent can be very slow.
Stochastic gradient descent: computes the gradient for each update using a single training data point $x_i$ (chosen at random). The idea is that the gradient calculated this way is a stochastic approximation to the gradient calculated using the entire training data.
Mini-batch gradient descent: we calculate the gradient for each small mini-batch of training data. Mini-batch GD is usually the method of choice because computing infrastructure -- compilers, CPUs, GPUs -- is often optimized for performing vector additions and vector multiplications (see the sketch below).
Note: if you're using a novel machine learning algorithm and applying gradient descent to it, it's important to have at least a rough idea of what the gradients look like in order to make sure the method works as desired and does not get stuck or behave erratically.
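Here is a minimal NumPy sketch of mini-batch gradient descent for least-squares linear regression. All names (`learning_rate`, `batch_size`, etc.) are our own illustrative choices, not part of any particular library; setting `batch_size=1` recovers stochastic GD, and `batch_size=n` recovers batch GD.

```python
import numpy as np

def minibatch_gradient_descent(X, y, learning_rate=0.01, batch_size=32, epochs=100):
    """Fit weights w minimizing mean squared error via mini-batch gradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        indices = np.random.permutation(n)                 # reshuffle each epoch
        for start in range(0, n, batch_size):
            batch = indices[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            residuals = Xb @ w - yb                        # predictions minus true y's
            gradient = 2 * Xb.T @ residuals / len(batch)   # gradient of MSE on this mini-batch
            w -= learning_rate * gradient                  # one parameter update per mini-batch
    return w

# Toy usage: recover weights close to [2.0, -3.0] from noisy synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = X @ np.array([2.0, -3.0]) + 0.1 * rng.normal(size=500)
print(minibatch_gradient_descent(X, y))
```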
The DREAM -- the ability to perform well on unseen data is called generalization, and it is the characteristic we most want in a model.
In the common case where a model performs well on training data (the data on which the algorithm was trained) but does not perform well on test data (new or unseen data), we have what's called overfitting. Smaller datasets tend to be the biggest culprits, because we don't have enough data to represent the real-world phenomenon we're attempting to model.
But have no fear! One powerful way to avoid overfitting is to add an extra term to our cost function, controlled by a hyperparameter, that loosens how strongly our model fits the training data -- this process is called regularization.
Regularization: controlling the model's complexity, i.e. by introducing an additional term in our cost function in order to penalize large weights. This biases our model to be simpler, where "simpler" means weights of smaller magnitude (or even zero). We want to make the weights smaller because complex, overfit models are characterized by large weights. Some variants include:
L2 Regularization (Ridge Regularization): we add $\lambda \|w\|^2$ to the cost function, where larger values of $\lambda$ imply more regularization, i.e. smaller values for the model parameters, and $\|w\|^2 = w_1^2 + w_2^2 + \cdots + w_d^2$.
For L1 Regularization (Lasso Regularization), we add $\lambda \|w\|_1$ instead, where $\|w\|_1 = |w_1| + |w_2| + \cdots + |w_d|$.
An interesting property of L1 regularization is that the model's parameters become sparse during optimization, i.e. it promotes a larger number of parameters being exactly zero, because small weights are penalized just as strongly as large ones. This can help us identify which features are most important for making predictions, or reduce the size of a model (the zero values don't need to be stored). A quick comparison of the two appears in the sketch below.
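One quick way to see the sparsity effect in practice is to compare Ridge and Lasso from scikit-learn on the same data; this is a sketch, assuming scikit-learn is installed, and `alpha` plays the role of $\lambda$ (the data and alpha values are arbitrary illustrations):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)  # only 2 of 10 features matter

ridge = Ridge(alpha=1.0).fit(X, y)   # L2: shrinks all weights, rarely exactly to zero
lasso = Lasso(alpha=0.1).fit(X, y)   # L1: drives irrelevant weights to exactly zero

print("ridge:", np.round(ridge.coef_, 3))
print("lasso:", np.round(lasso.coef_, 3))
```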
We can pick a regularization process, but how do we know it's the best one to use, or whether it actually improves the overall modelling process? To check whether we chose the best hyperparameter for our regularization, it's best practice to perform cross-validation -- a process for validating that we're improving the model's accuracy as robustly as possible. Some methods of cross-validation include:
Holdout method: we save a subset of the training set for validation (we check whether our model is getting better on the validation dataset during the optimization process). A more robust variant is:
K-fold CV: the dataset is divided into k separate parts, and we repeat the training process k times. Each time, one part is used as validation data and the rest is used for training a model. We then average the errors to evaluate the model. Note that k-fold cross-validation increases the computational requirements for training our model by a factor of k (see the sketch below).
Leave-one-out cross-validation is a special instance of k-fold cross-validation in which k equals the number of data points in the dataset. Each time, we hold out a single data point, train a model on the rest of the data, and use the held-out point to test the model. We then average the errors to evaluate the model.
Note: these methods are more robust than the holdout method; leave-one-out in particular is computationally practical only for smaller datasets.
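For instance, using k-fold cross-validation to compare two regularization strengths might look like this with scikit-learn (a sketch; the dataset and alpha values are arbitrary illustrations):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

for alpha in (0.1, 10.0):                 # candidate hyperparameters
    scores = cross_val_score(Ridge(alpha=alpha), X, y,
                             cv=5,        # k = 5 folds
                             scoring="neg_mean_squared_error")
    print(f"alpha={alpha}: mean CV error = {-scores.mean():.2f}")
```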
Good to know: ensemble methods are good learners in the case of small datasets. In ensemble methods, instead of training one model we train multiple models and average their predictions, as in the sketch below.
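A minimal sketch of the averaging idea, here a hand-rolled version of bagging over decision trees (function and parameter names are our own; `X_train`, `y_train`, `X_test` are assumed to be NumPy arrays):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_predict(X_train, y_train, X_test, n_models=25, seed=0):
    """Train n_models trees on bootstrap resamples and average their predictions."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)                  # bootstrap sample (with replacement)
        model = DecisionTreeRegressor().fit(X_train[idx], y_train[idx])
        preds.append(model.predict(X_test))
    return np.mean(preds, axis=0)                         # averaged prediction
```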
Now that we've gone over the major aspects of the machine learning process, let's talk about the three essential questions for finding the best model for any particular problem at hand:
Representation: What does the model look like?
Evaluation: How do we differentiate good models from bad ones?
Optimization: What is our process or algorithm for finding the good models among all the possible models?
Choosing a representation of a learner is tantamount to choosing the set of classifiers that it can possibly learn. This set is called the hypothesis space. (E.g. linear regression's hypothesis space is the set of linear functions.)
Some key considerations: Is the scenario you are trying to capture well represented by the model function? Is it overly restrictive? Is it overly flexible? For example, if the data has a quadratic trend, but we are trying to fit a linear function to it, we are being overly restrictive.
An evaluation or objective function (or cost function) is needed to distinguish good classifiers from bad ones. (E.g. in the case of least-squares linear regression, the cost function is the mean-squared-error cost function.)
Some key considerations: Does your cost function capture the relative importance of different kinds of mistakes? For example, is being off by 0.3 for one data point and 0.1 for another better or worse than being off by 0.2 for both? Is a false positive as bad as a false negative?
Optimization -- what is our process or algorithm for finding the good models among all the possible models in the hypothesis space? (e.g. Gradient descent is an optimization algorithm that can be used for finding the optimal model for least-squares linear regression.)
Some key considerations: How efficient is the optimization technique in practice? Does it always find the optimal solution? Is it possible for it to output sub-optimal solutions? If yes, how often does it happen in practice?
To close, here are a few terms from the discussion above that you might not be familiar with:
"Precision and recall": In pattern recognition, information retrieval, and binary classification, precision (also called positive predictive value) is the fraction of relevant instances among the retrieved instances, while recall (also known as sensitivity) is the fraction of relevant instances that were retrieved out of the total number of relevant instances. Both precision and recall are therefore based on an understanding and measure of relevance.
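In terms of true positives (TP), false positives (FP), and false negatives (FN), these definitions reduce to:

```latex
\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}
```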
Information gain: In machine learning, this concept can be used to define a preferred sequence of attributes to investigate in order to most rapidly narrow down the state of the variable being predicted. Such a sequence (which depends at each stage on the outcomes of investigating previous attributes) is called a decision tree. Usually an attribute with high mutual information should be preferred to other attributes.
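In decision-tree terms, the information gain of an attribute $a$ for a target variable $T$ is the reduction in entropy $H$ from splitting on $a$ (standard notation, not from the original source):

```latex
IG(T, a) = H(T) - H(T \mid a)
```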
K-L Divergence: Kullback–Leibler divergence (also called relative entropy) is a measure of how one probability distribution diverges from a second, expected probability distribution. Applications include characterizing the relative (Shannon) entropy in information systems, randomness in continuous time-series, and information gain when comparing statistical models of inference. In contrast to variation of information, it is a distribution-wise asymmetric measure and thus does not qualify as a statistical metric of spread. A Kullback–Leibler divergence of 0 indicates that the two distributions behave identically (or nearly so), while larger values indicate increasingly different behavior. In simplified terms, it is a measure of surprise, with diverse applications in applied statistics, fluid mechanics, neuroscience, and machine learning.
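For discrete distributions $P$ and $Q$ over the same set of outcomes, the standard definition is:

```latex
D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}
```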
Margin: the margin of a single data point is defined to be the distance from the data point to a decision boundary. Note that there are many distances and decision boundaries that may be appropriate for particular datasets and goals. A margin classifier is a classifier that explicitly utilizes the margin of each example while learning. There are theoretical justifications (based on the VC dimension) as to why maximizing the margin (under suitable constraints) may be beneficial for machine learning and statistical inference algorithms.
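For example, for a linear decision boundary $f(x) = w \cdot x + b$ and a label $y \in \{-1, +1\}$, the geometric margin of a point is commonly written as:

```latex
\gamma(x, y) = \frac{y \, (w \cdot x + b)}{\|w\|}
```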