Fundamental ML Models

Created by Lily Vittayarukskul for SVAI research community. Open to collaborators!

Introduction

Linear regression

Logistic Regression

Reference for this subsection can be found here.

Logistic regression is a classification algorithm (e.g. for binary classification). It's a variant of linear regression where the dependent (output) variable is categorical, i.e. it takes one of a few possible discrete values. The underlying function is the logistic (sigmoid) function applied to a linear combination of the inputs: h(x) = 1 / (1 + e^-(w·x + b)), which squashes the score w·x + b into a probability between 0 and 1.

What's really nice about logistic regression is that this function (and the associated cross-entropy loss) is differentiable. Hence, we can train the model using gradient descent.
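As a concrete illustration, here is a minimal NumPy sketch of logistic regression trained with batch gradient descent (the toy data, learning rate, and iteration count are illustrative assumptions, not from the reference):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, lr=0.1, n_iters=1000):
    """Batch gradient descent on the logistic (cross-entropy) loss."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iters):
        p = sigmoid(X @ w + b)                # predicted probabilities
        grad_w = X.T @ (p - y) / n_samples    # gradient of the loss w.r.t. weights
        grad_b = np.mean(p - y)               # gradient w.r.t. the bias
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy usage: two well-separated clusters labeled 0 and 1 (assumed example data)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
w, b = train_logistic_regression(X, y)
preds = (sigmoid(X @ w + b) >= 0.5).astype(int)
print("training accuracy:", np.mean(preds == y))
```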

Great framework code can be found here.

K-Means Clustering

Reference for this subsection can be found here.

K-means clustering partitions the dataset into k clusters, where each instance is allocated to one cluster. Each cluster is represented by the cluster's center, also known as the centroid. It is a heuristic algorithm: results depend on the initial conditions (in this case, the initial centroids).

Implementation

  1. Initialize K centroids

  2. For each data point, find the nearest centroid and classify the sample into the respective cluster

  3. Recalculate centroids as the arithmetic mean of all instances assigned to it

  4. Repeat steps 2-3 till convergence. The K final clusters are then obtained.
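A minimal NumPy sketch of steps 1-4 (random initialization, Euclidean distance, and a fixed tolerance are choices assumed here; in practice sklearn.cluster.KMeans is the standard implementation):

```python
import numpy as np

def kmeans(X, k, n_iters=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: initialize K centroids by picking random data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        # (assumes no cluster empties out; real implementations re-seed empty clusters)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once centroids stop moving
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return labels, centroids
```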

Optimization

Choosing good parameters

  1. Rule of thumb: K ≈ sqrt(N/2). By doing so, each cluster will ideally have about sqrt(2N) points (e.g. for a dataset of N = 50 points, start with sqrt(50/2) = 5 clusters of about 10 points each).

  2. Elbow method: perform clustering for a range of K values, and calculate the total within-cluster sum of squares (WCSS) for each K. WCSS gives a measurement of the variance within each cluster. The lower the WCSS, the better the partition. As K increases, WCSS will monotonically decrease, so an optimal K can be selected from the range where the curve flattens out.

  3. Silhouette analysis: study the separation distance between clusters. Silhouette coefficients measure how close data points are to neighboring clusters and have a range of [-1, +1]. Points with a score near +1 are far away from neighboring clusters, whereas values around 0 indicate that samples are close to the boundaries between clusters. In addition, negative values might point out that instances are wrongly classified. Silhouette plots and averaged silhouette scores can be used to select an optimal K (both the elbow and silhouette diagnostics are sketched below). A nice visual example with sklearn can be found here.
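A hedged scikit-learn sketch of both diagnostics (the toy blobs and the range of K values are assumptions):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss = km.inertia_                       # within-cluster sum of squares (elbow method)
    sil = silhouette_score(X, km.labels_)    # average silhouette coefficient
    print(f"K={k}: WCSS={wcss:.1f}, silhouette={sil:.3f}")

# Elbow method: pick the K where WCSS stops dropping sharply.
# Silhouette analysis: pick the K with the highest average silhouette score.
```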

Some Applications

  • Image processing: image segmentation (image partitioning into clusters), color quantization (reducing the number of colors in an image)

  • Cluster analysis: segmenting users by spending behavior for e-commerce companies, user profiles grouping in search engines, patient clustering in healthcare

  • Feature learning: features extraction in images and documents

  • Anomaly detection: finding micro clusters formed by outliers

Limitations

K-Means aims to build a partition of the dataset into evenly spaced and well-shaped (ideally spherical) clusters of similar size. However, real clusters might have unequal variance or be of uneven size, in which case K-Means can split them poorly.

Powerful variant of K-Means: Soft Assignment

In soft assignment, each cluster is represented by a probability distribution (say a Gaussian, with some mean and variance), and each data point belongs to different clusters with different probabilities. These are updated iteratively in the same way as described above. This version of K-means is called soft assignment, and it is an example of the EM algorithm.
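In scikit-learn this corresponds closely to fitting a Gaussian mixture model with EM; a minimal sketch (the toy data and number of components are assumptions):

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=0).fit(X)
soft = gmm.predict_proba(X)   # each row: membership probabilities over the 3 clusters
hard = gmm.predict(X)         # hard labels, for comparison with K-means
print(soft[:3].round(3))
```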

Dimensionality Reduction and PCA

Dimensionality reduction is pretty self-explanatory: (in math-y terms) given an n-dimensional space, we're reducing that space to m dimensions, where m < n. For example, if our initial data described the 3-D work location of each person in SF (x, y, z coordinates for each point) and we didn't find the height coordinate (z-coordinate) important, we perform dimensionality reduction by getting rid of the z-coordinate of our data (now we have 2-D data).

What I just did in the example above is feature selection, which focuses on finding a subset of the original attributes. Dimensionality reduction can also be performed via feature extraction -- transforming the original high-dimensional space into a lower-dimensional one. The two main types of transformations are:

  • Linear transformations, which include PCA and LDA

    • PCA finds the directions of maximal variance in high-dimensional data. By selecting only those axes that have the largest variance, PCA aims to capture the directions that contain the most information about the inputs, so we can express as much as possible with a minimal number of dimensions. The linearity of PCA, however, places significant limitations on the kinds of feature dimensions that can be extracted.

    • LDA (Linear Discriminant Analysis) is a supervised linear transformation: it finds the directions that best separate the known classes, rather than the directions of largest overall variance.

  • Non-linear transformations, which include t-SNE and autoencoders

    • t-SNE is a non-linear technique that maps high-dimensional points into 2-D or 3-D while trying to keep similar points close together; it is mostly used for visualization.

    • Autoencoders are neural networks trained to reconstruct their input through a narrow bottleneck layer; the bottleneck activations serve as a learned low-dimensional representation.

Some applications

Dimensionality reduction is often used as a powerful pre-processing step. For example:

  • Supervised: LDA + K-Nearest Neighbors

  • Unsupervised: PCA + K-Means clustering

Implementation

We'll go over three popular ways to implement dimensionality reduction: lasso regression, PCA, and ICA.

Lasso Regression (L1 regularization)

Since L1 regularization promotes sparsity, many of the estimated coefficients in the weight vector will be zero. The features associated with non-zero coefficients are then kept. The higher the alpha parameter, the fewer features selected. However, the L1 norm can lead to unstable results whenever features are highly correlated, for example by zeroing the coefficient of one of two correlated features.
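A minimal scikit-learn sketch of Lasso-based feature selection (the alpha value and synthetic dataset are assumptions):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data: 20 features, only 5 of which are informative
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
selected = np.flatnonzero(lasso.coef_)   # indices of features with non-zero coefficients
print("kept features:", selected)
# Larger alpha -> stronger sparsity -> fewer surviving features.
```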

PCA

PCA is a popular feature extraction technique based on retaining the maximum variance of the data. The input space X is linearly transformed into a space X' of uncorrelated components called the principal components:

X' = XW, where the columns of W are the eigenvectors of the covariance matrix of X, ordered by decreasing variance explained.

The two main actions of PCA are:

  • Highly correlated features → uncorrelated features

  • If dimensionality reduction is desired, transformed data can be further projected into 1D along its largest principal component.
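A minimal scikit-learn sketch of both actions above (the correlated toy data is an assumption):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Two highly correlated features plus a little noise
x1 = rng.normal(size=200)
X = np.column_stack([x1, 3 * x1 + rng.normal(scale=0.1, size=200)])

pca = PCA(n_components=2).fit(X)
X_uncorr = pca.transform(X)                  # uncorrelated principal components
print("explained variance ratio:", pca.explained_variance_ratio_)

X_1d = PCA(n_components=1).fit_transform(X)  # project onto the largest principal component
print("reduced shape:", X_1d.shape)
```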

Independent Component Analysis (ICA)

In some cases, the orthogonality of the principal components prevents PCA from extracting informative features. For these cases, independent component analysis (ICA) is a better choice. You can find a great code walk-through here!
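Here is a hedged FastICA sketch on the classic mixed-signals setup (the source signals and mixing matrix are assumptions):

```python
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)                       # source 1: sinusoid
s2 = np.sign(np.sin(3 * t))              # source 2: square wave
S = np.column_stack([s1, s2])
A = np.array([[1.0, 0.5], [0.5, 2.0]])   # assumed mixing matrix
X = S @ A.T                              # observed mixtures

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)             # recovered independent components
print("recovered shape:", S_est.shape)
```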

Important Notes/Reminders

PCA works better on data with linear relationships. For non-linear relationships, Kernel PCA can be applied for better results. In general, dimensionality reduction problems are approached by first applying a linear algorithm and then, when results are not satisfactory, trying non-linear techniques (autoencoders, etc.).

Note that dimensionality reduction by means of Lasso regularization works by selecting features without modifying them (feature selection), in contrast to PCA, which transforms the original features into a low-dimensional space (feature extraction).

K-Nearest Neighbors

Lots of parallel concepts to K-Means clustering, but K-Means is used for unsupervised learning while K-Nearest Neighbors is used for supervised learning: a new point is classified by the majority label among its k closest training points.

You can gain a more in-depth understanding here.
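For a quick feel, a minimal scikit-learn sketch (the choice of k = 5 and the iris dataset are illustrative assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)  # k = 5 nearest neighbors
print("test accuracy:", knn.score(X_test, y_test))
```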

Support Vector Machine (SVM)

Reference for this subsection can be found here.

SVM is a really powerful, high-accuracy approach to classification. The two main types are binary SVM and multi-class SVM.

Binary Classification (most common)

SVM with a Gaussian kernel has consistently been shown to be one of the best machine learning models, achieving the highest accuracy on a large variety of datasets, especially datasets that do not involve images or audio.

In this problem context, given a binary classification dataset, an SVM aims to find a separating hyperplane between positive and negative instances. Although there are many possible hyperplanes that separate all data points correctly, the SVM algorithm seeks the hyperplane with the largest margin. That is, it maximizes the distance between the hyperplane and the nearest data points. Hence, it is known as a maximum-margin classifier. Those closest data points are called support vectors.

Understanding the various hyper-parameters of an SVM is central to achieving high accuracy. Diving in:

First Hyperparameter (C): the SVM algorithm includes a regularization parameter (usually denoted as C), which controls how important it is to avoid misclassifying data points.

  • large values of C penalize misclassifications heavily, so the classifier tries harder to classify every training point correctly (at the cost of a smaller margin)

  • small values of C imply that more data points might be misclassified, but the resulting classifier will have a larger margin with respect to the rest of the data points.

In general, if the dataset has more outliers, we want to allow for more data points to be misclassified. In practice, we try many different values for the hyper-parameter C, and choose the one that performs best on the validation dataset.
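A hedged sketch of that search, using a linear-kernel SVM and cross-validation as a stand-in for a separate validation set (the candidate C values and synthetic data are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic binary classification data with a little label noise
X, y = make_classification(n_samples=300, n_features=10, flip_y=0.05, random_state=0)

grid = GridSearchCV(
    SVC(kernel='linear'),
    param_grid={'C': [0.01, 0.1, 1, 10, 100]},  # candidate regularization strengths
    cv=5,                                        # 5-fold CV as the validation procedure
)
grid.fit(X, y)
print("best C:", grid.best_params_['C'], "cv accuracy:", round(grid.best_score_, 3))
```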

Second Hyperparameter (the kernel, used when the optimal classifier is not linear): it's the kernel trick! Here we introduce new dimensions, i.e. map the data to a higher-dimensional space, so that a linear hyperplane there corresponds to the correct (non-linear) decision boundary when mapped back to the original space. In practice,

  • SVM kernels provide a way to perform the transformation on the data implicitly. That is, we need not go through every data point and calculate the values for the new dimensions. For a restricted set of transformations, the SVM optimizer is able to implicitly work in such a high-dimensional space by looking only at dot products between pairs of samples. This makes things a lot more computationally efficient when the number of new dimensions is very large.

  • We have figured out a set of kernels which are extremely powerful and work very well on a large number of datasets. In particular, the polynomial kernel and gaussian kernel (also known as radial basis function) are the most popular ones.

Third hyperparameter (γ [gamma]): adjusting the reach of each data point's influence, and with it the support vectors. This hyper-parameter, typically denoted by γ (gamma), controls how far the influence of each data point reaches: high values of γ give each point a short-ranged influence, so the decision boundary depends only on points very close to it, whereas low values of γ give a far-reaching influence and smoother boundaries. Since this changes which (and how many) points end up as support vectors, γ is tuned on validation data just like C.
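A small sketch that fits an RBF-kernel SVM for several γ values and reports how many support vectors each model keeps (the γ grid and toy data are assumptions):

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Non-linearly separable toy data
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

for gamma in [0.1, 1, 10, 100]:
    clf = SVC(kernel='rbf', C=1.0, gamma=gamma).fit(X, y)
    # n_support_ holds the number of support vectors per class
    print(f"gamma={gamma}: support vectors={clf.n_support_.sum()}")
```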

Now, we're ready to introduce the second type of SVM: multi-class SVM (non-binary).

Multiclass SVM (non-binary)

This variant of SVM classifies instances into more than two classes. There are two popular options:

  1. One-versus-rest: train one SVM classifier per class to distinguish one single class from the rest. For predictions, we follow the winner-takes-all strategy. That is, new instances will be categorized based on the highest scoring output.

  2. One-versus-one: train one SVM classifier for each pair of classes. New samples are predicted following the max-wins voting strategy. That is, each classifier assigns the instance to one of the two classes, and the final prediction is given by the class with the most votes.
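Both strategies are available out of the box in scikit-learn; a minimal sketch on a three-class dataset (iris is an illustrative choice, and training accuracy is reported only for brevity):

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

ovr = OneVsRestClassifier(SVC(kernel='rbf', gamma='scale')).fit(X, y)  # one SVM per class
ovo = OneVsOneClassifier(SVC(kernel='rbf', gamma='scale')).fit(X, y)   # one SVM per class pair
print("one-vs-rest accuracy:", ovr.score(X, y))
print("one-vs-one accuracy:", ovo.score(X, y))
```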

You can dive deeper into the application and theory here.

Naive Bayes

Reference for this subsection can be found here.

Naive Bayes is a supervised learning algorithm based on Bayes’ Theorem. The word naive comes from the assumption of independence among features. That is, if our input vector is (x1, x2,...,xn), then xi's are conditionally independent given y.

Variants of this approach include: the Gaussian distribution (Gaussian naive Bayes), the multinomial distribution (multinomial naive Bayes), and the Bernoulli distribution (Bernoulli naive Bayes).
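A minimal Gaussian naive Bayes sketch (the iris dataset and train/test split are illustrative assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

gnb = GaussianNB().fit(X_train, y_train)
print("test accuracy:", gnb.score(X_test, y_test))
# For word-count features, MultinomialNB is the usual drop-in; for binary features, BernoulliNB.
```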

You can dive deeper into the application (text classification) and theory here.

Matrix Factorization and Recommendation Systems (Look into this more)

Reference for this subsection can be found here.

Recommendation systems tell customers what may be best suited to them.

  • Model parameters (the two low-rank factor matrices U and X, whose product approximates the rating matrix) can be optimized via alternating least squares (first fix U and optimize X, then fix X and optimize U, and repeat until convergence) or stochastic gradient descent.

  • Types: content-based recommendations, and collaborative filtering via nearest neighbors, which asks questions such as "which other users are most similar to this user?" by defining a distance metric d(Ua, Ub) between user vectors (and similarly between movie vectors). A commonly used distance metric is cosine distance.
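A hedged NumPy sketch of alternating least squares on a tiny dense rating matrix, using the U (user factors) and X (item factors) naming from the bullet above (the toy ratings, rank, regularization, and iteration count are all assumptions):

```python
import numpy as np

# Toy user-by-item rating matrix (rows: users, columns: items)
R = np.array([[5.0, 3.0, 0.0, 1.0],
              [4.0, 0.0, 0.0, 1.0],
              [1.0, 1.0, 0.0, 5.0],
              [1.0, 0.0, 0.0, 4.0]])
k, lam = 2, 0.1                                    # latent rank and regularization strength
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(R.shape[0], k))    # user factors
X = rng.normal(scale=0.1, size=(R.shape[1], k))    # item factors

for _ in range(20):
    # Fix X, solve for U with the ridge-regression closed form; then fix U and solve for X
    U = R @ X @ np.linalg.inv(X.T @ X + lam * np.eye(k))
    X = R.T @ U @ np.linalg.inv(U.T @ U + lam * np.eye(k))

print("reconstruction:\n", (U @ X.T).round(2))
# Note: this toy version treats zeros as observed ratings; a real recommender would
# restrict the least-squares fits to observed entries only (see the SVD comparison below).
```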

Comparison to SVD

Singular Value Decomposition (SVD) is a method for finding the low-rank approximation of a matrix, which is what matrix factorization is also doing. However, SVD doesn't work on incomplete matrices. Decomposing a matrix with missing values is equivalent to setting the missing values to 0, which will lead to undesirable results.
