Computational Linear Algebra Techniques

Created by Lily Vittayarukskul for SVAI research community. Open to collaborators!

Most of this knowledge is derived from this source.

Introduction

This section assumes you know foundational linear algebra; we'll just emphasize a few key concepts crucial to the intuition behind (mostly) deep learning techniques.

Measuring Size of Vector/Matrix: Norms

The L2 norm is used so frequently in machine learning that it is often denoted simply as ||x||, with the subscript 2 omitted. For computational convenience, we often work with the squared L2 norm instead, which can be calculated simply as x_transpose*x. However, a trade-off to recognize is that the squared L2 norm increases very slowly near the origin, making it difficult to discriminate between elements that are exactly zero and elements that are small but nonzero. In cases where this discrimination is important, we turn to the L1 norm.

With the L1 norm, the function grows at the same rate in all locations.

Sometimes we may also wish to measure the size of a matrix. In the context of deep learning, the most common way to do this is with the Frobenius norm, which is analogous to the L2 norm of a vector.
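
As a quick illustration, here is a minimal NumPy sketch comparing these norms (the vector and matrix values are made-up examples):

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0])           # example vector (made-up values)
A = np.array([[1.0, 2.0], [3.0, 4.0]])   # example matrix (made-up values)

l2 = np.linalg.norm(x)           # L2 norm: sqrt(sum of squares) = 5.0
l2_squared = x @ x               # squared L2 norm: x_transpose*x = 25.0
l1 = np.linalg.norm(x, ord=1)    # L1 norm: sum of absolute values = 7.0
fro = np.linalg.norm(A, 'fro')   # Frobenius norm: analogous to the vector L2 norm

print(l2, l2_squared, l1, fro)
```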

Special Properties: Orthogonality

A vector x and a vector y are orthogonal to each other if x_transpose*y = 0. Geometrically, this means the two vectors are at a right angle to each other (and, if both are nonzero, they are linearly independent). If these vectors also have unit norm, we call them orthonormal. An orthogonal matrix is a square matrix whose rows and columns are mutually orthonormal. We tend to favor orthogonal matrices because their inverse is simply their transpose, which is very cheap to compute.
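
Here is a minimal NumPy sketch of these properties (the vectors are made-up examples):

```python
import numpy as np

x = np.array([1.0, 0.0])
y = np.array([0.0, 2.0])

print(x @ y)  # 0.0 -> x and y are orthogonal

# Rescale to unit norm so the vectors are orthonormal
x_hat = x / np.linalg.norm(x)
y_hat = y / np.linalg.norm(y)

# Stack the orthonormal vectors as columns of a matrix Q.
# For an orthogonal matrix, Q_transpose*Q = I, so its inverse is just its transpose.
Q = np.column_stack([x_hat, y_hat])
print(np.allclose(Q.T @ Q, np.eye(2)))      # True
print(np.allclose(np.linalg.inv(Q), Q.T))   # True: inverting is as cheap as transposing
```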

Eigen-decomposition

We can also decompose matrices in ways that show us information about their functional properties that is not obvious from the representation of the matrix as an array of elements. One of the most widely used kinds of matrix decomposition is called eigendecomposition, in which we decompose a matrix into a set of eigenvectors and eigenvalues.

To explore this possibility, let's start with a foundational property: an eigenvector of a square matrix A is a nonzero vector v such that multiplication by A alters only the scale of v: A*v = λ*v, where the scalar λ is the eigenvalue corresponding to v.

If we can find a set of n linearly independent eigenvectors of A (with corresponding eigenvalues λ_1, ..., λ_n), we can concatenate these eigenvectors as the columns of a square matrix V and collect the eigenvalues into a vector λ, such that the eigendecomposition of A is then given by: A = V * diag(λ) * V_inverse.
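
A minimal sketch of eigendecomposition with NumPy (the matrix is a made-up symmetric example, so its eigendecomposition is guaranteed to exist):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])      # made-up symmetric matrix

eigvals, V = np.linalg.eig(A)   # columns of V are eigenvectors; eigvals holds the lambdas

# Each eigenvector is only rescaled by A: A*v = lambda*v
v0 = V[:, 0]
print(np.allclose(A @ v0, eigvals[0] * v0))   # True

# Reconstruct A from the decomposition: A = V * diag(lambda) * V_inverse
A_reconstructed = V @ np.diag(eigvals) @ np.linalg.inv(V)
print(np.allclose(A, A_reconstructed))        # True
```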

This is great! We've decomposed a matrix A -- however, an important fact is that not every real matrix A has an eigendecomposition 😢. But every real matrix does have a singular value decomposition (SVD). SVD is really similar to eigendecomposition, but this time we write A as a product of three matrices: A = U*D*V_transpose.

Each of these matrices is defined to have a special structure. The matrices U and V are both defined to be orthogonal matrices. The matrix D is defined to be a diagonal matrix (Note that D is not necessarily square).
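
A minimal SVD sketch with NumPy (the matrix shape is arbitrary; note that np.linalg.svd returns the singular values as a vector, so we rebuild the rectangular matrix D ourselves):

```python
import numpy as np

A = np.random.default_rng(0).normal(size=(4, 3))   # made-up 4x3 matrix

U, s, Vt = np.linalg.svd(A)   # U: 4x4 orthogonal, s: singular values, Vt: V_transpose (3x3 orthogonal)

# Rebuild the rectangular diagonal matrix D from the vector of singular values
D = np.zeros((4, 3))
D[:3, :3] = np.diag(s)

print(np.allclose(A, U @ D @ Vt))   # True: A = U*D*V_transpose
```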

Building off of SVD, we can define what's called the pseudoinverse: A+ = V * D+ * U_transpose, where D+ is obtained by taking the reciprocal of the nonzero entries of D and then transposing the resulting matrix.

The pseudoinverse gives us some really useful practical properties (see the sketch after this list):

  • When A has more columns than rows, solving a linear equation using the pseudoinverse provides the solution x = A+y with minimal Euclidean norm ||x||_2 among all possible solutions.

  • When A has more rows than columns, it is possible for there to be no solution. In this case, using the pseudoinverse gives us the x for which Ax is as close as possible to y in terms of Euclidean norm ||Ax − y||.
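
A minimal sketch of both cases using NumPy's pseudoinverse (the random systems are made up just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# More columns than rows: infinitely many solutions exist;
# the pseudoinverse picks the one with minimal Euclidean norm ||x||_2.
A_wide = rng.normal(size=(2, 4))
y = rng.normal(size=2)
x = np.linalg.pinv(A_wide) @ y
print(np.allclose(A_wide @ x, y))   # True: x solves the system exactly

# More rows than columns: there may be no exact solution;
# the pseudoinverse gives the x minimizing ||Ax - y|| (the least-squares fit).
A_tall = rng.normal(size=(5, 2))
y = rng.normal(size=5)
x = np.linalg.pinv(A_tall) @ y
x_lstsq, *_ = np.linalg.lstsq(A_tall, y, rcond=None)
print(np.allclose(x, x_lstsq))      # True: matches the least-squares solution
```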

Linear Dimensionality Reduction

The determinant of a square matrix maps matrices to real-valued scalars and gives us insight into how much multiplication by the matrix expands or contracts space. If the determinant is 0, then space is contracted completely along at least one dimension, causing it to lose all its volume. If the determinant is 1, then the transformation preserves volume.
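
For example, with a few made-up 2x2 matrices in NumPy:

```python
import numpy as np

shrink = np.array([[0.5, 0.0],
                   [0.0, 0.5]])   # scales every direction by 0.5
rotate = np.array([[0.0, -1.0],
                   [1.0,  0.0]])  # 90-degree rotation
flatten = np.array([[1.0, 2.0],
                    [2.0, 4.0]])  # rank-deficient: collapses the plane onto a line

print(np.linalg.det(shrink))    # 0.25 -> contracts area
print(np.linalg.det(rotate))    # 1.0  -> preserves area
print(np.linalg.det(flatten))   # 0.0  -> all area is lost
```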

The determinant can ultimately be used to find the eigenvalues of a matrix (via the characteristic equation). We can utilize the eigenvectors and eigenvalues of a matrix to perform principal component analysis (PCA), which is a linear dimensionality reduction technique.
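
A minimal sketch of PCA via eigendecomposition of the covariance matrix (the data is randomly generated just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))             # made-up data: 100 samples, 3 features

X_centered = X - X.mean(axis=0)           # center each feature
cov = X_centered.T @ X_centered / (len(X) - 1)   # covariance matrix

eigvals, eigvecs = np.linalg.eigh(cov)    # eigh: the covariance matrix is symmetric
order = np.argsort(eigvals)[::-1]         # sort components by explained variance
components = eigvecs[:, order[:2]]        # keep the top 2 principal components

X_reduced = X_centered @ components       # project the data onto a 2D subspace
print(X_reduced.shape)                    # (100, 2)
```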

Matrix Decomposition/Matrix Factorization

SVD

Nonnegative matrix factorization (NMF)

Truncated SVD

Stochastic Gradient Descent (SGD)

Applying SGD to NMF
