Research to the People

DL Applications

Last updated 6 years ago

Exploring high-dimensional data: t-SNE

An effective way to understand non-linear transformations

The goal is to take a set of points in a high-dimensional space and find a faithful representation of those points in a lower-dimensional space, typically the 2D plane.

The algorithm is non-linear and adapts to the underlying data, performing different transformations on different regions.

A second feature of t-SNE is a tunable parameter, “perplexity,” which (loosely speaking) balances attention between local and global aspects of your data. The parameter is, in a sense, a guess about the number of close neighbors each point has, and its value has a complex effect on the resulting pictures.

Getting the most from t-SNE may mean analyzing multiple plots with different perplexities.
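A minimal sketch of such a perplexity sweep, assuming scikit-learn's `TSNE` (the text names no particular library) and a toy two-blob dataset:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Toy dataset: two 10-D Gaussian blobs, 30 points each.
X = np.vstack([
    rng.normal(0.0, 1.0, size=(30, 10)),
    rng.normal(5.0, 1.0, size=(30, 10)),
])

# Embed the same data at several perplexities; in practice you would
# plot each embedding and compare them side by side.
embeddings = {}
for perplexity in (5, 15, 30):
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=0)
    embeddings[perplexity] = tsne.fit_transform(X)

for p, emb in embeddings.items():
    print(p, emb.shape)  # each embedding is (60, 2)
```

Note that perplexity must be smaller than the number of samples, which is why toy sweeps like this use small values.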

An additional hyperparameter to tune is the number of steps/iterations:

  • If you see a t-SNE plot with strange “pinched” shapes, chances are the process was stopped too early. Unfortunately, there’s no fixed number of steps that yields a stable result. Different data sets can require different numbers of iterations to converge.

  • A safe default for most datasets is 5,000 iterations.
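A sketch of tuning the iteration count with scikit-learn's `TSNE` (again an assumption about the library; scikit-learn also renamed the `n_iter` parameter to `max_iter` in version 1.5, so the helper below tries both):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))

def tsne_with_iters(n, **kw):
    # scikit-learn >= 1.5 uses `max_iter`; older versions use `n_iter`.
    try:
        return TSNE(max_iter=n, **kw)
    except TypeError:
        return TSNE(n_iter=n, **kw)

# Compare a minimal-length run against a longer one; a run stopped too
# early typically reports a higher final KL divergence.
short = tsne_with_iters(250, n_components=2, perplexity=10, random_state=0)
long_ = tsne_with_iters(1000, n_components=2, perplexity=10, random_state=0)
emb_short = short.fit_transform(X)
emb_long = long_.fit_transform(X)
print(short.kl_divergence_, long_.kl_divergence_)
```

The fitted estimator's `kl_divergence_` attribute is one way to check convergence numerically rather than eyeballing the plot for “pinched” shapes.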

Usually, if you re-run the algorithm on the same dataset with the same hyperparameters, you should see the same behavior, but there are occasional exceptions.

Separately, cluster sizes in a t-SNE plot don't mean anything, because the algorithm adapts its notion of “distance” to regional density variations in the data set. As a result, it naturally expands dense clusters and contracts sparse ones, evening out cluster sizes. You cannot read relative cluster sizes from a t-SNE plot, and distances between well-separated clusters may mean nothing either.
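A small demonstration of the size-evening effect, assuming scikit-learn's `TSNE`: two well-separated clusters whose spreads differ by a factor of ten in the input space typically come out much closer in size in the embedding.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Two 20-D Gaussian clusters whose spreads differ by a factor of 10.
tight = rng.normal(0.0, 1.0, size=(30, 20))
loose = rng.normal(0.0, 10.0, size=(30, 20)) + 100.0  # shifted so they are well separated
X = np.vstack([tight, loose])

emb = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(X)

def spread(points):
    # Mean distance to the cluster centroid.
    return np.mean(np.linalg.norm(points - points.mean(axis=0), axis=1))

print("input spread ratio:", spread(loose) / spread(tight))       # roughly 10
print("embedded spread ratio:", spread(emb[30:]) / spread(emb[:30]))
```

The embedded spread ratio is typically far closer to 1 than the input ratio, illustrating why apparent cluster size in a t-SNE plot is not informative.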

Low perplexity values often lead to clusters that are not statistically meaningful. Recognizing these clumps as random noise is an important part of reading t-SNE plots. However, after appropriately increasing the perplexity, t-SNE does something genuinely informative with high-dimensional normal distributions: such distributions concentrate very close to a uniform distribution on a sphere, with points evenly distributed and roughly equal spaces between them, and that is exactly what the plot shows. In this respect it is actually more accurate than a linear projection.
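The concentration claim above can be checked directly with NumPy: points drawn from a standard high-dimensional Gaussian all end up at nearly the same distance from the origin, i.e. near a sphere of radius roughly the square root of the dimension.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 1000, 500
# Points drawn from a standard d-dimensional Gaussian.
X = rng.normal(size=(n, d))

norms = np.linalg.norm(X, axis=1)
# The norms concentrate tightly around sqrt(d) ~ 31.6, so the cloud is
# close to a uniform distribution on a sphere.
print("mean norm:", norms.mean())
print("relative spread:", norms.std() / norms.mean())  # tiny
```

This is why, at a suitable perplexity, a t-SNE plot of Gaussian data looks like evenly spaced points rather than a dense center with sparse edges.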

Sometimes you can read topological information off a t-SNE plot, but that typically requires views at multiple perplexities.