Research to the People

What Data Do We Work With?

Created by Lily Vittayarukskul for SVAI research community. Open to collaborators!


Last updated 6 years ago

Introduction

Here, we dive into the data that allows our research community to draw robust, promising insights into a patient's clinical case. Depending on the type of disease the patient has, we gather a unique portfolio of data. That data may be genomic, transcriptomic, proteomic, metabolomic, and/or clinical.

At a minimum, we try to gather clinical and genomic data. Transcriptome data is the next most valuable addition.

Let's dive in.

Common Genomic Data Files

In this section, we'll dive into how the most commonly used genomic data is organized in the real world.

Whole Genome Sequencing (WGS)

Whole Exome Sequencing (WES)

Variant Calling File (VCF)

Gene Expression data

There are three main ways to measure gene expression: qPCR, microarrays, and RNA-seq. RNA-seq is currently the industry choice because, relative to the other techniques, it lets us measure differential expression across a much broader dynamic range, examine sequence variations (SNPs, insertions, deletions), and even discover new genes or alternative splice variants, all from a single dataset.

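If we're providing RNA-seq data, it'll be in a data format called FASTA/FASTQ. FASTA stores sequences only, while FASTQ stores each read together with its per-base quality scores. This article provides a really nice introduction to the entire RNA-seq curation, processing, and mapping process.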
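To make the FASTQ layout concrete, here is a minimal Python sketch of reading FASTQ records. It assumes the common four-lines-per-record layout (header starting with `@`, sequence, separator starting with `+`, quality string) and Phred+33 quality encoding. The names `FastqRecord`, `read_fastq`, and `mean_phred` are illustrative, not part of any library, and the sample read is made up.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    read_id: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream, assuming four lines per record."""
    while True:
        header = handle.readline().rstrip("\r\n")
        if not header:
            return  # end of file (or a blank line) ends the stream
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"Malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"Sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


def mean_phred(quality: str, offset: int = 33) -> float:
    """Average base quality, assuming Phred+33 encoding by default."""
    if not quality:
        return 0.0
    return sum(ord(char) - offset for char in quality) / len(quality)


# Made-up example read; with real data, pass an open file handle instead.
example = io.StringIO("@read1\nACGT\n+\nII!!\n")
for record in read_fastq(example):
    print(record.read_id, record.sequence, mean_phred(record.quality))
```

For real projects, an established parser (for example Biopython's `SeqIO`) is likely a better choice than a hand-rolled reader; the sketch is only meant to show what a record contains.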

What's most relevant to you, however, is probably just mapping the RNA-seq reads to a reference, or using alignment-free methods if you have a really high coverage rate. STAR is a great option for reference mapping, and Salmon for alignment-free quantification (Salmon skips full alignment but still needs a reference transcriptome to quantify against). Note that there are a lot of really great, easy-to-use alignment software packages!
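As a rough sketch of what invoking these two tools can look like, the Python helpers below assemble typical paired-end command lines for STAR and Salmon. Everything about the inputs is an assumption: the paths are hypothetical, the reads are assumed to be gzipped, and both tools need an index built beforehand (`STAR --runMode genomeGenerate` and `salmon index`). The helper names `star_align_command` and `salmon_quant_command` are illustrative only. The code builds and prints the commands without running them.

```python
import shlex
from typing import List


def star_align_command(genome_dir: str, reads_1: str, reads_2: str,
                       out_prefix: str, threads: int = 8) -> List[str]:
    """Build a STAR command that aligns paired-end reads to a genome index."""
    return [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,            # pre-built STAR genome index
        "--readFilesIn", reads_1, reads_2,
        "--readFilesCommand", "zcat",         # assumes gzipped FASTQ input
        "--outSAMtype", "BAM", "SortedByCoordinate",
        "--outFileNamePrefix", out_prefix,
    ]


def salmon_quant_command(index_dir: str, reads_1: str, reads_2: str,
                         out_dir: str, threads: int = 8) -> List[str]:
    """Build a Salmon command that quantifies transcripts without alignment."""
    return [
        "salmon", "quant",
        "-i", index_dir,                      # pre-built transcriptome index
        "-l", "A",                            # let Salmon infer the library type
        "-1", reads_1,
        "-2", reads_2,
        "-p", str(threads),
        "-o", out_dir,
    ]


# Hypothetical paths, shown only to illustrate the resulting command lines.
star_cmd = star_align_command("star_index", "sample_R1.fastq.gz",
                              "sample_R2.fastq.gz", "star_out/sample_")
salmon_cmd = salmon_quant_command("salmon_index", "sample_R1.fastq.gz",
                                  "sample_R2.fastq.gz", "salmon_out/sample")
print(shlex.join(star_cmd))
print(shlex.join(salmon_cmd))
```

To actually run one of these, pass the list to `subprocess.run(cmd, check=True)` on a machine where the tool is installed. Check each tool's manual for the options that suit your library and read layout.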

Clinical Data

(soon to be added!)

Metabolomic Data

(soon to be added!)

Here's an example folder to practice alignment using TopHat2.