What Data Do We Work With?

Created by Lily Vittayarukskul for the SVAI research community. Open to collaborators!

Introduction

Here, we dive into the data that lets our research community draw robust, promising insights into a patient's clinical case. Depending on the type of disease the patient has, we gather a unique portfolio of data, which may be genomic, transcriptomic, proteomic, metabolomic, and/or clinical.

At a minimum, we try to gather clinical and genomic data. Transcriptomic data is the next most valuable addition.

Let's dive in.

Common Genomic Data Files

In this section, we'll dive into how the most commonly used genomic data is organized in the real world.

Whole Genome Sequencing (WGS)

Whole Exome Sequencing (WES)

Variant Calling File (VCF)

Gene Expression data

There are three main ways to analyze gene expression: qPCR, microarrays, and RNA-seq. RNA-seq is now the industry choice because, relative to the other techniques, it lets us measure differential expression over a much broader dynamic range, examine sequence variations (SNPs, insertions, deletions), and even discover new genes or alternative splice variants, all from a single dataset.

If we're providing RNA-seq data, it'll be in FASTA or FASTQ format: FASTA stores sequences only, while FASTQ also stores a quality score for each base. This article provides a really nice, comprehensive introduction to the entire RNA-seq curation, processing, and mapping process.
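To make the FASTQ layout concrete, here is a minimal Python sketch of a reader. It assumes the common four-line record layout (header starting with `@`, sequence, `+` separator, quality string) and Phred+33 quality encoding. The names `FastqRecord`, `parse_fastq`, and `phred_scores` are made up for illustration, and files that wrap sequences across several lines would need a more careful parser.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    """One read: identifier, bases, and the raw quality string."""
    read_id: str
    sequence: str
    quality: str


def parse_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream, assuming four lines per record."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"Malformed FASTQ record near: {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"Sequence/quality length mismatch in: {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a quality string into integer scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality]


# Hypothetical two-read example, inlined so the sketch runs without a file.
example = "@read1\nGATTACA\n+\nIIIIHHH\n@read2\nACGT\n+\n!!!!\n"
records = list(parse_fastq(io.StringIO(example)))
for record in records:
    print(record.read_id, record.sequence, phred_scores(record.quality))
```

For real data, the same function can be pointed at an open file handle (or one returned by `gzip.open(path, "rt")` for compressed files) instead of the inline string.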

However, what's probably most relevant to you is mapping the RNA-seq data to a reference, or using alignment-free methods if you have really high coverage. STAR is a great option for reference mapping, and Salmon for alignment-free quantification. Note that there are a lot of really great, easy-to-use alignment software packages!
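As a rough sketch of what running these two tools looks like, the Python helpers below only assemble the command lines for paired-end reads; they do not execute anything. It assumes a STAR genome index and a Salmon transcriptome index have already been built. All file and directory names are placeholders, and the helper names `star_align_command` and `salmon_quant_command` are hypothetical. Check each tool's manual for the options that fit your data (for example, STAR needs `--readFilesCommand zcat` for gzipped reads).

```python
import shlex
from typing import List


def star_align_command(genome_dir: str, read1: str, read2: str,
                       out_prefix: str, threads: int = 4) -> List[str]:
    """Build a STAR command that aligns paired-end reads to an indexed genome."""
    return [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,          # directory holding a prebuilt STAR index
        "--readFilesIn", read1, read2,      # mate 1 and mate 2 FASTQ files
        "--outSAMtype", "BAM", "SortedByCoordinate",
        "--outFileNamePrefix", out_prefix,
    ]


def salmon_quant_command(index_dir: str, read1: str, read2: str,
                         out_dir: str, threads: int = 4) -> List[str]:
    """Build a Salmon command that quantifies transcripts without full alignment."""
    return [
        "salmon", "quant",
        "-i", index_dir,    # prebuilt Salmon transcriptome index
        "-l", "A",          # let Salmon infer the library type
        "-1", read1,
        "-2", read2,
        "-p", str(threads),
        "-o", out_dir,
    ]


# Placeholder paths, for illustration only.
star_cmd = star_align_command("star_index", "sample_R1.fastq",
                              "sample_R2.fastq", "sample_")
salmon_cmd = salmon_quant_command("salmon_index", "sample_R1.fastq",
                                  "sample_R2.fastq", "sample_quant")
print(shlex.join(star_cmd))
print(shlex.join(salmon_cmd))
```

Once the tools are installed, either list can be passed to `subprocess.run(cmd, check=True)` to actually run the step.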

Here's an example folder to practice alignment using TopHat2.

Clinical Data

(soon to be added!)

Metabolomic Data

(soon to be added!)
