Recommended: External Data Sources

Created by Lily Vittayarukskul for the SVAI research community. Open to collaborators!

Introduction

Databases are listed under the domain whose data they cover. Some databases are general-purpose, so they appear under multiple domains.

Cancer Databases

TCGA

The Broad Institute's Firehose database is the best place to gather relevant TCGA cancer data.

COSMIC - Catalogue of Somatic Mutations in Cancer

Source can be found here.

"The current set of mutational signatures is based on an analysis of 10,952 exomes and 1,048 whole-genomes across 40 distinct types of human cancer. These analyses are based on curated data that were generated by The Cancer Genome Atlas (TCGA), the International Cancer Genome Consortium (ICGC), and a large set of freely available somatic mutations published in peer-reviewed journals. Complete details about the data sources will be provided in future releases of COSMIC."

Important Note: This catalogue only includes SNPs. Future versions of COSMIC will include other variation types (e.g., indels, structural rearrangements, and localized hypermutation such as kataegis) and additional cancer samples. With more cancer genome sequences and the additional statistical power they bring, new signatures may be found, the profiles of current signatures may be further refined, signatures may split into component signatures, and signatures may be found in cancer types in which they are currently not detected.
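Because the catalogue is built from single-base substitutions, signature analyses conventionally report each substitution from the pyrimidine (C or T) side, which gives six classes (C>A, C>G, C>T, T>A, T>C, T>G), and then split each class by its two flanking bases into 96 trinucleotide contexts. The sketch below only illustrates that bookkeeping. The function name and its input format are assumptions made for this example and are not part of any COSMIC tool.

```python
# Complement table for the four DNA bases.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def substitution_context(five_prime, ref, alt, three_prime):
    """Return a pyrimidine-centered context label such as 'A[C>T]G'.

    Hypothetical helper: all four arguments are single bases read on the
    same strand (5' flank, reference base, alternate base, 3' flank).
    """
    bases = (five_prime, ref, alt, three_prime)
    if any(b not in COMPLEMENT for b in bases) or ref == alt:
        raise ValueError("expected a single-base substitution over A/C/G/T")
    if ref in ("A", "G"):
        # Purine reference: describe the change on the opposite strand.
        # That means complementing every base and swapping the two flanks.
        five_prime, three_prime = COMPLEMENT[three_prime], COMPLEMENT[five_prime]
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{five_prime}[{ref}>{alt}]{three_prime}"


# A C>T change and the equivalent G>A change seen from the other strand
# collapse to the same label.
print(substitution_context("A", "C", "T", "G"))
print(substitution_context("C", "G", "A", "T"))
```

Counting how often each of the 96 labels occurs in a sample gives the kind of mutation profile that signature analyses work from.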

CMap on the Cloud: https://clue.io/

Data and Tools

The CMap dataset of cellular signatures catalogs transcriptional responses of human cells to chemical and genetic perturbation. Here you can find the 1.3M L1000 profiles and the tools for their analysis.

A total of 27,927 perturbagens have been profiled to produce 476,251 expression signatures. About half of those signatures make up the Touchstone (reference) dataset generated from testing well-annotated genetic and small-molecule perturbagens in a core panel of cell lines. The remainder make up the Discover dataset, generated from profiling uncharacterized small molecules in a variable number of cell lines.

Metabolic Disease Databases

RAMEDIS, the Rare Metabolic Diseases Database

The RAMEDIS system is a platform-independent, web-based information system for rare metabolic diseases based on filed case reports.

Tools/Databases crowdsourced from the p1RCC hackathon:

Tools/databases

- The SnpEff reports that are in the same folder as the VCF files appear to already annotate synonymous/missense/nonsense variants

- ClinVar has a list of known clinically relevant variants

- The COSMIC database mentioned above

- Variant effect predictor: https://uswest.ensembl.org/info/docs/tools/vep/index.html

- Personal Cancer Genome Reporter: https://github.com/sigven/pcgr

- PhenVar has been mentioned - https://phenvar.colorado.edu/

- I have also heard of SNPnexus - http://www.snp-nexus.org/

- DNA conservation tracks can be a good indicator of whether a sequence is important (if a sequence is conserved across species, it's likely important). I believe CADD scores variants according to conservation info: http://cadd.gs.washington.edu/

- If you are looking at mutations in non-coding regulatory elements, you can use a tool like DeepSEA to see if the mutation is disrupting biological processes such as the binding of a transcription factor: http://deepsea.princeton.edu/job/analysis/create/

- You can also look at the region in the genome browser to see if the mutation is disrupting anything interesting. UCSC genome browser: https://genome.ucsc.edu/, WashU genome browser: https://epigenomegateway.wustl.edu/

- The ENCODE project tries to categorize non-coding regulatory elements. https://www.encodeproject.org/
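As a follow-up to the SnpEff bullet above, the synonymous/missense/nonsense calls can also be tallied directly from an annotated VCF. This is a minimal sketch, assuming the VCFs were annotated by a SnpEff version that writes the `ANN` INFO field (older versions wrote `EFF` instead), where the second `|`-separated subfield is the effect term (`synonymous_variant`, `missense_variant`, `stop_gained` for nonsense, and so on). The function name, gene names, and example records are made up for illustration.

```python
from collections import Counter


def count_snpeff_effects(vcf_lines):
    """Tally the effect term of the first ANN entry of each variant.

    Hypothetical helper: `vcf_lines` is any iterable of VCF text lines.
    Variants without an ANN field are skipped.
    """
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        for entry in fields[7].split(";"):  # INFO column
            if entry.startswith("ANN="):
                # Multiple annotations are comma-separated; take the first.
                subfields = entry[4:].split(",")[0].split("|")
                if len(subfields) > 1:
                    counts[subfields[1]] += 1
                break
    return counts


# Made-up records, only to show the expected input shape.
example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tC\tT\t.\tPASS\tDP=30;ANN=T|missense_variant|MODERATE|GENE1|GENE1|transcript|TX1|protein_coding",
    "1\t250\t.\tG\tA\t.\tPASS\tANN=A|stop_gained|HIGH|GENE1|GENE1|transcript|TX1|protein_coding",
    "2\t500\t.\tT\tC\t.\tPASS\tANN=C|synonymous_variant|LOW|GENE2|GENE2|transcript|TX2|protein_coding",
    "2\t900\t.\tG\tA\t.\tPASS\tDP=12;ANN=A|missense_variant|MODERATE|GENE2|GENE2|transcript|TX3|protein_coding,A|upstream_gene_variant|MODIFIER|GENE3|GENE3|transcript|TX4|protein_coding",
    "3\t42\t.\tA\tG\t.\tPASS\tDP=8",
]

counts = count_snpeff_effects(example_vcf)
for effect, n in counts.most_common():
    print(effect, n)
```

To run it on a real file, pass an open file handle, e.g. `count_snpeff_effects(open(path))`, where `path` points to one of the annotated VCFs.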
