Genome Analysis: The Basics

Created by Lily Vittayarukskul for SVAI research community. Open to collaborators!

Note: Content adapted from Av Shrikumar's p1RCC collective brainstorm document. (See p1RCC page).

What is genomic data?

Um, let me get back to you on that loaded question.

Somatic vs. Germline Mutations

Germline mutations are, put simply, the variants that "make Bill Bill". These are variants that he inherited from his parents. They are called "germ line" because they were in the original fertilized egg that developed into Bill. These mutations (or "variants") are identified by looking at the whole blood sample (folder "Buckets/p1rcc/rarekidneycancer_patient_0/F18FTSUSAT0015_HUMaasR/WBA") and comparing them to a "reference" human genome.

Somatic mutations are those that arose over the course of Bill's life, and are presumably what resulted in the tumour (since he has no family history of this). These mutations can be identified by comparing the variants identified in the whole blood sample (folder "Buckets/p1rcc/rarekidneycancer_patient_0/F18FTSUSAT0015_HUMaasR/WBA") to the variants identified in the tumour ("Buckets/p1rcc/rarekidneycancer_patient_0/F18FTSUSAT0015_HUMaasR/T1_1A"). The variant calls for this are at "Buckets/p1rcc/somatic". They were produced with the tool https://github.com/Illumina/strelka.
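The comparison idea above can be sketched in a few lines of Python. This is only a conceptual illustration, assuming two plain-text VCF files already read into lists of lines; the function names are hypothetical. A dedicated somatic caller such as Strelka models the tumour and normal samples jointly rather than simply subtracting one call set from the other.

```python
def variant_keys(vcf_lines):
    """Collect (chrom, pos, ref, alt) keys from the data lines of a VCF."""
    keys = set()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        chrom, pos, _id, ref, alt = line.split("\t")[:5]
        for allele in alt.split(","):  # one key per alternate allele
            keys.add((chrom, int(pos), ref, allele))
    return keys


def somatic_candidates(blood_lines, tumour_lines):
    """Naive sketch: variants called in the tumour but not in the blood."""
    return variant_keys(tumour_lines) - variant_keys(blood_lines)
```

Variants present in both samples are treated as germline; whatever remains only in the tumour is a candidate somatic mutation that still needs quality filtering.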

"What makes a mutation 'important'? What does 'variant impact' mean?"

Not all mutations (or "variants") that occur in our genome have a deleterious effect. Let's talk about different types of variants.

Coding Variants

These are variants that occur in the DNA sequence encoding for proteins - i.e. "exons". Only about 2% of your genome encodes for proteins. When a protein is produced in a particular cell, the gene is said to be "expressed". A larger fraction (~20%) of your genome is involved in regulating when genes are turned on or off. And a large part is still being figured out.

Disease-causing mutations tend to be those that occur in the coding sequence of a gene. These are divided into "synonymous" and "non-synonymous" mutations. Synonymous mutations are those that do not actually affect the amino acid sequence of the protein. This occurs because multiple DNA sequences can map to the same amino acid sequence (for example, the codons GAA and GAG both encode glutamate). About a third of DNA substitutions occurring in coding sequences are synonymous, and these do not tend to be harmful.

"Non-synonymous" mutations, on the other hand, do alter the amino acid sequence and thus do tend to be harmful - how harmful they are depends on exactly how they alter the amino acid sequence. There are several types of "non-synonymous" mutations.

(1) "Missense" mutations cause one amino acid to be substituted with another. Sometimes, the substitution may replace one amino acid with another amino acid that has similar chemical properties, and those missense mutations may not be too bad. However, missense mutations that cause an amino acid to be replaced with another that has very different chemical properties can be disastrous. Missense mutations usually result from single-base substitutions (SNPs - or single nucleotide polymorphisms).

(2) Nonsense mutations replace the codon for an amino acid with what is called a "stop codon" - i.e. they cause the protein sequence to be truncated prematurely. As you can imagine, if this truncation happens early in the protein sequence, the effect can be very harmful - far more harmful than missense mutations on average. If the truncation happens very late in the protein sequence, the effect may not be too bad depending on how lucky you are.

(3) Frameshift mutations are insertions or deletions that change the translational reading frame of the coding sequence and result in a different protein sequence. Amino acids are represented in the DNA sequence by “codons”, which are sets of 3 DNA bases - in other words, the DNA is read 3 bases at a time. If you insert or delete a number of bases that is not a multiple of 3, you disrupt the ‘reading frame’, meaning that the remaining amino acids are likely to all be read incorrectly.

(4) In-frame insertions or deletions insert or delete one or more codons from the DNA (a “codon” is a set of 3 DNA bases that map to an amino acid). The protein is generally expressed normally, but has one or more amino acids inserted or deleted.

Non-synonymous mutations are much more likely to be disease-causing than synonymous mutations. Also, nonsense and frameshift mutations are much more likely to be disease-causing than in-frame insertions/deletions or missense mutations.
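To make the categories above concrete, here is a minimal Python sketch that classifies a change in a coding sequence. It assumes the standard genetic code, a coding sequence given on the coding strand starting at the first base of a codon, and 0-based positions; the function names are made up for illustration. Real annotation tools handle many more cases (strand, multiple transcripts, stop-loss, start-loss and so on).

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; "*" is a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def classify_substitution(cds, pos, alt):
    """Classify a single-base substitution at 0-based position `pos` of `cds`."""
    start = pos - pos % 3                      # first base of the affected codon
    ref_codon = cds[start:start + 3]
    offset = pos % 3
    alt_codon = ref_codon[:offset] + alt + ref_codon[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"
    return "missense"  # simplification: a lost stop codon also lands here


def classify_indel(ref, alt):
    """Classify an insertion/deletion by whether it preserves the reading frame."""
    delta = len(alt) - len(ref)
    return "in-frame indel" if delta % 3 == 0 else "frameshift"


cds = "ATGGAATGGTAA"  # toy sequence: Met-Glu-Trp-stop
print(classify_substitution(cds, 5, "G"))  # GAA -> GAG
print(classify_substitution(cds, 3, "A"))  # GAA -> AAA
print(classify_substitution(cds, 3, "T"))  # GAA -> TAA
print(classify_indel("A", ""))             # one base deleted
```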

When you identify a coding variant, it may be good to check that the gene affected by the coding variant is in fact active in the kidney cells. Disrupting the protein sequence of a gene that isn't even turned on in the kidney cell would be unlikely to make a difference. This is where the TCGA gene expression data could come in handy.
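As a rough sketch of that check, the snippet below keeps only variants whose gene is expressed above some cut-off. Everything here is an assumption for illustration: the gene names are placeholders, the expression values are invented, and the threshold of 1.0 is arbitrary rather than a recommended value. How the TCGA expression data is actually loaded and normalised is not covered.

```python
def expressed_variants(variants, expression, min_level=1.0):
    """Keep variants whose gene has expression >= min_level (hypothetical units)."""
    return [v for v in variants if expression.get(v["gene"], 0.0) >= min_level]


# Placeholder data, not real measurements.
variants = [{"gene": "GENE_A", "effect": "missense"},
            {"gene": "GENE_B", "effect": "nonsense"}]
kidney_expression = {"GENE_A": 25.0, "GENE_B": 0.0}
print(expressed_variants(variants, kidney_expression))
```

Genes missing from the expression table are treated as not expressed, which is a conservative choice you may want to revisit.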

There exist tools to automatically score the impact of variants that occur in coding sequences. The snpEff reports included in the same folder as the VCF files appear to give a summary of synonymous vs. missense vs. nonsense mutations, so they could be a good starting point after we figure out how to filter the VCF files for the variants that are likely to be real.
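If you want to tally effects yourself rather than rely on the report, something like the following could work. It is a sketch under several assumptions: that the VCF has been annotated by a snpEff version that writes the pipe-separated ANN field into the INFO column (older versions use a different EFF field), that the first annotation listed is the one of interest, and that keeping only records whose FILTER column is PASS is a reasonable first filter. The function name is hypothetical.

```python
from collections import Counter


def count_effects(vcf_lines, passing_only=True):
    """Count the first snpEff annotation of each variant in an annotated VCF."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a full VCF record
        if passing_only and fields[6] != "PASS":
            continue  # FILTER column: drop calls that failed the caller's filters
        info = dict(kv.split("=", 1) for kv in fields[7].split(";") if "=" in kv)
        ann = info.get("ANN")
        if not ann:
            continue
        parts = ann.split(",")[0].split("|")  # first annotation, pipe-separated
        if len(parts) > 1:
            counts[parts[1]] += 1             # second sub-field is the effect name
    return counts
```

Calling `count_effects(open(path))` on an annotated file (where `path` is wherever your VCF lives) would give counts keyed by effect names such as `missense_variant` or `stop_gained`.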

Splicing variants

The DNA sequence of a gene consists of "exons" and "introns". After the mRNA sequence of a gene is first created, the introns are removed and the exons are stitched together to form the final mRNA "transcript" (the one that gets converted into a protein). Different combinations of exons may be stitched together to produce slightly different transcripts. This process of stitching together exons is called "splicing", and it is an important part of gene regulation. A mutation that occurs in an intron can indirectly affect the final coding sequence of a gene by disrupting the splicing process; it can cause exons to get stitched together in the wrong way, or even cause introns to be retained when they shouldn't be. I am not an expert on splicing, but there exist computational models to score the impact of splicing variants and they can be worth looking into if you find mutations in introns.

Intronic mutations involving the first two nucleotides of the intron (GT) and the last two nucleotides of the intron (AG) almost always disrupt splicing, and are much more likely to be disease-causing than variants in other parts of the intron.
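A small sketch of how one might flag these positions is shown below. It assumes the intron is described by 1-based inclusive coordinates and that the gene lies on the forward strand (for a reverse-strand gene the two ends swap roles); both function names are hypothetical.

```python
def has_canonical_dinucleotides(intron_seq):
    """True if the intron starts with GT and ends with AG."""
    seq = intron_seq.upper()
    return seq[:2] == "GT" and seq[-2:] == "AG"


def canonical_splice_site_hit(intron_start, intron_end, variant_pos):
    """Return which canonical splice site a position falls in, if any.

    Coordinates are 1-based and inclusive; forward strand is assumed.
    """
    if variant_pos in (intron_start, intron_start + 1):
        return "donor"      # first two bases of the intron (GT)
    if variant_pos in (intron_end - 1, intron_end):
        return "acceptor"   # last two bases of the intron (AG)
    return None


print(has_canonical_dinucleotides("GTAAGTCCTTTTCAG"))
print(canonical_splice_site_hit(101, 200, 102))
print(canonical_splice_site_hit(101, 200, 150))
```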

Non-coding regulatory variants

While many harmful mutations disrupt protein sequences, a good proportion can occur in regions that are important for determining when the proteins are expressed. These variants occur in regions known as "promoters" (which are sequences that lie upstream of a gene and are responsible for determining when the gene is turned on/off) and "enhancers" (which are sequences that don't necessarily lie close to a gene but are still responsible for turning it on/off because they interact with the gene when the genome folds in 3D space). The ENCODE project tries to categorize these regulatory elements and may be a good place to start. However, if you don't have a lot of bio experience, it may be good to focus on the coding variants, which are simpler to annotate.

How is mutation impact assessed?

It is important to consider the processes through which a variant becomes annotated as deleterious. ClinVar (https://www.ncbi.nlm.nih.gov/clinvar/) maintains a list of variants that are known to cause disease in humans. The COSMIC database (https://cancer.sanger.ac.uk/cosmic) contains a similar set of recurrent variants in cancer.

Because we still have much to discover, many variants haven't been studied and annotated in full. There are prediction programs like SIFT, CADD, PolyPhen, and others, many of which train models on known disease-causing variants so that they can predict whether Variants of Unknown Significance (VUS) are damaging.

Remember that the MOST damaging variants NEVER show up in the databases, because they are incompatible with human life: no carrier survives to be included in a research study, so the variant never makes it into a database.

For this reason, geneticists have developed a set of rules for predicting which novel variants are most likely to be disease causing. The most important factors are the predicted effect (nonsense, frameshift, splice site variant, gene/exon deletions) and gene location.
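One way to turn such rules into a first-pass ordering is a simple tier table, sketched below. The tiers follow the rough ordering described on this page (nonsense, frameshift, splice site and exon deletions first; missense and in-frame indels next; synonymous last), but the tier numbers, effect labels and function name are assumptions for illustration, not an established scoring scheme.

```python
# Lower tier = look at it first. Labels and tiers are illustrative only.
EFFECT_TIER = {
    "nonsense": 0, "frameshift": 0, "splice_site": 0, "exon_deletion": 0,
    "missense": 1, "inframe_indel": 1,
    "synonymous": 2,
}


def prioritise(variants):
    """Sort variants by predicted-effect tier; unknown effects go last."""
    return sorted(variants, key=lambda v: EFFECT_TIER.get(v["effect"], 3))


calls = [{"gene": "GENE_A", "effect": "synonymous"},
         {"gene": "GENE_B", "effect": "missense"},
         {"gene": "GENE_C", "effect": "frameshift"}]
for v in prioritise(calls):
    print(v["gene"], v["effect"])
```

Because Python's sort is stable, variants in the same tier keep their original order, so a secondary ordering (for example by gene) can be applied beforehand.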
