Genomic Applications
Created by Lily Vittayarukskul for SVAI research community. Open to collaborators!
Last updated
Created by Lily Vittayarukskul for SVAI research community. Open to collaborators!
Last updated
In this section, we'll dive into how the most commonly used genomic data is organized in the real world.
Let's say we have a person named Kat. Kat has been diagnosed with a rare form of lung cancer, and the SVAI community wants to help her out. As scientists and researchers, we might be curious about whether Kat's DNA has some clues to help us better understand her cancer.
One major clue may exist in the DNA code itself -- cancer is commonly known to alter the regulation of genes or genes themselves in order for cancer to grow in a more favorable environment. In order to figure out if her DNA has been altered in relation in cancer, we classify these types of changes (aka 'variations') under somatic mutations -- mutations introduced into the DNA anytime after an individual is born. However, we don't want to find all mutations introduced to the genome after birth, but the mutations specifically linked to cancer. These mutations may be linked to the introduction of cancer and allowed it to progress to later stages.
The process to robustly uncover somatic mutations related to cancer can be illustrated below:
The basic idea behind this 'subtracting' logic is that the somatic mutations related to cancer is isolated where we see the phenotype of the cancer -- her lung region. For all localized cancer phenotypes, the assumption is that the cancer mutations is isolated to that region only. Then, we can assume that her blood DNA lacks the somatic cancerous mutations and we can compare the tumor DNA to 'normal DNA' (in this case, blood DNA) and the variations unique to the tumor DNA are the mutations that related to Kat's lung cancer.
After finding these 'uniquely' tumor mutations, we need to further filter out common variants existing in the general population because these variants are most likely harmless to due the fact that harmful variants in the population don't exist very long (aka 'negative selection'). Survival of the fittest ya know.
And viola! You've got your somatic variants of interest. Now let's see this baby in action:
The source code can be found here. Thanks to C3 team for your elegant code!
First, we need to important libraries to use their powerful tools:
vcf is <>. pandas is <>
Next, let's define a function, getInfo
, that extracts important details about our variants:
Special Note: If you need to find the sample id of the vcf file, this can be done via accessing the metadata of the vcf file:
vcf_reader = vcf.Reader(open('<vcf_filename>', 'r'))
vcf_reader.samples
Now, let's read in our tumor variant DNA, blood variant DNA, and population variant DNA (aka 'reference' variant DNA) and for each file, let's apply our function, getInfo
, to extract the important variant details and save it to a variable called my_dict
. To efficiently store the data into a dataframe and easily access that dataframe without reloading the original file again, we call the function, to_pickle()
, to serialize the dataframe object:
Cool beans. Now that we have a function to extract out the type of variant metadata we want and we have our data, let's deserialize our dataframe objects and start 'subtracting':
Now that we have our variants of interest, theoretically we'd be done, but alas, we exist in the real world, and the variants we recorded in our vcf files may not be real variants. To address this issue, our vcf file contains the following columns, 'AD' and 'GQ' for quality control. We basically want to drop variants that have less than 15 reads supporting the alternate allele ( AD<15 ) and also drop variants called at a confidence level of less than 95% (GQ<95):
Finally, we're done!! πLet's save our hard work:
Important note: If our feature set (number of variants in our cancer_only
dataframe) is too big to run further computations, a statistically clean option to reduce our dataframe would be to randomly drop some variants:
(Source paper for best practices can be found here.)
The Gene Ontology (GO) is a controlled vocabulary composed of >38 000 precise defined phrases called GO terms that describe the molecular actions of gene products, the biological processes in which those actions occur and the cellular locations where they are present.
The annotation object or molecular entity are those defined by the Sequence Ontology [(13), http://www.sequenceontology.org] and includes complex, gene, gene_product, miRNA, ncRNA, protein, protein_complex, protein_structure, RNA, rRNA, snoRNA, snRNA, transcript, tRNA and polypeptide. While annotations are typically created for chromosomal features, such as a gene for its protein or ncRNA product, other types of objects can be annotated including groups of gene products that make a complex. The annotation object can be associated to a GO term from one or more of the three aspects of the GO (Molecular Function, Biological Process and Cellular Component). A gene product is the most common object of annotation, and all such objects require a stable identifier such as those specified by sequence databases maintained by European Bioinformatic Institute (EBI) and National Center for Biotechnology Information (NCBI). Model Organism Databases (MODs) also maintain unique identifiers that often represent specific types of molecular entities such as RNA transcripts that often do not have an identifier from one of the archival repositories.
GO terms can be assigned to gene products on the basis of sequence similarity using the evidence code 'Inferred from Sequence or structural Similarity' (ISS) with a custom reference, GO_Reference (GO_REF:0000024), as described in the next section. Potential homologs are initially identified using sequence similarity search programs such as BLAST. The significance of the sequence similarity is then verified manually using a combination of sequence resources and analysis tools, including phylogenetic and comparative genomics databases such as Ensembl Compara (16), INPARANOID (17) and OrthoMCL (18).
Although the similarity criteria required to make these annotations are defined by the annotating group, the GOC has established several rules for making these assignments. They are as follows:
Mandatory inclusion of a stable database identifier that identifies the similar gene/gene product in the βWITH/FROMβ field (column 8 in Table 3)
The similar gene must be experimentally characterized; to avoid circular inferences, the GO term should only be assigned if the similar gene/gene product is, or can be annotated, with the same term (or a more specific child term) using an experimental evidence code (e.g. Inferred from Direct assay, IDA; Inferred from Mutant Phenotype, IMP; Inferred from Genetic Interaction, IGI, Inferred from Physical Interaction, IPI; Inferred from Expression Pattern, IEP). Annotations made with the NOT qualifier should not be transferred.
Sequence characteristics can be used to infer GO annotations for all three aspects of the ontology. However, care should be taken when transferring biological process annotations, as cellular processes and metabolic processes, for example, may be more readily inferred from sequence similarity than developmental processes which may be species- or clade-specific
Molecular function describes activities, such as catalytic, binding or transporter activities, at the molecular level. GO molecular function terms describe activities rather than the entities (complexes, gene products or molecules) that perform the actions. In addition sequence comparison methods (similarity to other organisms) are often used to predict the molecular function of a gene product because functions are often associated with conserved protein domains (see Figure 2 to compare evidence from experimental and nonexperimental results).
The key question to ask when selecting a Molecular Function term is whether the experimental results show βhowβ the gene product accomplishes its role.
The Molecular function ontology also contains terms that describe proteinβprotein interactions. A rule of thumb is to determine whether the gene product being annotated is accomplishing a biological purpose by binding to another protein: if so, protein binding could be one of its functions.
Furthermore, it's unnecessary to co-annotate the binding of the substrates, cofactors or products, as the enzymatic activity is defined by the compounds being bound, if only in a transition state.
Biological process describes biological goals accomplished by one or more ordered assemblies of molecular functions. A biological process is not equivalent to a pathway. Specifically it does not represent any of the dynamics or dependencies that would be required to describe a pathway.
On occasion when authors present experimental results for a gene productβs role in a specific type of process, they then extrapolate to infer its role in other related processes.
Direct versus indirect effect. Many GO Biological Process annotations are assertions based upon mutant phenotypes. When annotating based upon mutant phenotype results, it can be difficult to discern if a gene product is directly involved in the process for which the authors screened (assayed) or if its absence instead results in an indirect or downstream effect. (For example, if any of the S. cerevisiae proteins involved in βRNA splicingβ [GO:0008380] are mutated, translation is affected. This is a downstream effect because most of the genes encoding ribosomal proteins have introns (example, yeast ribosomal genes RPL2A, RPL2B, RPS11A, RPS11B) and if splicing genes are mutated, these ribosomal genes are not processed thereby affecting ribosomal assembly and hence translation.)
The βresponse toβ GO terms are intended to annotate gene products that are required for the response to occur and are a direct result of the organismβs reaction to the stimuli (e.g. production of a gene product used to degrade a toxin or signaling to initiate immune cell division in response to a parasite).
Regulation of a biological process is defined as a role that modulates the frequency, rate or extent of that process.
Cellular Component describes locations, at the levels of subcellular structures and macromolecular complexes.
Let's say we have a person named Kit. Kit has been diagnosed with a rare form of brain cancer, and the SVAI community wants to help him out. Let's say we already obtained our list of genes that may be linked to Kit's brain cancer. As scientists and researchers, we might be curious about whether Kit's DNA has some clues to help us better understand his cancer.
First, we need to get the relevant GO annotations for Kit's type of cancer. Since he has a form of brain cancer, the most relevant types of data from the Broad Institute site are:
Glioblastoma multiforme
Glioma
Under these datasets, we specifically want annotated point mutations and indels with functional data relevant to cancer researchers. Annotations include gene names, functional consequence (e.g. Missense), PolyPhen-2 predictions, and cancer-specific annotations from resources such as COSMIC, Tumorscape, and published MutSig results. This type of annotation is available via:
In the table, under the column 'Disease Name', located the row containing your disease of interest (e.g. for our example case, we're interest in brain cancers, Glioblastoma multiforme and Glioma).
Then in the same row(s), click on the link under the 'Data' column.
4. Now you should be directed to a window that looks like this:
5. Since we're interested in the genes associated with cancer, and we've already chosen the mutated genes unique to the tumor sample, we want to know how the mutation may have affected our biological and molecular processes that contributed to the cancer development and progression. This relevant data is under the section 'Mutation Annotation File', attached to the file name 'Mutation_Packager_Oncotated_Calls'.
6. Now that we've downloaded and unzipped our data of interest, we can extract the relevant mutation data associated to our gene of interest. C3 team has a great example:
(TODO: Look at code carefully, and have a more guidance code walkthrough. If have time, convert code to Python)
Separate Python version (create code for 1 + 2, talk through 3):
import dataset of interest, including reference
grab cols for feature matrices, save the files
Steps forwards with feature matrices: generate mutation signatures via deconstruct_sigs_py (installing is currently broken), downstream PCA and cluster and find the cluster that most identified with; consider additional pathways for the other feature matrices
Explain out the different feature matrices to
Look at best practices applied with C3's approach to classifying the signature and deriving molecular relevance: https://github.com/SVAI/C3/blob/master/C3finalpresentation.pdf
(consider if t-SNE is a better classifier)
Broad firehose database is best to gather relevant cancer data.