Cancer Analysis Approaches: Bio/AI
Created by Lily Vittayarukskul for the SVAI research community. Open to collaborators!
Overview of AI applied to Cancer
This overview is drawn from this paper.
Before this deep-dive, if you need a refresher on cancer fundamentals, check out this page.
Traditional Machine Learning for Cancer
Traditional machine learning algorithms such as decision trees, random forests (RF), artificial neural networks (ANN), and support vector machines (SVM) have been successfully applied to build predictive models for various aspects of cancer, including:
prognosis of cancer
classification of cancer types from data sources such as clinical data, SNPs, and gene expression.
Deep learning for Cancer
Recently, deep learning has shown remarkable performance in predicting:
the specificity of DNA and mRNA binding sites
functional classification
protein folding pattern
cancer categorization
This paper dives into the automatic detection and prediction of oncogenes and tumor suppressor genes from their three-dimensional features, which is a big step toward discovering their structural characteristics and improving the state of the art in cancer treatment.
Detecting tumor suppressor genes (TSGs) and proto-oncogenes (OGs) improves cancer identification performance, as discussed in [14]. That work used genomic data and variants from The Cancer Genome Atlas (TCGA), ICGC, and COSMIC.
So, how can we classify OGs and TSGs using only their three-dimensional protein structures and the biochemical properties of the amino acids that form those structures?
Generate and Classify Mutational Signatures
In the previous section, Cancer Fundamentals, we dove into some background on mutation signatures unique to certain types of cancers. Here, we'll go through an analysis pipeline that tackles this question:
Given the genome of a patient with a rare form of cancer X, can we find a subset of well-studied cancers that is biologically similar to X?
By tackling this question, we open up the potential to suggest treatments based on existing therapeutics for the well-known tumors most similar to cancer X. You can think of this as drug repurposing.
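As a rough illustration of what "most similar" could mean in practice, the sketch below ranks reference cancers by the cosine similarity between mutational-signature weight vectors. Cosine similarity is an assumption chosen for illustration, not necessarily the measure used in the pipeline that follows; the function names and the signature weights are made up.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_similar_cancers(query_profile, reference_profiles, top_k=2):
    """Return the top_k (name, score) pairs most similar to the query profile."""
    scored = [
        (name, cosine_similarity(query_profile, profile))
        for name, profile in reference_profiles.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical signature weights (fraction of mutations attributed to each
# of three signatures). These numbers are invented for illustration only.
reference_profiles = {
    "cancer_A": [0.70, 0.20, 0.10],
    "cancer_B": [0.10, 0.10, 0.80],
    "cancer_C": [0.60, 0.30, 0.10],
}
cancer_x = [0.68, 0.22, 0.10]

ranking = rank_similar_cancers(cancer_x, reference_profiles)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

With real data, each vector would hold the signature weights estimated for one tumor type, and the top-ranked reference cancers would be the candidates whose existing therapeutics are worth a closer look.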
Pipeline Walk-Through
Real-world application by the C3 team during the NF2 hackathon.
Let's say we have a person named Kit. Kit has been diagnosed with a rare form of brain cancer, and the SVAI community wants to help him out. Let's say we already obtained our list of genes that may be linked to Kit's brain cancer. As scientists and researchers, we might be curious about whether Kit's DNA has some clues to help us better understand his cancer.
First, we need to get the relevant GO annotations for Kit's type of cancer. Since he has a form of brain cancer, the most relevant types of data from the Broad Institute site are:
Glioblastoma multiforme
Glioma
Under these datasets, we specifically want annotated point mutations and indels with functional data relevant to cancer researchers. Annotations include gene names, functional consequence (e.g. Missense), PolyPhen-2 predictions, and cancer-specific annotations from resources such as COSMIC, Tumorscape, and published MutSig results. This type of annotation is available via:
2. In the table, under the column 'Disease Name', locate the row containing your disease of interest (e.g. for our example case, we're interested in the brain cancers Glioblastoma multiforme and Glioma).
3. Then in the same row(s), click on the link under the 'Data' column.
4. Now you should be directed to a window that looks like this:
5. Since we're interested in the genes associated with cancer, and we've already chosen the mutated genes unique to the tumor sample, we want to know how the mutations may have affected the biological and molecular processes that contributed to cancer development and progression. The relevant data is under the section 'Mutation Annotation File', in the file named 'Mutation_Packager_Oncotated_Calls'.
6. Now that we've downloaded and unzipped our data of interest, we can extract the relevant mutation data associated with our genes of interest. The C3 team has a great example:
Separate Python version (create code for 1 + 2, talk through 3):
Import the dataset of interest, including the reference.
Grab the columns for the feature matrices and save the files.
Next steps with the feature matrices: generate mutation signatures via deconstruct_sigs_py (installation is currently broken), perform downstream PCA and clustering, and find the cluster that the patient's sample most closely identifies with; consider additional pathways for the other feature matrices.
Explain out the different feature matrices to
Look at best practices applied with C3's approach to classifying the signature and deriving molecular relevance: https://github.com/SVAI/C3/blob/master/C3finalpresentation.pdf
(consider whether t-SNE gives a better embedding for clustering than PCA)
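A minimal Python sketch of the first two steps, plus a PCA projection, might look like the following. It is not the C3 team's code and does not use deconstruct_sigs_py: as a stand-in feature matrix it counts the six pyrimidine-centred substitution classes per sample, and it runs PCA with a plain NumPy SVD. The column names are the standard MAF ones (`Hugo_Symbol`, `Tumor_Sample_Barcode`, `Reference_Allele`, `Tumor_Seq_Allele2`); a real Oncotator file should be checked for them. The function names and the toy records are hypothetical.

```python
import csv
import io
from collections import Counter

import numpy as np

SUBSTITUTION_CLASSES = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def substitution_class(ref, alt):
    """Fold a single-base change onto its pyrimidine (C or T) reference form."""
    if ref in ("G", "A"):
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}>{alt}"

def build_feature_matrix(maf_handle, genes_of_interest=None):
    """Count substitution classes per tumor sample from a tab-delimited MAF."""
    rows = (line for line in maf_handle if not line.startswith("#"))
    counts = {}
    for record in csv.DictReader(rows, delimiter="\t"):
        if genes_of_interest and record["Hugo_Symbol"] not in genes_of_interest:
            continue
        ref, alt = record["Reference_Allele"], record["Tumor_Seq_Allele2"]
        if ref not in COMPLEMENT or alt not in COMPLEMENT or ref == alt:
            continue  # skip indels and anything that is not a simple SNV
        sample = record["Tumor_Sample_Barcode"]
        counts.setdefault(sample, Counter())[substitution_class(ref, alt)] += 1
    samples = sorted(counts)
    matrix = np.array(
        [[counts[s][c] for c in SUBSTITUTION_CLASSES] for s in samples],
        dtype=float,
    )
    return samples, matrix

def pca_project(matrix, n_components=2):
    """Project rows onto the top principal components using an SVD."""
    centered = matrix - matrix.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# Toy MAF content, invented for illustration only.
toy_maf = io.StringIO(
    "#version 2.4\n"
    "Hugo_Symbol\tTumor_Sample_Barcode\tVariant_Classification\t"
    "Reference_Allele\tTumor_Seq_Allele2\n"
    "NF2\tsample_1\tMissense_Mutation\tC\tT\n"
    "NF2\tsample_1\tNonsense_Mutation\tG\tA\n"
    "TP53\tsample_1\tMissense_Mutation\tT\tG\n"
    "NF2\tsample_2\tFrame_Shift_Del\tCA\t-\n"
    "EGFR\tsample_2\tMissense_Mutation\tA\tC\n"
    "TP53\tsample_3\tMissense_Mutation\tC\tA\n"
)
samples, matrix = build_feature_matrix(toy_maf)
proportions = matrix / matrix.sum(axis=1, keepdims=True)
embedding = pca_project(proportions)
print(samples)
print(embedding.shape)
```

Passing `genes_of_interest={"NF2"}` would restrict the counts to a chosen gene list. The rows of `embedding` are what a downstream clustering step would group, with the patient's sample projected alongside the reference samples.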
Denoising autoencoders to extract features from breast cancer gene expression data
From this paper.
More about the Potential of Autoencoders
Features are either nodes in the encoded layer, or sets of genes whose weights most strongly influence a certain node. Both of these demonstrate the ability of autoencoders to pick up individual features that can identify tumor subtypes and estrogen-receptor status, and predict patient survival. This success in classifying tumors of different types based on features learned from a single dataset suggests that gene expression features may be shared across the human transcriptome.
Would training a single autoencoder with a variety of expression profiles generate a universal common set of features that are predictive in all tissue types? Given that the method seems to work for gene expression data, a second question arises: are there any useful features to be learned from using an autoencoder on DNA sequence data?
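To make the two notions of "feature" concrete, here is a small NumPy sketch of a one-hidden-layer denoising autoencoder trained on synthetic expression values. It is not the paper's architecture or hyperparameters: the layer size, masking-noise rate, learning rate, squared-error loss, and the toy two-subtype data are all assumptions picked for illustration, and the function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_denoising_autoencoder(X, n_hidden=2, noise=0.2, lr=0.5,
                                epochs=2000, seed=0):
    """Full-batch gradient descent on squared reconstruction error."""
    rng = np.random.default_rng(seed)
    n_samples, n_genes = X.shape
    W1 = rng.normal(0, 0.1, (n_genes, n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(0, 0.1, (n_hidden, n_genes))
    b2 = np.zeros(n_genes)
    for _ in range(epochs):
        # Masking noise: randomly zero a fraction of the inputs each epoch.
        X_noisy = X * (rng.random(X.shape) >= noise)
        h = sigmoid(X_noisy @ W1 + b1)       # encoded layer
        y = sigmoid(h @ W2 + b2)             # reconstruction of the clean input
        d_out = (y - X) * y * (1 - y) / n_samples
        d_hid = (d_out @ W2.T) * h * (1 - h)
        W2 -= lr * h.T @ d_out
        b2 -= lr * d_out.sum(axis=0)
        W1 -= lr * X_noisy.T @ d_hid
        b1 -= lr * d_hid.sum(axis=0)
    return W1, b1, W2, b2

def reconstruction_error(X, params):
    """Mean squared error when reconstructing uncorrupted input."""
    W1, b1, W2, b2 = params
    y = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)
    return float(np.mean((y - X) ** 2))

def top_genes_for_node(W1, node, k=3):
    """Indices of the k genes with the largest absolute weight into a node."""
    return np.argsort(-np.abs(W1[:, node]))[:k].tolist()

# Synthetic data: 40 samples, 10 genes, two made-up "subtypes".
rng = np.random.default_rng(42)
subtype = np.repeat([0, 1], 20)
X = np.full((40, 10), 0.1)
X[subtype == 0, :5] = 0.9   # genes 0-4 high in subtype 0
X[subtype == 1, 5:] = 0.9   # genes 5-9 high in subtype 1
X = np.clip(X + rng.normal(0, 0.05, X.shape), 0, 1)

error_before = reconstruction_error(X, train_denoising_autoencoder(X, epochs=0))
params = train_denoising_autoencoder(X)
error_after = reconstruction_error(X, params)
encoded = sigmoid(X @ params[0] + params[1])  # node activations per sample
top_genes = top_genes_for_node(params[0], node=0)
print(error_before, error_after, top_genes)
```

Here the columns of `encoded` are the node-level features, and `top_genes_for_node` gives the gene-set view of a single node. On real expression data one would then check whether a node's activation tracks a known label such as subtype or estrogen-receptor status.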