Cancer Analysis Approaches: Bio/AI

Created by Lily Vittayarukskul for SVAI research community. Open to collaborators!

Last updated 6 years ago

Overview of AI applied to Cancer

Overview came from this paper.

If you need a refresher on cancer fundamentals before this deep dive, check out the Cancer Fundamentals page.

Traditional Machine Learning for Cancer

Traditional machine learning algorithms such as decision trees, random forests (RF), artificial neural networks (ANN), and support vector machines (SVM) have been successfully applied to build predictive models for various aspects of cancer, including:

  • prognosis of cancer

  • classification of cancer types from data sources such as clinical data, SNPs, and gene expression.

Deep learning for Cancer

Recently, deep learning has shown remarkable performance in predicting:

  • the specificity of DNA and mRNA binding sites

  • functional classification

  • protein folding pattern

  • cancer categorization

This paper dives into the automatic detection and prediction of oncogenes and tumor suppressor genes from their three-dimensional features, a big step toward discovering their structural characteristics and improving the state of the art in cancer treatment.

Detecting tumor suppressor genes (TSGs) and proto-oncogenes (OGs) improves cancer identification performance, as discussed in [14]. The authors used genomic data and variants from The Cancer Genome Atlas (TCGA), ICGC, and COSMIC.

So, how can we classify OGs and TSGs using only their three-dimensional protein structures and the biochemical properties of the amino acids that form those structures?

Generate and Classify Mutational Signatures

In the previous section, Cancer Fundamentals, we dove into some background on mutation signatures unique to certain types of cancers. Here, we'll go through an analysis pipeline that tackles this question:

Given the genome of a patient with a rare form of cancer X, can we find a subset of well-studied cancers that is biologically similar to X?

By tackling this question, we open up the potential to suggest treatments based on existing therapeutics for the well-known tumors most similar to cancer X. You can think of this as drug repurposing.
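To make "biologically similar" concrete, here is a minimal sketch that ranks well-studied cancers by how closely their mutation profiles match cancer X. It assumes each cancer is summarized as a vector of mutation-type counts and uses cosine similarity as the comparison; both are illustrative choices rather than a method prescribed by this page, and the cancer names and counts are made up.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile is treated as similar to nothing
    return dot / (norm_a * norm_b)

def rank_similar_cancers(query_profile, reference_profiles):
    """Return (cancer_name, similarity) pairs, most similar first."""
    scores = [(name, cosine_similarity(query_profile, profile))
              for name, profile in reference_profiles.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical counts per substitution class (C>A, C>G, C>T, T>A, T>C, T>G).
reference_profiles = {
    "cancer_A": [10, 2, 40, 3, 8, 1],
    "cancer_B": [35, 5, 10, 4, 6, 2],
    "cancer_C": [5, 5, 5, 5, 5, 5],
}
cancer_x = [12, 3, 38, 2, 9, 1]

ranking = rank_similar_cancers(cancer_x, reference_profiles)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

The top-ranked entries are the candidates whose existing therapeutics would be worth a closer look for cancer X.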

Pipeline Walk-Through

Let's say we have a person named Kit. Kit has been diagnosed with a rare form of brain cancer, and the SVAI community wants to help him out. Let's say we already obtained our list of genes that may be linked to Kit's brain cancer. As scientists and researchers, we might be curious about whether Kit's DNA has some clues to help us better understand his cancer.

Since Kit has a form of brain cancer, the most relevant datasets are:

  • Glioblastoma multiforme

  • Glioma

Under these datasets, we specifically want annotated point mutations and indels with functional data relevant to cancer researchers. Annotations include gene names, functional consequence (e.g. Missense), PolyPhen-2 predictions, and cancer-specific annotations from resources such as COSMIC, Tumorscape, and published MutSig results. This type of annotation is available via:

  1. Go to the Broad Institute site (http://gdac.broadinstitute.org/).

  2. In the table, under the column 'Disease Name', locate the row containing your disease of interest (for our example case, we're interested in the brain cancers Glioblastoma multiforme and Glioma).

  3. Then, in the same row(s), click on the link under the 'Data' column.

  4. You should now be directed to a new window for that dataset.

  5. Since we're interested in the genes associated with cancer, and we've already chosen the mutated genes unique to the tumor sample, we want to know how each mutation may have affected the biological and molecular processes that contributed to cancer development and progression. The relevant data is under the section 'Mutation Annotation File', attached to the file name 'Mutation_Packager_Oncotated_Calls'.

Separate Python version (create code for 1 + 2, talk through 3):

  1. import dataset of interest, including reference

  2. grab cols for feature matrices, save the files

    1. Explain out the different feature matrices to

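    1. (consider whether t-SNE gives a better low-dimensional view of the samples; note that t-SNE is a dimensionality-reduction method, not a classifier)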
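As a rough sketch of steps 1 and 2, the code below turns mutation rows into a sample-by-feature matrix and saves it as CSV. The choice of feature (counts of the six pyrimidine-centered substitution classes) is an assumption made for illustration, since the columns to grab are not specified above. The column names `Reference_Allele`, `Tumor_Seq_Allele2`, and `Tumor_Sample_Barcode` are standard MAF fields; the rows themselves are made up.

```python
import csv
import io
from collections import defaultdict

SUBSTITUTION_CLASSES = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def substitution_class(ref, alt):
    """Map a single-base change to its pyrimidine-centered class, or None."""
    if ref not in COMPLEMENT or alt not in COMPLEMENT or ref == alt:
        return None  # indels and non-standard alleles are skipped
    if ref in ("A", "G"):  # report the change on the pyrimidine strand
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}>{alt}"

def build_feature_matrix(rows):
    """Return (sample_ids, matrix) with one row of class counts per sample."""
    counts = defaultdict(lambda: dict.fromkeys(SUBSTITUTION_CLASSES, 0))
    for row in rows:
        label = substitution_class(row["Reference_Allele"], row["Tumor_Seq_Allele2"])
        if label is not None:
            counts[row["Tumor_Sample_Barcode"]][label] += 1
    sample_ids = sorted(counts)
    matrix = [[counts[s][c] for c in SUBSTITUTION_CLASSES] for s in sample_ids]
    return sample_ids, matrix

def write_matrix_csv(handle, sample_ids, matrix):
    """Save the feature matrix with a header row."""
    writer = csv.writer(handle)
    writer.writerow(["sample"] + SUBSTITUTION_CLASSES)
    for sample_id, row in zip(sample_ids, matrix):
        writer.writerow([sample_id] + row)

# Made-up rows standing in for records read from a MAF file.
rows = [
    {"Tumor_Sample_Barcode": "S1", "Reference_Allele": "C", "Tumor_Seq_Allele2": "T"},
    {"Tumor_Sample_Barcode": "S1", "Reference_Allele": "G", "Tumor_Seq_Allele2": "A"},
    {"Tumor_Sample_Barcode": "S2", "Reference_Allele": "A", "Tumor_Seq_Allele2": "C"},
    {"Tumor_Sample_Barcode": "S2", "Reference_Allele": "-", "Tumor_Seq_Allele2": "AT"},
]
sample_ids, matrix = build_feature_matrix(rows)

buffer = io.StringIO()  # swap for open("features.csv", "w", newline="") to save to disk
write_matrix_csv(buffer, sample_ids, matrix)
```

Other feature matrices (for example, one row per sample and one column per mutated gene) can be built the same way by changing what gets counted.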

Denoising autoencoders to extract features from breast cancer gene expression data

More about the Potential of Autoencoders

Features are either nodes in the encoded layer, or sets of genes whose weights most strongly influence a certain node. Both of these demonstrate the ability of autoencoders to pick up individual features that can identify tumor subtypes and estrogen-receptor status, and predict patient survival. This success in classifying tumors of different types based on features learned from a single dataset suggests that gene expression features may be shared across the human transcriptome.

Would training a single autoencoder on a variety of expression profiles generate a universal set of features that are predictive in all tissue types? Given that the method seems to work for gene expression data, a second question arises: are there any useful features to be learned by applying an autoencoder to DNA sequence data?
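To show what "nodes in the encoded layer" means in practice, here is a small denoising autoencoder written in plain Python. It is a toy sketch under stated assumptions (one sigmoid hidden layer, a linear decoder, masking noise, plain SGD, and made-up expression values scaled to [0, 1]), not the architecture or data of the work discussed above.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def random_matrix(rng, n_rows, n_cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_cols)] for _ in range(n_rows)]

class DenoisingAutoencoder:
    def __init__(self, n_inputs, n_hidden, noise=0.2, lr=0.1, seed=0):
        self.rng = random.Random(seed)
        self.noise, self.lr = noise, lr
        self.w_enc = random_matrix(self.rng, n_hidden, n_inputs)
        self.b_enc = [0.0] * n_hidden
        self.w_dec = random_matrix(self.rng, n_inputs, n_hidden)
        self.b_dec = [0.0] * n_inputs

    def encode(self, x):
        """Hidden-node activations: these are the learned features."""
        return [sigmoid(sum(w * v for w, v in zip(row, x)) + b)
                for row, b in zip(self.w_enc, self.b_enc)]

    def decode(self, h):
        return [sum(w * v for w, v in zip(row, h)) + b
                for row, b in zip(self.w_dec, self.b_dec)]

    def corrupt(self, x):
        """Masking noise: zero each input with probability self.noise."""
        return [0.0 if self.rng.random() < self.noise else v for v in x]

    def train_step(self, x):
        noisy = self.corrupt(x)
        h = self.encode(noisy)
        y = self.decode(h)
        err = [yi - xi for yi, xi in zip(y, x)]  # compare against the clean input
        # Backpropagate through the decoder before its weights change.
        grad_h = [sum(self.w_dec[i][j] * err[i] for i in range(len(err)))
                  for j in range(len(h))]
        grad_z = [g * hj * (1.0 - hj) for g, hj in zip(grad_h, h)]
        for i, e in enumerate(err):
            for j, hj in enumerate(h):
                self.w_dec[i][j] -= self.lr * e * hj
            self.b_dec[i] -= self.lr * e
        for j, g in enumerate(grad_z):
            for m, v in enumerate(noisy):
                self.w_enc[j][m] -= self.lr * g * v
            self.b_enc[j] -= self.lr * g

    def reconstruction_error(self, data):
        """Mean squared reconstruction error on uncorrupted inputs."""
        total = 0.0
        for x in data:
            y = self.decode(self.encode(x))
            total += sum((yi - xi) ** 2 for yi, xi in zip(y, x))
        return total / len(data)

# Made-up expression profiles for two toy "subtypes" (6 genes each).
data = [
    [1.0, 1.0, 1.0, 0.0, 0.0, 0.0],
    [0.9, 1.0, 0.8, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 1.0, 1.0],
    [0.1, 0.0, 0.0, 0.9, 1.0, 0.8],
]
model = DenoisingAutoencoder(n_inputs=6, n_hidden=2)
before = model.reconstruction_error(data)
for _ in range(500):
    for sample in data:
        model.train_step(sample)
after = model.reconstruction_error(data)
features = [model.encode(sample) for sample in data]
```

After training, `features` holds the two hidden-node activations per sample, and the rows of `model.w_enc` show which genes most strongly drive each node.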

Real-world application by the C3 team during the NF2 hackathon.

First, we need to get the relevant GO annotations for Kit's type of cancer. Since he has a form of brain cancer, the most relevant datasets from the Broad Institute site are Glioblastoma multiforme and Glioma.

Go to http://gdac.broadinstitute.org/

6. Now that we've downloaded and unzipped our data of interest, we can extract the relevant mutation data associated with our gene of interest. The C3 team has a great example.
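A minimal sketch of this extraction step, assuming the unzipped file is a tab-delimited MAF with `#` comment lines and the standard `Hugo_Symbol` column. The rows below are made up and stand in for a real file; this is not the C3 team's code.

```python
import csv
import io

def read_maf_rows(handle, gene_symbols):
    """Yield MAF rows whose Hugo_Symbol is in gene_symbols.

    MAF files are tab-delimited; lines starting with '#' are comments.
    """
    data_lines = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(data_lines, delimiter="\t")
    for row in reader:
        if row["Hugo_Symbol"] in gene_symbols:
            yield row

# Tiny in-memory stand-in for an unzipped MAF file (made-up rows).
example_maf = io.StringIO(
    "#version 2.4\n"
    "Hugo_Symbol\tVariant_Classification\tVariant_Type\tTumor_Sample_Barcode\n"
    "TP53\tMissense_Mutation\tSNP\tSAMPLE-01\n"
    "EGFR\tSilent\tSNP\tSAMPLE-01\n"
    "TP53\tFrame_Shift_Del\tDEL\tSAMPLE-02\n"
)

hits = list(read_maf_rows(example_maf, {"TP53"}))
for row in hits:
    print(row["Tumor_Sample_Barcode"], row["Variant_Classification"])
```

For a file on disk, replace the `io.StringIO` object with `open(path, newline="")`, where `path` points at the unzipped MAF.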

Next steps with the feature matrices: generate mutation signatures via deconstruct_sigs_py (installation is currently broken), perform downstream PCA and clustering, and find the cluster that the sample of interest most closely matches; consider additional pathways for the other feature matrices.
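The PCA step can be sketched without any external packages. The example below finds the first principal component by power iteration and then picks the reference sample closest to the query along that component, as a simple stand-in for the clustering step. The sample names and counts are made up, and deconstruct_sigs_py is not used here.

```python
import math

def center_columns(matrix):
    """Subtract each column's mean so PCA sees variation, not offsets."""
    n_rows = len(matrix)
    means = [sum(col) / n_rows for col in zip(*matrix)]
    return [[x - m for x, m in zip(row, means)] for row in matrix]

def first_principal_component(matrix, iterations=200):
    """Return (direction, per-row scores) for the top principal component."""
    centered = center_columns(matrix)
    # Start from the largest centered row so the start vector lies in the data's span.
    vec = max(centered, key=lambda row: sum(x * x for x in row))
    for _ in range(iterations):
        scores = [sum(x * v for x, v in zip(row, vec)) for row in centered]
        # Multiply by X^T: one power-iteration step on X^T X.
        new_vec = [sum(s * row[j] for s, row in zip(scores, centered))
                   for j in range(len(vec))]
        norm = math.sqrt(sum(x * x for x in new_vec))
        if norm == 0:
            break  # all rows identical, so there is no direction of variation
        vec = [x / norm for x in new_vec]
    scores = [sum(x * v for x, v in zip(row, vec)) for row in centered]
    return vec, scores

def nearest_reference(names, scores, query_name):
    """Name of the non-query sample whose score is closest to the query's."""
    query_score = scores[names.index(query_name)]
    candidates = [(abs(s - query_score), n)
                  for n, s in zip(names, scores) if n != query_name]
    return min(candidates)[1]

# Hypothetical substitution-class counts (C>A, C>G, C>T, T>A, T>C, T>G).
profiles = {
    "ref_A1": [10, 2, 40, 3, 8, 1],
    "ref_A2": [12, 1, 42, 2, 7, 2],
    "ref_B1": [35, 5, 10, 4, 6, 2],
    "ref_B2": [33, 6, 12, 5, 5, 1],
    "kit": [11, 3, 38, 2, 9, 1],
}
names = list(profiles)
direction, scores = first_principal_component([profiles[n] for n in names])
closest = nearest_reference(names, scores, "kit")
```

With real data, a library PCA and a proper clustering method would be the more robust choice; the point here is only to show the shape of the computation.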

Look at the best practices applied in C3's approach to classifying the signature and deriving molecular relevance: https://github.com/SVAI/C3/blob/master/C3finalpresentation.pdf

From this paper.

Links referenced on this page:

  • Broad Institute site: http://gdac.broadinstitute.org/

  • deconstruct_sigs_py

  • C3 final presentation: https://github.com/SVAI/C3/blob/master/C3finalpresentation.pdf