Overview

Created by Lily Vittayarukskul for the SVAI research community. Open to collaborators!

Introduction

For now, this page surveys relevant and powerful tools for applying AI to biomedicine. The tools introduced here are biased towards deep learning approaches. You might see some of these packages used in code walkthroughs throughout this notebook.

Note: This overview is not comprehensive. A good list of deep learning tools, which I'll reference often, is maintained in this GitHub repository.

Generic 'Omics Tools

Here's a great, comprehensive, up-to-date list; I'll introduce one noteworthy tool here.

pysster

Description: Learning Sequence and Structure Motifs in DNA and RNA Sequences using Convolutional Neural Networks [github][preprint]

pysster is a toolbox for learning motifs from DNA/RNA sequence data using convolutional neural networks. This TensorFlow-based library reportedly runs on GPU out of the box. It also supports hyperparameter optimization and visualization of what the different network layers are learning.
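To make the idea of a convolutional filter acting as a motif detector more concrete, here is a minimal NumPy sketch. It does not use pysster's API; the function names `one_hot` and `scan_filter` and the hand-written "TATA" filter are illustrative assumptions. In a trained network the filter weights would be learned rather than written by hand.

```python
import numpy as np

ALPHABET = "ACGT"  # assumed DNA alphabet; use "ACGU" for RNA

def one_hot(seq):
    """Encode a sequence as a (length, 4) matrix of 0/1 values."""
    index = {base: i for i, base in enumerate(ALPHABET)}
    encoded = np.zeros((len(seq), len(ALPHABET)), dtype=np.float32)
    for pos, base in enumerate(seq.upper()):
        if base in index:  # unknown bases such as N stay all-zero
            encoded[pos, index[base]] = 1.0
    return encoded

def scan_filter(encoded, kernel):
    """Slide one convolutional filter of shape (width, 4) along the sequence."""
    width = kernel.shape[0]
    n_windows = encoded.shape[0] - width + 1
    return np.array(
        [np.sum(encoded[i:i + width] * kernel) for i in range(n_windows)]
    )

# Hypothetical filter that responds most strongly to the exact motif "TATA".
tata_filter = one_hot("TATA")
scores = scan_filter(one_hot("GGTATAGG"), tata_filter)
best_position = int(np.argmax(scores))  # start index of the best match
```

The position with the highest score is where the sequence best matches the filter, which is the basic intuition behind reading learned filters as motifs.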

Genomics

Variant Calling

DeepVariant [github][preprint]

Instead of using the nucleotides in the sequenced DNA fragments directly (as the symbols A, C, G, T), the DeepVariant authors first convert the aligned reads into images and then apply convolutional neural networks to these images, which represent "pile-ups": stacks of reads aligned to the same region. This turned out to be a very effective way to call variants, as shown by both Google's own and independent benchmarks.
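The pile-up idea can be sketched as follows. This is a simplified illustration and not DeepVariant's actual encoding: the `pileup_tensor` function, the channel layout, and the `(start, sequence)` read format are assumptions made for this example, and it ignores indels, base qualities, and strand.

```python
import numpy as np

BASES = "ACGT"

def pileup_tensor(reference, reads):
    """Stack aligned reads into a (rows, positions, channels) array.

    Row 0 holds the reference; each later row holds one read.
    Channels 0-3 one-hot encode the base, channel 4 flags a mismatch
    against the reference. `reads` is a list of (start, sequence) pairs,
    a simplified stand-in for real alignments.
    """
    n_rows, n_cols = len(reads) + 1, len(reference)
    image = np.zeros((n_rows, n_cols, len(BASES) + 1), dtype=np.float32)
    for col, base in enumerate(reference):
        image[0, col, BASES.index(base)] = 1.0
    for row, (start, seq) in enumerate(reads, start=1):
        for offset, base in enumerate(seq):
            col = start + offset
            if col >= n_cols or base not in BASES:
                continue  # skip bases outside the window or unknown bases
            image[row, col, BASES.index(base)] = 1.0
            if base != reference[col]:
                image[row, col, 4] = 1.0
    return image

reference = "ACGTACGT"
reads = [(0, "ACGTAC"), (2, "GTTCGT"), (3, "TTCG")]
image = pileup_tensor(reference, reads)

# Number of reads that disagree with the reference at each position.
mismatch_counts = image[1:, :, 4].sum(axis=0)
```

A column where several reads carry the same mismatch (position 4 in this toy input) is the kind of visual pattern a CNN can learn to associate with a true variant.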

Gene Expression

We have yet to directly curate and host a patient's gene expression data, but there are plenty of online RNA-seq databases from which to pull potentially relevant data. If you decide to analyze gene expression data, here's a nice list of deep learning packages.

Predicting enhancers and regulatory regions

This is pretty relevant to the genomic data we gather from the patient. Here's a great comprehensive up-to-date list, but I'll introduce a few noteworthy ones here.

DNA-Level Splice Junction Prediction

paper here. [include intro]

Proteomics

We don't directly curate proteomic data from the patient, but you might be able to get relevant proteomics data from solid online databases. Here's a great, comprehensive, up-to-date list; I'll introduce a few noteworthy tools here.

CNN for Cancer Suppressor Gene and Oncogene Prediction

Description: A Deep Learning Model for Predicting Tumor Suppressor Genes and Oncogenes from PDB Structure [github][bioRxiv preprint]

The authors use CNNs on feature maps extracted from protein 3D structures in the Protein Data Bank (PDB) to predict oncogenes and tumor suppressor genes.

Background

This paper explores the automatic detection and prediction of oncogenes and tumor suppressor genes from their three-dimensional features. The authors see this as a big step towards discovering the structural characteristics of these genes and, in turn, towards improving cancer treatments.

Detecting tumor suppressor genes (TSGs) and proto-oncogenes (OGs) improves cancer identification performance, as discussed in [14]. That work used genomic data and variants from The Cancer Genome Atlas (TCGA), ICGC, and COSMIC.

So, how can we classify OGs and TSGs using only their three-dimensional protein structures and the biochemical properties of the amino acids that form those structures?

Previously, the functional annotation of proteins has been improved by various methods, such as prediction by sequence similarity [15, 16], evolutionary relations [17], genetic interactions [18], protein-protein interactions [19], and protein structures and gene-ontology hierarchy [9, 20, 21].

Comparing the AUROC of this model with the values reported by state-of-the-art statistical methods for OG versus TSG identification, the model outperforms six of the eight methods and is close to the best AUROC. (Truncation Rate and Random Forest are slightly better.)
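For readers less familiar with the metric, AUROC can be read as the probability that a randomly chosen positive example (say, an OG) gets a higher score than a randomly chosen negative one (a TSG). The sketch below computes it directly from that definition; the `auroc` function and the toy labels and scores are illustrative and are not taken from the paper.

```python
import numpy as np

def auroc(labels, scores):
    """AUROC as the fraction of positive/negative pairs ranked correctly.

    Tied pairs count as half a correct ranking.
    """
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("need at least one positive and one negative example")
    diff = pos[:, None] - neg[None, :]  # every positive vs. every negative
    correct = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(correct / diff.size)

# Toy example: 3 positives, 3 negatives, one positive ranked too low.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
value = auroc(labels, scores)  # 8 of 9 pairs are ordered correctly
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why small differences near the top of the range, as in the comparison above, can still matter.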

This investigation was established in two steps: 1) Protein feature extraction from the PDB tertiary structure; 2) modeling the gene patterns using a parallel deep convolutional neural network (CNN). The proposed DCNN preserves the spatial information of the tertiary structure while modeling the protein structure/features via three parallel, independent visual feature extraction modules.

Finally, the fully connected layers of the DCNN classify the combined visual features. The experiments reported an accuracy of 82.57% and an area under the ROC curve of 0.887. The authors conclude that this initial success warrants future work applying the same deep learning approach to new datasets, predicting different cancer types in order to identify cancer drivers.
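The overall shape of such a model, with parallel, independent feature-extraction modules whose outputs are combined and passed to a fully connected classifier, can be sketched as a forward pass in NumPy. This is not the authors' architecture: the layer sizes, the random weights, and the function names `conv2d_valid`, `branch`, and `forward` are all assumptions chosen to keep the example small.

```python
import numpy as np

def conv2d_valid(x, kernel):
    """Plain 2D cross-correlation with no padding."""
    kh, kw = kernel.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out

def branch(feature_map, kernels):
    """One feature-extraction module: convolution, ReLU, global max pooling."""
    return np.array(
        [np.maximum(conv2d_valid(feature_map, k), 0.0).max() for k in kernels]
    )

def forward(feature_maps, branch_kernels, dense_w, dense_b):
    """Run each map through its own branch, concatenate, then classify."""
    pooled = [branch(m, ks) for m, ks in zip(feature_maps, branch_kernels)]
    combined = np.concatenate(pooled)
    logit = combined @ dense_w + dense_b
    return float(1.0 / (1.0 + np.exp(-logit)))  # sigmoid output in (0, 1)

rng = np.random.default_rng(0)
# Three hypothetical 16x16 feature maps derived from one protein structure.
feature_maps = [rng.random((16, 16)) for _ in range(3)]
# Each branch has its own four 3x3 filters (untrained, random weights).
branch_kernels = [rng.normal(size=(4, 3, 3)) for _ in range(3)]
dense_w = rng.normal(size=12)  # 3 branches x 4 filters
dense_b = 0.0
probability = forward(feature_maps, branch_kernels, dense_w, dense_b)
```

Because each branch keeps its own filters, the modules can specialize on different kinds of input before their features are merged.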

Deep-RBPPred

Description: Predicting RNA binding proteins in the proteome scale based on deep learning [code][bioRxiv preprint]

Predicts RNA-binding proteins using CNNs.

Misc

Autoencoding

A method called D-GEX has been developed that uses a multi-task deep neural network trained on the publicly available CMAP dataset, to predict the expression of all genes, given the expression of ~1000 genes.
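The task itself, inferring many target genes from a small set of measured genes, can be illustrated with a much simpler stand-in. The sketch below fits a linear multi-output least-squares model on synthetic data; it is not the D-GEX network, and the toy sizes and all variable names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy sizes; the real task uses ~1000 measured genes and many more targets.
n_samples, n_landmark, n_target = 200, 10, 30

# Synthetic data: targets are noisy linear combinations of the measured genes.
true_weights = rng.normal(size=(n_landmark, n_target))
landmark = rng.normal(size=(n_samples, n_landmark))
target = landmark @ true_weights + 0.1 * rng.normal(size=(n_samples, n_target))

train, test = slice(0, 150), slice(150, None)

def with_bias(x):
    """Append a column of ones so the model can learn an intercept."""
    return np.hstack([x, np.ones((x.shape[0], 1))])

# One least-squares solve fits every target gene at once (multi-output).
weights, *_ = np.linalg.lstsq(with_bias(landmark[train]), target[train], rcond=None)
predicted = with_bias(landmark[test]) @ weights
mean_abs_error = float(np.mean(np.abs(predicted - target[test])))
```

A multi-task neural network replaces the single linear map with shared hidden layers, so nonlinear relationships between measured and target genes can be captured while all targets are still predicted together.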
A related, much harder task is predicting the expression level of an exon or transcript from DNA sequence data. Expression level depends not only on the sequence but also on the cellular context. This paper, titled 'Deep learning of the tissue-regulated splicing code', describes a model that predicts the percentage of transcripts with the exon spliced in (PSI), given the DNA sequence surrounding the exon. The model is trained on hand-generated genomic features to predict splicing patterns in specific mouse cell types. After the feature space is reduced with an autoencoder, the encoded features, together with additional inputs representing the cell type, are used to train a multilayer fully connected network. Based on this method, the authors developed and validated a tool that can score the effect of a single-nucleotide variant (or mutation) on splicing.
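Two pieces of this pipeline are easy to make concrete: the PSI value itself, and the way encoded features and cell type are combined into one network input. The sketch below is a minimal illustration; the function names, the read counts, the cell-type list, and the use of a simple PSI difference as a variant score are assumptions, not details taken from the paper.

```python
import numpy as np

def percent_spliced_in(inclusion_reads, exclusion_reads):
    """PSI = inclusion / (inclusion + exclusion), as a fraction in [0, 1]."""
    total = inclusion_reads + exclusion_reads
    if total == 0:
        raise ValueError("no reads cover this exon")
    return inclusion_reads / total

def build_network_input(encoded_features, cell_type, cell_types):
    """Concatenate autoencoder output with a one-hot cell-type indicator."""
    indicator = np.zeros(len(cell_types))
    indicator[cell_types.index(cell_type)] = 1.0
    return np.concatenate([encoded_features, indicator])

def variant_effect(psi_reference, psi_variant):
    """Assumed scoring rule: change in predicted PSI caused by the variant."""
    return psi_variant - psi_reference

psi = percent_spliced_in(30, 10)  # 30 inclusion reads, 10 exclusion reads

cell_types = ["brain", "heart", "kidney"]  # hypothetical tissue list
encoded = np.array([0.2, -1.3, 0.7, 0.05])  # hypothetical encoded features
network_input = build_network_input(encoded, "heart", cell_types)

delta = variant_effect(psi_reference=0.75, psi_variant=0.40)
```

Feeding the cell type as an extra input lets one network predict different PSI values for the same exon in different tissues.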

Convolutional networks for epigenomics

Convolutional neural networks are used to answer epigenomic questions, such as predicting transcription factor binding sites, enhancer regions, and chromatin accessibility from DNA sequence.
Because this often involves training on more than one data type, custom CNN architectures are used, for example:
- training the same network to predict two different targets
- combining different kinds of input via independent convolution modules
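The first of these patterns, one network predicting two targets, can be sketched as a shared convolutional trunk followed by two output heads. This is an untrained forward pass with random weights; the filter sizes, the head names `head_binding` and `head_accessibility`, and the example sequence are assumptions for illustration.

```python
import numpy as np

def one_hot(seq, alphabet="ACGT"):
    """Encode a DNA sequence as a (length, 4) matrix."""
    encoded = np.zeros((len(seq), len(alphabet)))
    for pos, base in enumerate(seq):
        encoded[pos, alphabet.index(base)] = 1.0
    return encoded

def shared_trunk(encoded, kernels):
    """Apply every filter along the sequence, then ReLU and global max pool."""
    width = kernels.shape[1]
    n_windows = encoded.shape[0] - width + 1
    windows = np.stack([encoded[i:i + width] for i in range(n_windows)])
    # windows: (n, width, 4), kernels: (filters, width, 4) -> (n, filters)
    activations = np.einsum("nwc,fwc->nf", windows, kernels)
    return np.maximum(activations, 0.0).max(axis=0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
kernels = rng.normal(size=(8, 5, 4))     # 8 filters of width 5
head_binding = rng.normal(size=8)        # head 1: TF binding
head_accessibility = rng.normal(size=8)  # head 2: chromatin accessibility

# Both heads read the same shared features from a single sequence.
features = shared_trunk(one_hot("ACGTTGCAACGTAGCTAGCT"), kernels)
p_binding = float(sigmoid(features @ head_binding))
p_accessible = float(sigmoid(features @ head_accessibility))
```

During training, the losses from both heads would update the shared filters, which is how one target can help the network learn features useful for the other.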

