Genomic Applications

Created by Lily Vittayarukskul for SVAI research community. Open to collaborators!

Common Genomic Data Files

In this section, we'll dive into how the most commonly used genomic data is organized in the real world.

Whole Genome Sequencing (WGS)

Whole Exome Sequencing (WES)

Variant Calling File (VCF)

Generating the VCF: Germ-line versus Somatic

Extracting Cancer Somatic Mutations

Let's say we have a person named Kat. Kat has been diagnosed with a rare form of lung cancer, and the SVAI community wants to help her out. As scientists and researchers, we might be curious about whether Kat's DNA has some clues to help us better understand her cancer.

One major clue may exist in the DNA code itself -- cancer is commonly known to alter the regulation of genes or genes themselves in order for cancer to grow in a more favorable environment. In order to figure out if her DNA has been altered in relation in cancer, we classify these types of changes (aka 'variations') under somatic mutations -- mutations introduced into the DNA anytime after an individual is born. However, we don't want to find all mutations introduced to the genome after birth, but the mutations specifically linked to cancer. These mutations may be linked to the introduction of cancer and allowed it to progress to later stages.

The process to robustly uncover somatic mutations related to cancer can be illustrated below:

The basic idea behind this 'subtracting' logic is that the somatic mutations related to cancer is isolated where we see the phenotype of the cancer -- her lung region. For all localized cancer phenotypes, the assumption is that the cancer mutations is isolated to that region only. Then, we can assume that her blood DNA lacks the somatic cancerous mutations and we can compare the tumor DNA to 'normal DNA' (in this case, blood DNA) and the variations unique to the tumor DNA are the mutations that related to Kat's lung cancer.

After finding these 'uniquely' tumor mutations, we need to further filter out common variants existing in the general population because these variants are most likely harmless to due the fact that harmful variants in the population don't exist very long (aka 'negative selection'). Survival of the fittest ya know.

And viola! You've got your somatic variants of interest. Now let's see this baby in action:

Code Walk-Through 🚢🏽

The source code can be found here. Thanks to C3 team for your elegant code!

First, we need to important libraries to use their powerful tools:

import vcf
import pandas as pd

vcf is <>. pandas is <>

Next, let's define a function, getInfo, that extracts important details about our variants:

def getInfo(vcf_file, cancer_ID, qual_flag):
    """
    Takes in a VCF file and extracts the specified columns
    vcf_file: name of the VCF file
    cancer_ID: sample ID
    qual_flag: whether to check for quality. 
               If True - filter for quality threshold at 30. 
    
    return: filtered list of dict
    """
    out = list()
    for record in vcf_file:
        if qual_flag == True and record.QUAL<=30.0:
            continue
        d = {}
        d['POS'] = record.POS # the 1- based position of the variation on the sequence
        d['REF'] = record.REF # reference base at the given position of the given reference sequence  
        d['ALT'] = record.ALT # list of alternative alleles at this position
        d['CHROM'] = record.CHROM # chromosome number on which variant is being called 
        d['GT'] = record.genotype(cancer_ID)['GT'] # genotype
        d['AD'] = record.genotype(cancer_ID)['AD'] # allelic depths for the ref and alt alleles
        d['DP'] = record.genotype(cancer_ID)['DP'] # approximate read depth
        d['GQ'] = record.genotype(cancer_ID)['GQ'] # confidence that the assigned genotype is correct 
        d['PL'] = record.genotype(cancer_ID)['PL'] # normalized, phred-scaled likelihoods for genotypes
        out.append(d)
    return out

Special Note: If you need to find the sample id of the vcf file, this can be done via accessing the metadata of the vcf file:

vcf_reader = vcf.Reader(open('<vcf_filename>', 'r'))

vcf_reader.samples

Now, let's read in our tumor variant DNA, blood variant DNA, and population variant DNA (aka 'reference' variant DNA) and for each file, let's apply our function, getInfo, to extract the important variant details and save it to a variable called my_dict. To efficiently store the data into a dataframe and easily access that dataframe without reloading the original file again, we call the function, to_pickle(), to serialize the dataframe object:

# read the vcf file for patient's tumor
vcf_reader = vcf.Reader(open('G97552.vcf', 'r')) 
# filter the VCF data
my_dict = getInfo(vcf_reader, 'OF_010116NF2_a', True)
# save the filtered data
df = pd.DataFrame(my_dict)
df.to_pickle('VCF Data_Cancer')

# read vcf file for patient's blood
vcf_reader = vcf.Reader(open('G91716.vcf', 'r'))
# filter the VCF data
my_dict = getInfo(vcf_reader, 'OF_112015SJIA_2', False)
# save the filtered data
df = pd.DataFrame(my_dict)
df.to_pickle('VCF_Data_Normal')

# read the vcf file for a reference
vcf_reader = vcf.Reader(open('G91716.vcf', 'r')) 
# filter the VCF data
my_dict = getInfo(vcf_reader, 'VB_112015SJIA_1', False)
# save the filtered data
df_r = pd.DataFrame(my_dict)
df_r.to_pickle('VCF_Data_Ref')

Cool beans. Now that we have a function to extract out the type of variant metadata we want and we have our data, let's deserialize our dataframe objects and start 'subtracting':

# read in saved dataframes
df_n = pd.read_pickle('VCF_Data_Normal') # patient's blood
df_c = pd.read_pickle('VCF_Data_Cancer') # patient's tumor
df_r = pd.read_pickle('VCF_Data_Ref') # reference patient's blood

# convert ALT column's content to a string
df_n.ALT = df_n.ALT.apply(lambda s: str(s[0]) if len(s)==1 else (str(s[0])+'*'))
df_c.ALT = df_c.ALT.apply(lambda s: str(s[0]) if len(s)==1 else (str(s[0])+'*'))
df_r.ALT = df_r.ALT.apply(lambda s: str(s[0]) if len(s)==1 else (str(s[0])+'*'))

# select rows that are common in patient's tumor and blood genomes, based on pos and chrom
common = pd.merge(df_c, df_n, how= 'inner',on=['POS','CHROM'])
common['Test'] = common['POS'].astype(str) + common['CHROM'].astype(str)
df_c['Test'] = df_c['POS'].astype(str) + df_c['CHROM'].astype(str)

# select rows of patient's tumor genome that are not present in blood genome
cancer_only = df_c[(~df_c.Test.isin(common.Test))]
cancer_only['Test'] = cancer_only['POS'].astype(str) + cancer_only['CHROM'].astype(str)

Now that we have our variants of interest, theoretically we'd be done, but alas, we exist in the real world, and the variants we recorded in our vcf files may not be real variants. To address this issue, our vcf file contains the following columns, 'AD' and 'GQ' for quality control. We basically want to drop variants that have less than 15 reads supporting the alternate allele ( AD<15 ) and also drop variants called at a confidence level of less than 95% (GQ<95):

# filter data by AD and GQ
cancer_only['AD_temp'] = cancer_only.AD.apply(lambda x: x[1])
cancer_only = cancer_only.drop(cancer_only[cancer_only.AD_temp<15].index)
cancer_only = cancer_only.drop(cancer_only[cancer_only.GQ<95].index)

# drop the temporary columns created
cancer_only = cancer_only.drop(['Test', 'AD_temp'], 1)

Finally, we're done!! πŸ™‚Let's save our hard work:

# saving clean data to .csv file
cancer_only.to_csv('VCF_clean.csv')

Important note: If our feature set (number of variants in our cancer_only dataframe) is too big to run further computations, a statistically clean option to reduce our dataframe would be to randomly drop some variants:

#feature set still too big. randomly dropout columns, keep a fraction.
nb = 600.0 # number of samples to keep
cancer_only = cancer_only.sample(frac=nb/len(cancer_only))

Linking Genes to Molecular Processes (GO annotations)

(Source paper for best practices can be found here.)

The Gene Ontology (GO) is a controlled vocabulary composed of >38 000 precise defined phrases called GO terms that describe the molecular actions of gene products, the biological processes in which those actions occur and the cellular locations where they are present.

Gene product: Object of annotation

The annotation object or molecular entity are those defined by the Sequence Ontology [(13), http://www.sequenceontology.org] and includes complex, gene, gene_product, miRNA, ncRNA, protein, protein_complex, protein_structure, RNA, rRNA, snoRNA, snRNA, transcript, tRNA and polypeptide. While annotations are typically created for chromosomal features, such as a gene for its protein or ncRNA product, other types of objects can be annotated including groups of gene products that make a complex. The annotation object can be associated to a GO term from one or more of the three aspects of the GO (Molecular Function, Biological Process and Cellular Component). A gene product is the most common object of annotation, and all such objects require a stable identifier such as those specified by sequence databases maintained by European Bioinformatic Institute (EBI) and National Center for Biotechnology Information (NCBI). Model Organism Databases (MODs) also maintain unique identifiers that often represent specific types of molecular entities such as RNA transcripts that often do not have an identifier from one of the archival repositories.

Annotation via Sequence Similarity

GO terms can be assigned to gene products on the basis of sequence similarity using the evidence code 'Inferred from Sequence or structural Similarity' (ISS) with a custom reference, GO_Reference (GO_REF:0000024), as described in the next section. Potential homologs are initially identified using sequence similarity search programs such as BLAST. The significance of the sequence similarity is then verified manually using a combination of sequence resources and analysis tools, including phylogenetic and comparative genomics databases such as Ensembl Compara (16), INPARANOID (17) and OrthoMCL (18).

Although the similarity criteria required to make these annotations are defined by the annotating group, the GOC has established several rules for making these assignments. They are as follows:

  1. Mandatory inclusion of a stable database identifier that identifies the similar gene/gene product in the β€˜WITH/FROM’ field (column 8 in Table 3)

  2. The similar gene must be experimentally characterized; to avoid circular inferences, the GO term should only be assigned if the similar gene/gene product is, or can be annotated, with the same term (or a more specific child term) using an experimental evidence code (e.g. Inferred from Direct assay, IDA; Inferred from Mutant Phenotype, IMP; Inferred from Genetic Interaction, IGI, Inferred from Physical Interaction, IPI; Inferred from Expression Pattern, IEP). Annotations made with the NOT qualifier should not be transferred.

Sequence characteristics can be used to infer GO annotations for all three aspects of the ontology. However, care should be taken when transferring biological process annotations, as cellular processes and metabolic processes, for example, may be more readily inferred from sequence similarity than developmental processes which may be species- or clade-specific

Molecular Function

Molecular function describes activities, such as catalytic, binding or transporter activities, at the molecular level. GO molecular function terms describe activities rather than the entities (complexes, gene products or molecules) that perform the actions. In addition sequence comparison methods (similarity to other organisms) are often used to predict the molecular function of a gene product because functions are often associated with conserved protein domains (see Figure 2 to compare evidence from experimental and nonexperimental results).

The key question to ask when selecting a Molecular Function term is whether the experimental results show β€˜how’ the gene product accomplishes its role.

The Molecular function ontology also contains terms that describe protein–protein interactions. A rule of thumb is to determine whether the gene product being annotated is accomplishing a biological purpose by binding to another protein: if so, protein binding could be one of its functions.

Furthermore, it's unnecessary to co-annotate the binding of the substrates, cofactors or products, as the enzymatic activity is defined by the compounds being bound, if only in a transition state.

Biological process

Biological process describes biological goals accomplished by one or more ordered assemblies of molecular functions. A biological process is not equivalent to a pathway. Specifically it does not represent any of the dynamics or dependencies that would be required to describe a pathway.

On occasion when authors present experimental results for a gene product’s role in a specific type of process, they then extrapolate to infer its role in other related processes.

Direct versus indirect effect. Many GO Biological Process annotations are assertions based upon mutant phenotypes. When annotating based upon mutant phenotype results, it can be difficult to discern if a gene product is directly involved in the process for which the authors screened (assayed) or if its absence instead results in an indirect or downstream effect. (For example, if any of the S. cerevisiae proteins involved in β€˜RNA splicing’ [GO:0008380] are mutated, translation is affected. This is a downstream effect because most of the genes encoding ribosomal proteins have introns (example, yeast ribosomal genes RPL2A, RPL2B, RPS11A, RPS11B) and if splicing genes are mutated, these ribosomal genes are not processed thereby affecting ribosomal assembly and hence translation.)

The β€˜response to’ GO terms are intended to annotate gene products that are required for the response to occur and are a direct result of the organism’s reaction to the stimuli (e.g. production of a gene product used to degrade a toxin or signaling to initiate immune cell division in response to a parasite).

Regulation of a biological process is defined as a role that modulates the frequency, rate or extent of that process.

Cellular component

Cellular Component describes locations, at the levels of subcellular structures and macromolecular complexes.

Code Walk-Through 🚢🏽

Let's say we have a person named Kit. Kit has been diagnosed with a rare form of brain cancer, and the SVAI community wants to help him out. Let's say we already obtained our list of genes that may be linked to Kit's brain cancer. As scientists and researchers, we might be curious about whether Kit's DNA has some clues to help us better understand his cancer.

First, we need to get the relevant GO annotations for Kit's type of cancer. Since he has a form of brain cancer, the most relevant types of data from the Broad Institute site are:

  • Glioblastoma multiforme

  • Glioma

Under these datasets, we specifically want annotated point mutations and indels with functional data relevant to cancer researchers. Annotations include gene names, functional consequence (e.g. Missense), PolyPhen-2 predictions, and cancer-specific annotations from resources such as COSMIC, Tumorscape, and published MutSig results. This type of annotation is available via:

  1. In the table, under the column 'Disease Name', located the row containing your disease of interest (e.g. for our example case, we're interest in brain cancers, Glioblastoma multiforme and Glioma).

  2. Then in the same row(s), click on the link under the 'Data' column.

4. Now you should be directed to a window that looks like this:

5. Since we're interested in the genes associated with cancer, and we've already chosen the mutated genes unique to the tumor sample, we want to know how the mutation may have affected our biological and molecular processes that contributed to the cancer development and progression. This relevant data is under the section 'Mutation Annotation File', attached to the file name 'Mutation_Packager_Oncotated_Calls'.

6. Now that we've downloaded and unzipped our data of interest, we can extract the relevant mutation data associated to our gene of interest. C3 team has a great example:

(TODO: Look at code carefully, and have a more guidance code walkthrough. If have time, convert code to Python)

library(data.table)
library(deconstructSigs)
library(BSgenome.Hsapiens.1000genomes.hs37d5)
library(ggplot2)
library(topGO)
library(org.Hs.eg.db)
library(gridExtra)

setwd("/media/avantika/My Passport/nf2hack/")

#Read point mutation data from TCGA
files_GBM = grep(".txt", list.files("/media/avantika/My Passport/nf2hack/gdac.broadinstitute.org_GBM.Mutation_Packager_Oncotated_Calls.Level_3.2016012800.0.0/", full.names=T), value=T)
files_LGG = grep(".txt", list.files("/media/avantika/My Passport/nf2hack/gdac.broadinstitute.org_LGG.Mutation_Packager_Oncotated_Calls.Level_3.2016012800.0.0/", full.names=T), value=T)

files = c(files_GBM, files_LGG)
cols = c("Hugo_Symbol", "Entrez_Gene_Id", "Chromosome", "Start_position", "End_position", "Strand",
         "Variant_Classification", "Variant_Type", "Reference_Allele", "Tumor_Seq_Allele2", 
         "GO_Biological_Process",	"GO_Molecular_Function")

#Select relevant fields and concatenate patient files
for(i in 1:length(files)){
  f = files[i]
  patient = strsplit(strsplit(f, split = "TCGA-")[[1]][2], split = ".hg19")[[1]][1]
  type = strsplit(strsplit(f, split = "broadinstitute.org_")[[1]][2], split = ".Mutation_Packager_")[[1]][1]
  patient = paste0(type, "_", patient)
  x = fread(f, header = T)
  x = subset(x, select = cols)
  x = cbind(patient, x)
  write.table(x, file = "muts.txt", append = T, quote = F, col.names = F, row.names = F, sep = "\t")
}

muts = fread("muts.txt")
setnames(muts, c("patient", "Hugo_Symbol", "Entrez_Gene_Id", "Chromosome", "Start_position", "End_position", "Strand",
                 "Variant_Classification", "Variant_Type", "Reference_Allele", "Tumor_Seq_Allele2", 
                 "GO_Biological_Process",	"GO_Molecular_Function"))
         
#Generate feature matrix for genes
gene_patient = dcast(unique(muts[,list(patient, Hugo_Symbol)]), Hugo_Symbol~patient, fun.aggregate = length, fill = 0)
write.table(gene_patient, file = "gene_patient.txt", quote = F, sep = "\t")

#Generate feature matrix for GO:biological annotation
go = unique(muts[,list(patient, Hugo_Symbol, Entrez_Gene_Id, GO_Biological_Process,	GO_Molecular_Function)])
biol = cbind(go[, list(patient, Hugo_Symbol, Entrez_Gene_Id)], as.data.table(tstrsplit(go$GO_Biological_Process, split = "\\|")))
biol = na.omit(melt(biol, id.vars = c(1,2,3), value.name = "go"))[,variable := NULL]
biol_patient = dcast(biol[,.N, by = list(go, patient)], go~patient, value.var = "N", fill=0)
write.table(biol_patient, file = "biol_patient.txt", quote=F, sep = "\t")

#Generate feature matrix for GO:molecular annotation
molec = cbind(go[, list(patient, Hugo_Symbol, Entrez_Gene_Id)], as.data.table(tstrsplit(go$GO_Molecular_Function, split = "\\|")))
molec = na.omit(melt(molec, id.vars = c(1,2,3), value.name = "go"))[,variable := NULL]
molec_patient = dcast(molec[,.N, by = list(go, patient)], go~patient, value.var = "N", fill=0)
write.table(molec_patient, file = "molec_patient.txt", quote=F, sep = "\t")

#Generate feature matrix for GO:biological enrichment
biol_go_ngenes = biol[, length(unique(Entrez_Gene_Id)), by=go]
biol_go = dcast(biol[, .N, by = list(patient, go)], go~patient, value.var = "N", fill = 0)
biol_go = merge(biol_go_ngenes, biol_go, by = "go")
setnames(biol_go, "V1", "n_genes_in_go")
biol_go = biol_go[n_genes_in_go>10]

ps = colnames(biol_go)[3:ncol(biol_go)]
n_genes_mut_p = colSums(biol_go[,3:ncol(biol_go)])
n_genes = length(unique(biol$Entrez_Gene_Id))

go_enr = data.table()

for(p in ps){
  n_genes_mut = n_genes_mut_p[p]
  n_genes_not_mut = n_genes - n_genes_mut
  
  x = biol_go[,list(go,n_genes_in_go,get(p))][V3>1]
  
  if(nrow(x)>0){
    fish = c()
    for(j in 1:nrow(x)){
      fish = c(fish, fisher.test(matrix(c(x$V3[j], x$n_genes_in_go[j]-x$V3[j], n_genes_mut, n_genes_not_mut), nrow=2))$p.value)
    }
    x$fish=fish
    x = x[fish<0.05]
    x$patient=p
  }
  
  if(p == ps[1]){
    go_enr = x
  }
  else if(nrow(x)>0){
    go_enr = rbind(go_enr,x)
  }
}

biol_enr_patient = dcast(go_enr, go~patient, value.var = "V3", fill=0)
write.table(biol_enr_patient, file = "biol_enr_patient.txt", quote=F, sep = "\t")

#Feature matrix for molecular GO enrichment
molec_go_ngenes = molec[, length(unique(Entrez_Gene_Id)), by=go]
molec_go = dcast(molec[, .N, by = list(patient, go)], go~patient, value.var = "N", fill = 0)
molec_go = merge(molec_go_ngenes, molec_go, by = "go")
setnames(molec_go, "V1", "n_genes_in_go")
molec_go = molec_go[n_genes_in_go>10]

ps = colnames(molec_go)[3:ncol(molec_go)]
n_genes_mut_p = colSums(molec_go[,3:ncol(molec_go)])
n_genes = length(unique(molec$Entrez_Gene_Id))

go_enr = data.table()

for(p in ps){
  n_genes_mut = n_genes_mut_p[p]
  n_genes_not_mut = n_genes - n_genes_mut
  
  x = molec_go[,list(go,n_genes_in_go,get(p))][V3>1]
  
  if(nrow(x)>0){
    fish = c()
    for(j in 1:nrow(x)){
      fish = c(fish, fisher.test(matrix(c(x$V3[j], x$n_genes_in_go[j]-x$V3[j], n_genes_mut, n_genes_not_mut), nrow=2))$p.value)
    }
    x$fish=fish
    x = x[fish<0.05]
    x$patient=p
  }
  
  if(p == ps[1]){
    go_enr = x
  }
  else if(nrow(x)>0){
    go_enr = rbind(go_enr,x)
  }
}

molec_enr_patient = dcast(go_enr, go~patient, value.var = "V3", fill=0)
write.table(molec_enr_patient, file = "molec_enr_patient.txt", quote=F, sep = "\t")

#Generate feature matrix for sequence based clustering
mut = muts[Variant_Type=="SNP", list(patient, Chromosome, Start_position, Reference_Allele, Tumor_Seq_Allele2)]
setnames(mut, c("sample", "chr", "pos", "ref", "alt"))
sigs_input = mut.to.sigs.input(mut.ref = mut, 
                               sample.id = "sample", chr = "chr", pos = "pos", ref = "ref", alt = "alt", 
                               bsg = BSgenome.Hsapiens.1000genomes.hs37d5)
muttype_patient = data.table(t(sigs_input), keep.rownames = T)
setnames(muttype_patient, "rn", "Variant_Type")
svtype_patient = as.data.table(dcast(muts[Variant_Type %in% c("DEL", "INS"), .N, by = list(patient, Variant_Type)], Variant_Type~patient, value.var = "N", fill=0))
muttype_patient = rbind(muttype_patient, svtype_patient, fill = TRUE)

muttype_patient = melt(muttype_patient, id.vars = 1, variable.name = "patient", value.name = "N")
muttype_patient[is.na(N), N:=0]
muttype_patient[, N.patient := sum(N), by = patient]
muttype_patient[, freq := N/N.patient]

muttype_patient = dcast(muttype_patient, Variant_Type~patient, value.var = "freq")
write.table(muttype_patient, file = "muttype_patient.txt", quote=F, sep="\t")


#Find mutational signatures in TCGA patients
mut2 = as.data.frame(muts[Variant_Type=="SNP", list(patient, Chromosome, Start_position, Reference_Allele, Tumor_Seq_Allele2)])
colnames(mut2) = c("sample", "chr", "pos", "ref", "alt")
sigs_input = mut.to.sigs.input(mut.ref = mut2, 
                               sample.id = "sample", chr = "chr", pos = "pos", ref = "ref", alt = "alt", 
                               bsg = BSgenome.Hsapiens.1000genomes.hs37d5)
sigs = NULL
for(i in 1:nrow(sigs_input)){
  if(i==1){
    sigs = whichSignatures(sigs_input, rownames(sigs_input)[i], signatures.ref = signatures.cosmic, contexts.needed = T)$weights
  }
  else{
    sigs = rbind(sigs, whichSignatures(sigs_input, rownames(sigs_input)[i], signatures.ref = signatures.cosmic, contexts.needed = T)$weights)
    
  }
}

#Generate feature matrix for mutational signatures
sigs_patients = t(sigs)
write.table(sigs_patients, file = "sigs_patients.txt", quote = F, sep = "\t")

#Filter coding mutations for Onno

load("exons.RData")
exons = unique(exons[, list(chrom, start, end)])
mut_o = as.data.table(read.csv("VCF_filtered_GQ_thresh.csv"))
mut_o = mut_o[, c("CHROM", "POS", "REF", "ALT")]
mut_o = mut_o[nchar(as.character(REF))==1 & nchar(as.character(ALT))==1]

mut_o_filt = mut_o[2,]
for(i in 3:nrow(mut_o)){
  if(i%%100==0){print(i)}
  if(nrow(exons[chrom==mut_o[i, CHROM] & start <= mut_o[i, POS] & end >= mut_o[i, POS]])>0){
    mut_o_filt = rbind(mut_o_filt, mut_o[i,])
  }
}

mut_o_filt[, sample := "onno"]
mut_o_filt = as.data.frame(mut_o_filt)

write.table(mut_o_filt, file = "mut_o_filt.txt", quote=F, sep="\t", row.names = F)

#Find signatures for Onno
sigs_input = mut.to.sigs.input(mut.ref = mut_o_filt, 
                               sample.id = "sample", chr = "CHROM", pos = "POS", ref = "REF", alt = "ALT")
sigs = whichSignatures(sigs_input, "onno", signatures.ref = signatures.cosmic, contexts.needed = T)$weights
sigs_onno = t(sigs)
write.table(sigs_onno, file = "sigs_onno.txt", quote = F, sep = "\t")

#Find mutated genes for onno
load("genes.RData")
genes = unique(genes[, list(name2, chrom, cdsStart, cdsEnd)])
mut_o_filt = as.data.table(mut_o_filt)

onno_genes = c()
for(i in 1:nrow(mut_o_filt)){
  x = genes[chrom==mut_o_filt[i, CHROM] & cdsStart <= mut_o_filt[i, POS] & cdsEnd >= mut_o_filt[i, POS]]
  if(nrow(x)>0){
    onno_genes = c(onno_genes, as.character(x$name2))
  }
}
onno_genes = unique(onno_genes)

onno_biol = biol[Hugo_Symbol %in% onno_genes, .N, by = go]
onno_biol = merge(biol_patient, onno_biol, by = "go", all.x=T)[, list(go, N)]
onno_biol[is.na(N), N:=0]
write.table(onno_biol, file = "onno_biol.txt", quote=F, col.names = F, row.names = F, sep="\t")


biol_onno = t(as.data.table(onno_biol))
write.table(biol_onno, file = "biol_onno.txt", quote=F, col.names = F, row.names = F, sep="\t")

Separate Python version (create code for 1 + 2, talk through 3):

  1. import dataset of interest, including reference

  2. grab cols for feature matrices, save the files

  3. Steps forwards with feature matrices: generate mutation signatures via deconstruct_sigs_py (installing is currently broken), downstream PCA and cluster and find the cluster that most identified with; consider additional pathways for the other feature matrices

    1. Explain out the different feature matrices to

  4. Look at best practices applied with C3's approach to classifying the signature and deriving molecular relevance: https://github.com/SVAI/C3/blob/master/C3finalpresentation.pdf

    1. (consider if t-SNE is a better classifier)

Broad firehose database is best to gather relevant cancer data.

Last updated