
parsing output to make transcript-gene.csv table for tximport

Open johnsolk opened this issue 5 years ago • 0 comments

Can dammit output this table?

tximport requires a table that maps each transcript ID to a gene ID.

Example table file

Made by following the tutorial:

import pandas as pd
from dammit.fileio.gff3 import GFF3Parser

gff_file = "trinity.nema.fasta.dammit.gff3"
annotations = GFF3Parser(filename=gff_file).read()
# Keep the lowest-score (E-value) hit below 1e-05 for each transcript (seqid).
names = (
    annotations.sort_values(by=["seqid", "score"], ascending=True)
    .query("score < 1e-05")
    .drop_duplicates(subset="seqid")[["seqid", "Name"]]
)
# Drop rows where both seqid and Name are missing.
new_file = names.dropna(axis=0, how="all")
print(new_file.head())
# The DataFrame index is written as the first CSV column.
new_file.to_csv("nema_gene_name_id.csv")

Used for tximport:

library(tximport)
tx2gene <- read.csv("~/nema_gene_name_id.csv")
# Column 1 is the pandas index; columns 2 and 3 hold seqid and Name.
tx2gene <- tx2gene[, c(2, 3)]
colnames(tx2gene) <- c("transcript_id", "gene_id")
txi.salmon <- tximport(files, type = "salmon", tx2gene = tx2gene, importer = read.delim)

People won't always want to pick the best annotation by taking the lowest E-value. Other options for this table include keeping only genes from a custom database, or choosing the longest transcript. There are many more ways to do this.
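As a rough illustration of how two of these choices could share one code path, here is a minimal pandas sketch. It is not part of dammit: `best_hits` is a hypothetical helper, the toy values are made up, and it assumes the parsed GFF3 table has a `source` column naming the database each hit came from, alongside the `seqid`, `score` and `Name` columns used above.

```python
import pandas as pd

# Toy stand-in for the DataFrame returned by GFF3Parser(...).read().
# All values are made up; "source" naming the database is an assumption.
annotations = pd.DataFrame({
    "seqid":  ["tx1", "tx1", "tx2", "tx2", "tx3"],
    "source": ["dbA", "custom", "dbA", "dbA", "custom"],
    "score":  [1e-30, 1e-10, 1e-8, 1e-3, 1e-20],
    "Name":   ["geneA", "geneB", "geneC", "geneD", "geneE"],
})


def best_hits(df, max_evalue=1e-5, sources=None):
    """Return one (seqid, Name) row per transcript, taking the lowest score.

    sources: optional list of database names to restrict hits to
    (hypothetical parameter, for illustration only).
    """
    hits = df[df["score"] < max_evalue]
    if sources is not None:
        hits = hits[hits["source"].isin(sources)]
    return (
        hits.sort_values(by=["seqid", "score"], ascending=True)
        .drop_duplicates(subset="seqid")[["seqid", "Name"]]
        .reset_index(drop=True)
    )


by_evalue = best_hits(annotations)                        # lowest E-value overall
custom_only = best_hits(annotations, sources=["custom"])  # custom database only
```

Transcripts with no hit in the selected databases simply drop out of the table, so whether tximport should see them under a fallback gene ID is a separate decision.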

Some notebooks attempting to parse and make this table:

The easiest approach might be to take the lowest E-value hit for each transcript and write the transcript-gene.csv table for tximport from that. More options can be added later.
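A minimal sketch of what that first version could look like, assuming the parsed annotations are a pandas DataFrame with `seqid`, `score` and `Name` columns as in the snippet above. `write_tx2gene` is a hypothetical name, not an existing dammit function, and the toy data is made up.

```python
import io

import pandas as pd


def write_tx2gene(annotations, out, max_evalue=1e-5):
    """Write a transcript_id,gene_id CSV from the lowest-score hit per transcript."""
    tx2gene = (
        annotations[annotations["score"] < max_evalue]
        .sort_values(by=["seqid", "score"], ascending=True)
        .drop_duplicates(subset="seqid")[["seqid", "Name"]]
        .dropna(subset=["Name"])  # skip transcripts whose best hit has no Name
        .rename(columns={"seqid": "transcript_id", "Name": "gene_id"})
    )
    # index=False keeps the file to exactly two columns.
    tx2gene.to_csv(out, index=False)
    return tx2gene


# Toy data, written to an in-memory buffer instead of a file on disk.
toy = pd.DataFrame({
    "seqid": ["tx1", "tx1", "tx2"],
    "score": [1e-12, 1e-40, 1e-7],
    "Name":  ["geneX", "geneY", "geneZ"],
})
buffer = io.StringIO()
table = write_tx2gene(toy, buffer)
csv_text = buffer.getvalue()
```

Because no index column is written, the R side could pass the result of `read.csv` to tximport directly, without the `tx2gene[,c(2,3)]` subsetting step.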

johnsolk, Dec 12 '18 04:12