dammit
parsing output to make transcript-gene.csv table for tximport
Can dammit output this table?
tximport requires a table that maps each transcript ID to a gene ID.
Made by following the tutorial:
import pandas as pd
from dammit.fileio.gff3 import GFF3Parser
gff_file = "trinity.nema.fasta.dammit.gff3"
annotations = GFF3Parser(filename=gff_file).read()
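# Sort so the lowest score (E-value) comes first within each transcript,
# keep hits below the 1e-05 cutoff, then keep the first (best) hit per transcript.
names = (
    annotations
    .sort_values(by=['seqid', 'score'], ascending=True)
    .query('score < 1e-05')
    .drop_duplicates(subset='seqid')[['seqid', 'Name']]
)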
new_file = names.dropna(axis=0,how='all')
new_file.head()
new_file.to_csv("nema_gene_name_id.csv")
exit()
Used for tximport:
library(tximport)
tx2gene <- read.csv("~/nema_gene_name_id.csv")
# Column 1 is the row index written by pandas; keep only seqid and Name.
tx2gene <- tx2gene[, c(2, 3)]
cols <- c("transcript_id", "gene_id")
colnames(tx2gene) <- cols
txi.salmon <- tximport(files, type = "salmon", tx2gene = tx2gene, importer = read.delim)
People won't always want to pick the best annotation by lowest E-value. Other options for building this table: include only genes from a custom database, or pick the longest transcript. There are many more ways to do this.
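One way to keep these options open is to make the selection rule a swappable function. This is a rough sketch, not anything dammit provides. It assumes each annotation is a dict with `seqid`, `Name`, `score`, `start` and `end` keys, plus a hypothetical `database` key naming the source database (the real GFF3 attribute may be called something else). With pandas, `annotations.to_dict("records")` gives rows in this shape. "Longest" is read here as the annotation with the longest span, which is only one possible interpretation. The toy values are made up.

```python
from collections import defaultdict

# Toy annotation rows; values are invented for illustration only.
rows = [
    {"seqid": "tx1", "Name": "geneA", "score": 1e-30, "start": 1, "end": 300, "database": "Pfam"},
    {"seqid": "tx1", "Name": "geneB", "score": 1e-10, "start": 1, "end": 900, "database": "custom"},
    {"seqid": "tx2", "Name": "geneC", "score": 1e-8, "start": 10, "end": 200, "database": "Pfam"},
]

def pick_per_transcript(rows, key, keep=lambda r: True):
    """Return {seqid: Name}, choosing per transcript the kept row that minimizes key(row)."""
    groups = defaultdict(list)
    for r in rows:
        if keep(r):
            groups[r["seqid"]].append(r)
    return {seqid: min(group, key=key)["Name"] for seqid, group in groups.items()}

# Lowest E-value per transcript.
by_evalue = pick_per_transcript(rows, key=lambda r: r["score"])

# Longest annotated span per transcript (negated so min() picks the longest).
by_span = pick_per_transcript(rows, key=lambda r: -(r["end"] - r["start"] + 1))

# Only hits from a custom database (hypothetical `database` key), best E-value among those.
custom_only = pick_per_transcript(
    rows,
    key=lambda r: r["score"],
    keep=lambda r: r["database"] == "custom",
)
```

Transcripts with no hit that passes `keep` simply drop out of the result, so a custom-database table can be much smaller than the lowest-E-value one.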
Some notebooks attempting to parse and make this table:
The easiest thing to do might be to grab the lowest E-value hit for each transcript and write that out as the transcript-gene.csv table for tximport. More options can be added later.
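A minimal sketch of that default, using only the standard library. The function name `write_tx2gene` and the `max_evalue` parameter are hypothetical, not part of dammit; rows are assumed to be dicts with `seqid`, `Name` and `score` keys, and the demo values are made up. Unlike the pandas snippet above, this skips rows with a missing name and writes no index column.

```python
import csv
import os
import tempfile

def write_tx2gene(rows, path, max_evalue=1e-5):
    """Write a transcript_id,gene_id CSV keeping the lowest-score hit per transcript."""
    best = {}
    for r in rows:
        # Skip unnamed hits and hits at or above the E-value cutoff.
        if r["Name"] is None or r["score"] >= max_evalue:
            continue
        current = best.get(r["seqid"])
        if current is None or r["score"] < current["score"]:
            best[r["seqid"]] = r
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["transcript_id", "gene_id"])
        for seqid in sorted(best):
            writer.writerow([seqid, best[seqid]["Name"]])
    return len(best)

# Demo with invented values; tx2 is dropped because its only hit misses the cutoff.
demo_rows = [
    {"seqid": "tx1", "Name": "geneA", "score": 1e-30},
    {"seqid": "tx1", "Name": "geneB", "score": 1e-10},
    {"seqid": "tx2", "Name": "geneC", "score": 1e-3},
    {"seqid": "tx3", "Name": "geneD", "score": 1e-6},
]
out_path = os.path.join(tempfile.mkdtemp(), "transcript-gene.csv")
n_written = write_tx2gene(demo_rows, out_path)
with open(out_path, newline="") as fh:
    table = list(csv.reader(fh))
```

Because the file has exactly two named columns, the R side could read it with `read.csv` and pass it to tximport without the `tx2gene[, c(2, 3)]` step.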