biocode
convert_gff3_to_ncbi_tbl
Can someone tell me what assumptions convert_gff3_to_ncbi_tbl makes about the formatting of the names? Ours are apparently missing something:
python3 gff/convert_gff3_to_ncbi_tbl.py -i ../juncus.fasta.transdecoder.refined.sort.gff3 -o ../juncus.fasta.transdecoder.refined.sort.tbl -ln LAB -nap NAP -gf ../juncus.fasta
Traceback (most recent call last):
  File "gff/convert_gff3_to_ncbi_tbl.py", line 89, in <module>
    main()
  File "gff/convert_gff3_to_ncbi_tbl.py", line 82, in main
    tbl.print_tbl_from_assemblies(assemblies=assemblies, ofh=ofh, go_obo=args.go_obo, lab_name=args.lab_name)
  File "/tmp/biocode/lib/biocode/tbl.py", line 95, in print_tbl_from_assemblies
    print_biogene(gene=gene, fh=ofh, obo_dict=go_idx, lab_name=lab_name)
  File "/tmp/biocode/lib/biocode/tbl.py", line 122, in print_biogene
    raise Exception("ERROR: locus_tag attributes are required for all gene elements (gene id: {0}".format(gene.id))
Exception: ERROR: locus_tag attributes are required for all gene elements (gene id: Transcript_32960|g.33387
ping @arsilan324
Just remembered add_gff3_locus_tags.py. But apparently some entries in the GFF file don't get a locus_tag. I'm using this command line:
python3 gff/add_gff3_locus_tags.py -i ../juncus.fasta.transdecoder.refined.sort.gff3 -o ../juncus.fasta.transdecoder.refined.sort.lt.gff3 -p PREFIX -a 10
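To see which entries are affected, something like the following could list the gene features that still lack a locus_tag after that step. This is only a minimal sketch using the standard library, not part of biocode: the function name `genes_missing_locus_tag` and the example records (including the `PREFIX_000010` tag format) are made up for illustration, and it assumes plain GFF3 with `key=value` pairs in column 9.

```python
import io

def genes_missing_locus_tag(handle):
    """Return the IDs of gene features whose column 9 has no locus_tag attribute."""
    missing = []
    for line in handle:
        if line.startswith("##FASTA"):
            break  # embedded sequence follows; no more feature lines
        if line.startswith("#") or not line.strip():
            continue  # skip comments, directives and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue
        # Column 9 is a semicolon-separated list of key=value pairs.
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        if "locus_tag" not in attrs:
            missing.append(attrs.get("ID", "<no ID>"))
    return missing

# Made-up records: g.1 carries a locus_tag, g.2 does not.
example = (
    "##gff-version 3\n"
    "c1\ttransdecoder\tgene\t1\t900\t.\t+\t.\tID=g.1;locus_tag=PREFIX_000010\n"
    "c1\ttransdecoder\tmRNA\t1\t900\t.\t+\t.\tID=m.1;Parent=g.1\n"
    "c2\ttransdecoder\tgene\t1\t500\t.\t-\t.\tID=g.2\n"
)
print(genes_missing_locus_tag(io.StringIO(example)))  # ['g.2']
```

On a real file, pass `open("your.lt.gff3")` instead of the `io.StringIO` example.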
Home from my conference and travels. Will get to these tickets later today, just FYI.
I have tracked down this issue. The problem is with the library's treatment of genes with multiple isoforms. When it sees more than one mRNA for a particular gene, it currently spawns off another gene and attaches the mRNA to that one, flattening out the gene/mRNA relationships. I can find no justification for why this behavior was chosen (after about an hour spent tonight searching through fun archives of e-mails with NCBI staff from when I was submitting eukaryotic genomes).
Your file has 95,646 genes and 120,335 mRNAs, so multiple isoforms are common. What was a little surprising was that the mRNA, CDS and exon counts are all 120,335. At first I thought it strange that all your genes were single-exon genes, then realized the source (transdecoder) implied these were from Trinity. So you're doing this in preparation for running tbl2asn for a transcriptome submission.
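For anyone wanting to reproduce counts like these on their own file, here is a rough sketch that tallies feature types and lists genes with more than one mRNA. It is a standalone illustration, not the code biocode uses: `feature_summary` is a hypothetical name, the example records are invented, and it assumes each mRNA names its gene in a `Parent` attribute.

```python
import io
from collections import Counter

def feature_summary(handle):
    """Count features per type and find genes that parent more than one mRNA."""
    type_counts = Counter()
    mrnas_per_gene = Counter()
    for line in handle:
        if line.startswith("##FASTA"):
            break  # embedded sequence follows; no more feature lines
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        type_counts[cols[2]] += 1
        if cols[2] == "mRNA":
            attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
            # Parent may hold a comma-separated list of IDs.
            for parent in attrs.get("Parent", "").split(","):
                if parent:
                    mrnas_per_gene[parent] += 1
    multi_isoform = sorted(g for g, n in mrnas_per_gene.items() if n > 1)
    return type_counts, multi_isoform

# Made-up records: g.1 has two isoforms, g.2 has one.
example = (
    "##gff-version 3\n"
    "c1\ttransdecoder\tgene\t1\t900\t.\t+\t.\tID=g.1\n"
    "c1\ttransdecoder\tmRNA\t1\t900\t.\t+\t.\tID=m.1;Parent=g.1\n"
    "c1\ttransdecoder\tmRNA\t1\t700\t.\t+\t.\tID=m.2;Parent=g.1\n"
    "c2\ttransdecoder\tgene\t1\t500\t.\t-\t.\tID=g.2\n"
    "c2\ttransdecoder\tmRNA\t1\t500\t.\t-\t.\tID=m.3;Parent=g.2\n"
)
counts, multi = feature_summary(io.StringIO(example))
print(counts["gene"], counts["mRNA"], multi)  # 2 3 ['g.1']
```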
I'll fix this so that proper gene representation is done when more than one mRNA is present. If you haven't already, it would be good to review the submission guidelines to see if there are any transcriptome-specific format details. I'll be happy to add any you uncover.
Wonderful. Please send me a ping here, then I can try.
I guess @arsilan324 can say whether the counts of genes, mRNAs, CDSs, and exons are reasonable.
According to Brian Haas (TransDecoder developer): In the data model of TransDecoder, each CDS (and corresponding exon) is tied to its own mRNA, and a single gene is allowed to produce multiple mRNAs. It doesn't allow for the single-mRNA, multi-CDS arrangement (i.e. it doesn't do operons).
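If it helps, that one-CDS-and-one-exon-per-mRNA expectation can be checked directly on a file. The following is a small sketch under the same assumptions as any plain GFF3 reader (children point to their mRNA through `Parent`); `mrnas_breaking_one_to_one` is a hypothetical helper name and the records are invented.

```python
import io
from collections import defaultdict

def mrnas_breaking_one_to_one(handle):
    """Return mRNA IDs whose CDS/exon child counts are not exactly one each."""
    counts = defaultdict(lambda: {"CDS": 0, "exon": 0})
    for line in handle:
        if line.startswith("##FASTA"):
            break  # embedded sequence follows; no more feature lines
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] not in ("CDS", "exon"):
            continue
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                counts[parent][cols[2]] += 1
    # Note: mRNAs with no CDS or exon children at all never appear in counts.
    return {m: c for m, c in counts.items() if c != {"CDS": 1, "exon": 1}}

# Made-up records: m.1 follows the model, m.2 has two CDS features.
example = (
    "c1\ttransdecoder\texon\t1\t900\t.\t+\t.\tID=e.1;Parent=m.1\n"
    "c1\ttransdecoder\tCDS\t10\t600\t.\t+\t0\tID=c.1;Parent=m.1\n"
    "c2\ttransdecoder\texon\t1\t500\t.\t-\t.\tID=e.2;Parent=m.2\n"
    "c2\ttransdecoder\tCDS\t5\t200\t.\t-\t0\tID=c.2a;Parent=m.2\n"
    "c2\ttransdecoder\tCDS\t250\t480\t.\t-\t0\tID=c.2b;Parent=m.2\n"
)
print(mrnas_breaking_one_to_one(io.StringIO(example)))
# {'m.2': {'CDS': 2, 'exon': 1}}
```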