convert_gff3_to_ncbi_tbl

Open bernt-matthias opened this issue 7 years ago • 5 comments

Can someone tell me what assumptions convert_gff3_to_ncbi_tbl makes about the formatting of the names? Apparently ours are missing something:

python3 gff/convert_gff3_to_ncbi_tbl.py -i ../juncus.fasta.transdecoder.refined.sort.gff3 -o ../juncus.fasta.transdecoder.refined.sort.tbl -ln LAB -nap NAP -gf ../juncus.fasta 
Traceback (most recent call last):
  File "gff/convert_gff3_to_ncbi_tbl.py", line 89, in <module>
    main()
  File "gff/convert_gff3_to_ncbi_tbl.py", line 82, in main
    tbl.print_tbl_from_assemblies(assemblies=assemblies, ofh=ofh, go_obo=args.go_obo, lab_name=args.lab_name)
  File "/tmp/biocode/lib/biocode/tbl.py", line 95, in print_tbl_from_assemblies
    print_biogene(gene=gene, fh=ofh, obo_dict=go_idx, lab_name=lab_name)
  File "/tmp/biocode/lib/biocode/tbl.py", line 122, in print_biogene
    raise Exception("ERROR: locus_tag attributes are required for all gene elements (gene id: {0}".format(gene.id))
Exception: ERROR: locus_tag attributes are required for all gene elements (gene id: Transcript_32960|g.33387

ping @arsilan324

bernt-matthias avatar Feb 13 '18 10:02 bernt-matthias

Just remembered add_gff3_locus_tags.py. But apparently some entries in the GFF file don't get a locus_tag. I'm using this command line:

python3 gff/add_gff3_locus_tags.py -i ../juncus.fasta.transdecoder.refined.sort.gff3 -o ../juncus.fasta.transdecoder.refined.sort.lt.gff3 -p PREFIX -a 10
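
To see which gene entries still lack a locus_tag after that step, something like the following might help. It is only a rough sketch and not part of biocode: the function names are made up for illustration, and it assumes the tag is stored as a `locus_tag=` key in column 9 of the gene rows, which is what the error message suggests but which I have not checked against the library.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column (key=value;key=value) into a dict."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def genes_missing_locus_tag(lines):
    """Return the IDs of gene features that carry no locus_tag attribute.

    Hypothetical helper; assumes the attribute key is literally 'locus_tag'.
    """
    missing = []
    for line in lines:
        if line.startswith("##FASTA"):
            break  # embedded sequence section, no more features
        if line.startswith("#") or not line.strip():
            continue  # comments, directives, blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue
        attrs = parse_attributes(cols[8])
        if "locus_tag" not in attrs:
            missing.append(attrs.get("ID", "<no ID>"))
    return missing
```

Calling `genes_missing_locus_tag(open(path))` on the `.lt.gff3` output should list the gene IDs that the converter would later reject.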

bernt-matthias avatar Feb 13 '18 11:02 bernt-matthias

Home from my conference and travels. Will get to these tickets later today, just FYI.

jorvis avatar Feb 15 '18 16:02 jorvis

I have tracked down this issue. The problem is with the library's treatment of genes with multiple isoforms. When it sees more than one mRNA for a particular gene, it currently spawns another gene and attaches the mRNA to that one, flattening out the gene/mRNA relationships. I can find no justification for why this behavior was chosen (after about an hour spent tonight searching through fun archives of e-mails exchanged with NCBI staff while submitting eukaryotic genomes).

Your file has 95,646 genes and 120,335 mRNAs, so multiple isoforms are common. What was a little surprising was that the mRNA, CDS and exon counts are all 120,335. At first I thought it strange that all your genes were single-exon genes, then realized the source (transdecoder) implied these were from Trinity. So you're doing this in preparation for running tbl2asn for a transcriptome submission.
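
For anyone who wants to reproduce those counts on their own file, here is a minimal sketch. It is not the code I used and not part of biocode; `summarize_features` is a hypothetical name, and it assumes mRNA rows point to their gene through the standard GFF3 `Parent` attribute.

```python
from collections import Counter


def parse_attributes(column9):
    """Split a GFF3 attribute column (key=value;key=value) into a dict."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def summarize_features(lines):
    """Count features by type and count mRNAs per parent gene."""
    type_counts = Counter()
    mrnas_per_gene = Counter()
    for line in lines:
        if line.startswith("##FASTA"):
            break  # embedded sequence section, no more features
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        type_counts[cols[2]] += 1
        if cols[2] == "mRNA":
            parent = parse_attributes(cols[8]).get("Parent")
            if parent:
                # Parent may be a comma-separated list in GFF3
                for gene_id in parent.split(","):
                    mrnas_per_gene[gene_id] += 1
    return type_counts, mrnas_per_gene


def multi_isoform_genes(mrnas_per_gene):
    """Return gene IDs that have more than one mRNA, sorted for stable output."""
    return sorted(g for g, n in mrnas_per_gene.items() if n > 1)
```

The genes returned by `multi_isoform_genes` are the ones affected by the flattening described above.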

I'll fix this so that proper gene representation is done when more than one mRNA is present. If you haven't already, it would be good to review the submission guidelines to see if there are any transcriptome-specific format details. I'll be happy to add any you uncover.

jorvis avatar Feb 16 '18 06:02 jorvis

Wonderful. Please send me a ping here, then I can try.

I guess @arsilan324 can say whether the counts of genes, mRNA, CDS, and exons are reasonable.

bernt-matthias avatar Feb 16 '18 08:02 bernt-matthias

According to Brian Haas (Transdecoder developer): In the data model of transdecoder, each CDS (and corresponding exon) is tied to its own mRNA, and a single gene is allowed to produce multiple mRNAs. It doesn't allow for the single-mRNA, multi-CDS arrangement (i.e., it doesn't do operons).

arsilan324 avatar Feb 16 '18 14:02 arsilan324