bp_genbank2gff3.pl parsing issue

Open rbutleriii opened this issue 9 years ago • 2 comments

Hello, when parsing a .gbf file generated by tbl2asn, such as one being prepared for submission, the ACCESSION field for each locus is empty (not yet assigned by GenBank). bp_genbank2gff3.pl then assigns the value "unknown" as the region ID for every locus:

Contig_1 GenBank region 1 1627 . + 1 ID=unknown;Dbxref=BioProject:###########;Name=unknown;Note=Clostridium sporogenes.,clade I;isolate=2007;mol_type=genomic DNA;organism=Clostridium sporogenes

This creates an issue for downstream parsing, because all the nucleotide FASTA headers at the bottom of the file are the same (>unknown). Can the script be modified to either number them uniquely or else use the LOCUS value when an ACCESSION is not available?
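To make the request concrete, here is a minimal sketch of the fallback I have in mind. It is written in Python purely for illustration and is not the script's actual Perl code. The function name `choose_region_id` and the `seen` dictionary are hypothetical, and the sketch assumes the accession arrives either empty or as the literal string "unknown".

```python
def choose_region_id(accession, locus, seen):
    """Pick a region ID: prefer ACCESSION, fall back to LOCUS, then number duplicates.

    `seen` is a dict that counts how many times each candidate ID has been used,
    so repeated IDs get a numeric suffix (hypothetical helper, for illustration only).
    """
    candidate = (accession or "").strip()

    # Treat an empty accession or the placeholder "unknown" as missing.
    if not candidate or candidate.lower() == "unknown":
        candidate = (locus or "").strip() or "unknown"

    # Number repeated IDs so downstream FASTA headers stay distinct.
    count = seen.get(candidate, 0) + 1
    seen[candidate] = count
    return candidate if count == 1 else f"{candidate}_{count}"


# Example records as (ACCESSION, LOCUS) pairs; the last two have neither value.
seen = {}
records = [("", "Contig_1"), ("unknown", "Contig_2"), ("", ""), ("", "")]
ids = [choose_region_id(acc, loc, seen) for acc, loc in records]
print(ids)
```

With either option, the region ID in column 9 and the matching FASTA header would be distinct for each contig. The suffix scheme is only one possible way to number duplicates, and it does not guard against a record that is already named with the same suffix.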

rbutleriii, Jul 17 '16 05:07

The second option (using the LOCUS) may be easier to implement. Will have to see if this can be done prior to the next release or not.

cjfields, Jul 18 '16 18:07

Sounds Good.

rbutleriii, Jul 18 '16 19:07