The DDBJ/ENA/GenBank accession number change

Open heikkil opened this issue 6 years ago • 2 comments

https://ncbiinsights.ncbi.nlm.nih.gov/2018/12/03/adapting-flatfile-parsers-genbank-new-accession-formats/

"the LOCUS line, includes the “Locus Name” (usually identical to the accession number), which may now grow to as long as 20 characters."

"See section 3.4.4 of the GenBank release notes for examples of how the LOCUS line might change." https://ftp.ncbi.nlm.nih.gov/genbank/gbrel.txt

From our internal testing, it appears BioPython and BioPerl properly handle most of the examples shown in section 3.4.4, and only have issues with the last theoretical examples where the sequence length no longer ends at position 40. We do recommend adjusting code to accommodate those theoretical examples for future-proofing.
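One way to accommodate LOCUS lines whose sequence length no longer ends at position 40 is to split on whitespace instead of reading fixed columns. The following is a minimal Python sketch of that idea, not the approach BioPerl or BioPython actually use. The function name `parse_locus_line` and the sample line are hypothetical, and the sketch assumes the locus name contains no spaces and is separated from the length by at least one space.

```python
def parse_locus_line(line):
    """Parse a GenBank LOCUS line by whitespace-separated tokens, not columns.

    Hypothetical helper for illustration only.
    """
    tokens = line.split()
    if not tokens or tokens[0] != "LOCUS":
        raise ValueError("not a LOCUS line: %r" % line)

    # Look for the unit token ("bp" for nucleotide, "aa" for protein).
    # The token just before it is the sequence length, wherever it falls.
    # Starting at index 3 guarantees tokens[1] is the name and the length
    # is a separate token after it.
    for i in range(3, len(tokens)):
        if tokens[i] in ("bp", "aa") and tokens[i - 1].isdigit():
            return {
                "name": tokens[1],
                "length": int(tokens[i - 1]),
                "unit": tokens[i],
                # Molecule type, topology, division, date, left unparsed.
                "rest": tokens[i + 1:],
            }
    raise ValueError("no sequence length found in LOCUS line: %r" % line)


# Made-up line with a 15-character locus name; not copied from gbrel.txt.
sample = "LOCUS       AAAAAA020000001 1234567 bp    DNA     linear   BCT 04-DEC-2018"
record = parse_locus_line(sample)
```

Because the length is found relative to the `bp`/`aa` token, a longer locus name only shifts the columns and does not change the result.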

https://ncbiinsights.ncbi.nlm.nih.gov/2018/09/19/genbank-expanded-accession-formats/

https://ftp.ncbi.nlm.nih.gov/genbank/gbrel.txt

1.4 Upcoming Changes

1.4.1 Changes to nucleotide and protein accession formats

By the end of 2018 the INSDC members plan to expand this format, using a six-letter Project Code prefix, two-digit Assembly-Version number, followed by 7, 8, or 9 digits. An example of such an accession is AAAAAA020000001.
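For checking whether a parser accepts the expanded format, the description quoted above can be turned into a pattern. This is a rough Python sketch built only from that description. The names `EXPANDED_ACCESSION` and `parse_expanded_accession` are hypothetical, and restricting the prefix to uppercase ASCII letters is an assumption based on the example accession.

```python
import re

# Six-letter Project Code, two-digit Assembly-Version, then 7 to 9 digits.
# Uppercase-only letters are an assumption, not stated in the release notes
# excerpt above.
EXPANDED_ACCESSION = re.compile(r"([A-Z]{6})(\d{2})(\d{7,9})")


def parse_expanded_accession(accession):
    """Split an expanded-format accession into its parts, or return None."""
    match = EXPANDED_ACCESSION.fullmatch(accession)
    if match is None:
        return None
    project_code, assembly_version, serial = match.groups()
    return {
        "project_code": project_code,
        "assembly_version": int(assembly_version),
        "serial": serial,  # kept as a string to preserve leading zeros
    }


parts = parse_expanded_accession("AAAAAA020000001")
```

Accessions in other formats return `None` here, so the sketch only helps recognise the new form and is not a general accession validator.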

heikkil, Dec 04 '18 10:12

@heikkil just came here to add the same thing 😄

cjfields, Dec 04 '18 19:12

Cross reference https://github.com/biopython/biopython/issues/1870

peterjc, Dec 05 '18 17:12