
Error: Accession exceeds supported length

Open nick-youngblut opened this issue 5 years ago • 7 comments

I'd like to use diamond makedb on a custom taxonomy created from the GTDB (via gtdb_to_taxdump). With diamond v0.9.30.131, I get the error: Error: Accession exceeds supported length. The GTDB accessions include a prefix (e.g., GB_GCA_002778965.1), which is likely causing the issue.

For now, I'll just strip off the prefixes in the fasta, acc2taxid, and names.dmp files. It would be great if future versions of diamond allowed for such prefixes in the accessions.

nick-youngblut avatar Mar 31 '20 11:03 nick-youngblut

Hi Nick, I'm aware of this issue and I'll try to support a dynamic length in future versions. For now, you can also change the max length by editing src/data/taxonomy.h:45 and setting enum { max_accesion_len = 14 }; to a higher value.
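Before rebuilding or renaming anything, it can help to see which accessions are actually over the limit. A minimal Python sketch follows; it assumes the value 14 quoted above is an inclusive character limit, which is not confirmed here, and `MAX_ACCESSION_LEN` and `too_long` are hypothetical names used only for illustration.

```python
# Limit quoted in this thread for diamond v0.9.30.131.
# Assumption: accessions of exactly this length are still accepted.
MAX_ACCESSION_LEN = 14


def too_long(accessions, max_len=MAX_ACCESSION_LEN):
    """Return the accessions whose length exceeds max_len."""
    return [acc for acc in accessions if len(acc) > max_len]


if __name__ == "__main__":
    # The prefixed GTDB accession has 18 characters, the stripped one 14.
    print(too_long(["GB_GCA_002778965.1", "GCA002778965.1"]))
```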

bbuchfink avatar Mar 31 '20 12:03 bbuchfink

Thanks for letting me know how to get around the issue! I've already converted all of the accessions (e.g., GB_GCA_002778965.1 to GCA002778965.1), which works for diamond makedb v0.9.30.131.
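A conversion of this kind could look roughly like the Python sketch below. It is not the code actually used in this thread: the function name `shorten_accession` is hypothetical, and the pattern assumes GTDB accessions of the form `GB_GCA_<digits>.<version>` or `RS_GCF_<digits>.<version>`. Anything that does not match is returned unchanged.

```python
import re

# Assumed GTDB accession layout: a GB_/RS_ prefix, then GCA_/GCF_, then digits.version.
_GTDB_ACC = re.compile(r"^(?:GB|RS)_(GC[AF])_(\d+\.\d+)$")


def shorten_accession(acc):
    """Drop the GB_/RS_ prefix and the inner underscore, e.g. GB_GCA_1.1 -> GCA1.1."""
    match = _GTDB_ACC.match(acc)
    if match is None:
        return acc  # leave unrecognised accessions untouched
    return match.group(1) + match.group(2)


if __name__ == "__main__":
    print(shorten_accession("GB_GCA_002778965.1"))
```

The same mapping has to be applied consistently to the fasta headers, the acc2taxid file, and names.dmp, otherwise the accessions no longer line up.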

nick-youngblut avatar Mar 31 '20 12:03 nick-youngblut

diamond makedb is stating that my taxonomy includes a lot of "no rank" nodes:

[...]
Accession mappings = 24706
Loading taxonomy nodes...  [0.608s]
Loading taxonomy names...  [0.517s]
Loaded taxonomy names for 182187 taxon ids.
Writing taxon id lists...  [17.373s]
82930832 sequences mapped to taxonomy, 82930832 total mappings.
Building taxonomy nodes...  [0.001s]
180131 taxonomy nodes processed.
Number of nodes assigned to rank:
no rank           140636
superkingdom      2
kingdom           0
subkingdom        0
superphylum       0
phylum            151
subphylum         0
superclass        0
class             152
subclass          0
infraclass        0
cohort            0
subcohort         0
superorder        0
order             158
suborder          0
infraorder        0
parvorder         0
superfamily       0
family            159
subfamily         0
tribe             0
subtribe          0
genus             169
subgenus          0
section           0
subsection        0
series            0
species group     0
species subgroup  0
species           170
subspecies        38534
varietas          0
forma             0

However, most of the nodes in my nodes.dmp file are "subspecies" (n=145904). Any ideas on how to troubleshoot this? Does diamond expect anything special in the names.dmp file? My custom names.dmp file is rather minimal.
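For comparison with the table above, the ranks in nodes.dmp can be tallied independently of diamond. This is a minimal Python sketch that assumes the usual NCBI taxdump layout (fields separated by `|`, rank in the third column); `count_ranks` is a hypothetical helper name.

```python
from collections import Counter


def count_ranks(lines):
    """Count the rank column (third '|'-separated field) of nodes.dmp-style lines."""
    counts = Counter()
    for line in lines:
        fields = [field.strip() for field in line.split("|")]
        if len(fields) >= 3:  # skip blank or malformed lines
            counts[fields[2]] += 1
    return counts


if __name__ == "__main__":
    # Tiny made-up sample; for a real file, pass an open file handle instead.
    sample = [
        "1\t|\t1\t|\tno rank\t|\n",
        "2\t|\t1\t|\tsuperkingdom\t|\n",
        "3\t|\t2\t|\tsubspecies\t|\n",
        "4\t|\t2\t|\tsubspecies\t|\n",
    ]
    for rank, n in count_ranks(sample).most_common():
        print(rank, n)
```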

nick-youngblut avatar Mar 31 '20 14:03 nick-youngblut

If you make the files available to me, I can look into it.

bbuchfink avatar Mar 31 '20 14:03 bbuchfink

I've copied the files to /tmp/global2/nyoungblut/gtdb_diamond_db/. I believe that you have read access. The nodes/names dump files were created with my simple code in gtdb_to_taxdump. The screenlog.0 file shows the log for my diamond makedb job. Thanks!!

nick-youngblut avatar Mar 31 '20 14:03 nick-youngblut

This happens because the nodes.dmp is implicitly assumed to be sorted on the taxid. I will remove this restriction in a future release, but for now you can simply sort your file.
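One way to do the sort, sketched in Python: it assumes the first `|`-separated field of every line is an integer taxid, as in the NCBI taxdump layout, and the names `sort_dmp_lines` and `sort_dmp_file` are hypothetical. The sort is numeric, not lexical, so taxid 10 comes after taxid 2.

```python
def sort_dmp_lines(lines):
    """Return dmp lines ordered by the integer taxid in the first column."""
    return sorted(lines, key=lambda line: int(line.split("|", 1)[0].strip()))


def sort_dmp_file(src, dst):
    """Read a .dmp file, drop blank lines, and write it back sorted by taxid."""
    with open(src) as handle:
        lines = [line for line in handle if line.strip()]
    with open(dst, "w") as out:
        out.writelines(sort_dmp_lines(lines))


if __name__ == "__main__":
    # Made-up unsorted sample in nodes.dmp layout.
    unsorted = [
        "10\t|\t1\t|\tsubspecies\t|\n",
        "2\t|\t1\t|\tphylum\t|\n",
        "1\t|\t1\t|\tno rank\t|\n",
    ]
    print("".join(sort_dmp_lines(unsorted)), end="")
```

Python's `sorted` is stable, so lines that share a taxid (as can happen in names.dmp) keep their original relative order.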

bbuchfink avatar Apr 01 '20 15:04 bbuchfink

That worked! Thanks for looking into the problem! I've changed my gtdb_to_taxdump code to order nodes.dmp (and names.dmp) by taxID, so no worries about changing diamond.

nick-youngblut avatar Apr 02 '20 06:04 nick-youngblut