Failed building databases using GTDB-taxdump taxonomy files

Open emilhaegglund opened this issue 2 years ago • 2 comments

I was trying to build a database using the taxonomy files from gtdb-taxdump, but it failed while reading names.dmp with the following message:

zcat  gtdb_proteomes/* | diamond makedb --db gtdb --taxonnames gtdb-taxdump/R207/names.dmp --taxonnodes gtdb_data/gtdb-taxdump/R207/nodes.dmp --taxonmap gtdb.protein.taxid.map

diamond v2.0.15.153 (C) Max Planck Society for the Advancement of Science
Documentation, support and updates available at http://www.diamondsearch.org
Please cite: http://dx.doi.org/10.1038/s41592-021-01101-x Nature Methods (2021)

#CPU threads: 32
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
Input file parameter (--in) is missing. Input will be read from stdin.
Opening the database file...  [0s]
Loading sequences...  [0.772s]
Masking sequences...  [0.157s]
Writing sequences...  [0.035s]
Writing accessions...  [0.072s]
Hashing sequences...  [0.013s]
Loading sequences...  [0s]
Writing trailer...  [0.003s]
Loading taxonomy nodes...  [28.213s]
Loading taxonomy names...  [78.105s]
Failed to allocate sufficient memory. Please refer to the manual for instructions on memory usage.

Here is an example of the names.dmp from gtdb-taxdump:

head -20 gtdb-taxdump/R207/names.dmp
1	|	root	|		|	scientific name	|
13926	|	001393675	|		|	scientific name	|
14375	|	RUG14239 sp902797145	|		|	scientific name	|
17689	|	001423155	|		|	scientific name	|
20514	|	018334475	|		|	scientific name	|
23859	|	013185635	|		|	scientific name	|
34402	|	002214285	|		|	scientific name	|
38289	|	001509495	|		|	scientific name	|
66445	|	009903045	|		|	scientific name	|
74747	|	000419015	|		|	scientific name	|
78978	|	014222245	|		|	scientific name	|
85313	|	001742655	|		|	scientific name	|
88808	|	E44-bin52 sp004375875	|		|	scientific name	|
121310	|	001585965	|		|	scientific name	|
138721	|	VXYK01	|		|	scientific name	|
147972	|	007121265	|		|	scientific name	|
151528	|	007830495	|		|	scientific name	|
157756	|	003411905	|		|	scientific name	|
160336	|	002878095	|		|	scientific name	|
173955	|	001247185	|		|	scientific name	|

Do you have any idea why this could be happening? I haven't had any problems building databases with the NCBI-taxdumps.

Best regards, Emil Hägglund

emilhaegglund avatar Sep 19 '22 12:09 emilhaegglund

The taxids used in these files are > 2^31, which is not supported at the moment. I'll see what I can do about this.
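To check whether a given names.dmp is affected, something like the following Python sketch may help. It scans names.dmp-style lines and collects every taxid at or above a threshold. The function name `find_large_taxids` and the constant `TAXID_LIMIT` are hypothetical, and the exact boundary (2^31 versus 2^31 - 1) is an assumption based only on the comment above, not on DIAMOND's source.

```python
from typing import Iterable, List, Tuple

# Threshold taken from the comment in this thread; the exact boundary
# that DIAMOND enforces is an assumption.
TAXID_LIMIT = 2**31


def find_large_taxids(
    lines: Iterable[str], limit: int = TAXID_LIMIT
) -> List[Tuple[int, str]]:
    """Return (taxid, name) pairs whose taxid is >= limit.

    Expects names.dmp-style lines: fields separated by '|',
    with the taxid in the first field and the name in the second.
    """
    hits = []
    for line in lines:
        fields = [field.strip() for field in line.split("|")]
        # Skip blank or malformed lines instead of failing on them.
        if len(fields) < 2 or not fields[0].isdigit():
            continue
        taxid = int(fields[0])
        if taxid >= limit:
            hits.append((taxid, fields[1]))
    return hits
```

To use it, open the file and pass the handle, for example `find_large_taxids(open("gtdb-taxdump/R207/names.dmp"))`. A non-empty result would suggest the file contains taxids beyond the range described above.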

bbuchfink avatar Sep 19 '22 12:09 bbuchfink

Ah, I suspected it was something like this. Now I know the cause of the error. Thanks for the quick reply!

emilhaegglund avatar Sep 19 '22 17:09 emilhaegglund