
Questions regarding diamond database incompatibilities and updates

Open bernt-matthias opened this issue 1 month ago • 9 comments

I got the error:

Opening the database... Error: Options require taxonomy information included in the database. Please use the respective options to build this information into the database when running diamond makedb: taxonomy names information (--taxonnames option), taxonomy ranks information (database needs to be built with diamond version >= 0.9.30

I built all the diamond databases for our institute with 0.9.21 or older (clearly I should have used a more recent version). The release notes for 0.9.3x contain a note (https://github.com/bbuchfink/diamond/releases/tag/v0.9.36), and I'm wondering whether it also affects 0.9.21. My DBs range from database format version 2 (Diamond build 123) to database format version 3 (Diamond build 162).

Would it be safe (in terms of reproducibility) to rerun database creation with a recent version? I still have the FASTA and taxonomy data.

Since I'm also a maintainer of the diamond Galaxy tools, I was wondering whether there is a compatibility matrix of the database format versions / Diamond builds used to generate the DBs and the diamond versions that can use them.

bernt-matthias avatar Nov 12 '25 13:11 bernt-matthias

Should be safe to rebuild, yes. As long as the sequences are in the same order, it should not affect results in any way.
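
If it helps, here is a minimal Python sketch for checking the "same order" condition before a rebuild. It hashes the headers and sequences of a FASTA file in file order, so two inputs with identical records in identical order give the same digest. The function names are made up for illustration, and including the header lines in the digest is an assumption on my part (only the sequence order is mentioned above); line wrapping and letter case are ignored.

```python
import hashlib


def fasta_order_digest(path):
    """Return (record count, SHA-256 hex digest) over headers and sequences in file order."""
    digest = hashlib.sha256()
    count = 0
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                count += 1
                # Delimit each header so record boundaries are part of the digest.
                digest.update(b"\n" + line.encode("utf-8") + b"\n")
            else:
                # Ignore line wrapping and letter case within a sequence.
                digest.update(line.upper().encode("utf-8"))
    return count, digest.hexdigest()


def same_order(path_a, path_b):
    """True if both FASTA files hold the same records in the same order."""
    return fasta_order_digest(path_a) == fasta_order_digest(path_b)
```

Calling same_order("old_input.fasta", "new_input.fasta") (hypothetical file names) on the FASTA used for the old database and the one used for the rebuild would flag any reordering.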

This is what I have ready: https://github.com/bbuchfink/diamond/wiki/5.-Advanced-topics#database-format-versions

I also plan to deprecate the .dmnd format in favor of BLAST databases, since taxonomy is now also supported. I don't plan to remove support for the format from the code.

bbuchfink avatar Nov 12 '25 14:11 bbuchfink

Thanks a lot for the prompt answer. What is the oldest supported BLAST DB version?

bernt-matthias avatar Nov 12 '25 15:11 bernt-matthias

v4 should be OK; I have never heard of earlier versions still being used.

bbuchfink avatar Nov 12 '25 15:11 bbuchfink

Cool - then I will try to implement blast DB support in the Galaxy tools.

I tried (using diamond 2.1.16 and BLAST 2.17.0) to create a custom BLAST DB using

makeblastdb -in db.fasta -dbtype prot -parse_seqids -taxid_map taxid

Both

diamond blastp --db database --query 'protein.fasta' --no-self-hits --taxonlist 5524211

and

diamond blastp --db database --query 'protein.fasta' --no-self-hits --outfmt '6' qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore scovhsp sskingdoms skingdoms sphylums cigar

result in Error: std::exception

Any idea what I'm doing wrong?

If I try with swissprot (obtained with update_blastdb.pl --decompress swissprot)

diamond blastp --db swissprot --query '/tmp/tmp6ui6ez_w/files/0/6/1/dataset_06195bff-b8d2-4dae-a8a6-a6ebf68f2681.dat' --no-self-hits   --outfmt '6' qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore scovhsp sskingdoms skingdoms sphylums cigar

gives me:

Error: Taxonomy rank information (nodes.dmp) is missing in search path (/tmp/tmp6ui6ez_w/job_working_directory/000/2/working). Download and extract this file in the database directory: https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.zip

The files used are these: https://github.com/galaxyproject/tools-iuc/blob/main/tools/diamond/test-data/db.fasta https://github.com/galaxyproject/tools-iuc/blob/main/tools/diamond/test-data/protein.fasta

bernt-matthias avatar Nov 15 '25 11:11 bernt-matthias

Can you also provide your taxid map so I can try to reproduce this?

The second error can be fixed by doing as the message says. The information is not contained in the binary files by NCBI, so you need an extra download for this.
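
For scripted setups, the download-and-extract step can be sketched in Python roughly as below. The URL is the one from the error message; the function names are hypothetical, and the sketch assumes that nodes.dmp sits at the top level of the archive and that placing it next to the BLAST database files is sufficient. If diamond asks for further files, they can be added to members.

```python
import os
import urllib.request
import zipfile

# URL taken from the diamond error message.
TAXDUMP_URL = "https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.zip"


def extract_taxdump(zip_path, db_dir, members=("nodes.dmp",)):
    """Extract the selected taxdump members from a local zip into db_dir."""
    os.makedirs(db_dir, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        available = set(archive.namelist())
        for name in members:
            if name not in available:
                raise FileNotFoundError(f"{name} not found in {zip_path}")
            archive.extract(name, path=db_dir)
            extracted.append(os.path.join(db_dir, name))
    return extracted


def fetch_taxdump(db_dir, url=TAXDUMP_URL, members=("nodes.dmp",)):
    """Download the taxdump archive into db_dir, then extract the selected members."""
    os.makedirs(db_dir, exist_ok=True)
    zip_path = os.path.join(db_dir, "new_taxdump.zip")
    urllib.request.urlretrieve(url, zip_path)
    return extract_taxdump(zip_path, db_dir, members)
```

Calling fetch_taxdump on the directory that holds the BLAST database files should leave nodes.dmp next to them.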

bbuchfink avatar Nov 15 '25 13:11 bbuchfink

I just found out that the problem seems to be that database (the value passed to --db) is a symlink to the BLAST DB. If I use the full path, I get the error about the missing nodes.dmp (which I now know how to fix, thanks for this).
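
A possible workaround on the wrapper side, sketched in Python: resolve the symlink before building the command line, so that diamond receives the full path. The helper names are hypothetical, only flags from the commands above are used, and it is an assumption that resolving the link with os.path.realpath is equivalent to typing the full path by hand.

```python
import os


def resolve_db_path(db_path):
    """Follow symlinks and return the absolute path of the database."""
    return os.path.realpath(db_path)


def diamond_blastp_args(db_path, query_path):
    """Build an argument list for diamond blastp with a symlink-free --db value."""
    return [
        "diamond", "blastp",
        "--db", resolve_db_path(db_path),
        "--query", query_path,
        "--no-self-hits",
    ]
```

The resulting list can be handed to subprocess.run without going through a shell.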

Just in case:

> Can you also provide your taxid map so I can try to reproduce this?

https://gist.github.com/bernt-matthias/a239161e5783c45ba85213bcf8d0c5b1

bernt-matthias avatar Nov 16 '25 11:11 bernt-matthias

Do you have any suggestions for building small versions of the NCBI taxdb files (.bti / .btd)? I have a small list of taxids and I was hoping to construct minimal testdata.

bernt-matthias avatar Nov 20 '25 18:11 bernt-matthias

got this from chatgpt: https://gist.github.com/bbuchfink/66df4adf1e3642d3412a51396aaa0d71

bbuchfink avatar Nov 21 '25 07:11 bbuchfink

Thanks. I experimented with ChatGPT without much luck. Then I wrote an email to NCBI and got an answer:

The docs are here: https://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/lxr/source/src/objtools/blast/seqdb_reader/isam_files.txt

They also shared small example files. I could also share them here.

bernt-matthias avatar Nov 25 '25 13:11 bernt-matthias