Questions regarding diamond database incompatibilities and updates
I got the error:
Opening the database... Error: Options require taxonomy information included in the database. Please use the respective options to build this information into the database when running diamond makedb: taxonomy names information (--taxonnames option), taxonomy ranks information (database needs to be built with diamond version >= 0.9.30
I built all the diamond databases for our institute with 0.9.21 or older (clearly I should have used a more recent version). The release notes for 0.9.3x contain a note (https://github.com/bbuchfink/diamond/releases/tag/v0.9.36), and I'm wondering whether it also affects 0.9.21. My DBs range from database format version 2 (Diamond build 123) to database format version 3 (Diamond build 162).
Would it be safe (in terms of reproducibility) to rerun database creation with a recent version? I still have the FASTA and taxonomy data.
Since I also maintain the diamond Galaxy tools: is there a compatibility matrix that relates the database format versions / Diamond builds used to generate a DB to the diamond versions that can read it?
Yes, it should be safe to rebuild. As long as the sequences are in the same order, it should not affect the results in any way.
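To check the "same order" condition before rebuilding, one option is to compare the record IDs of the old and the new FASTA input. This is only a minimal sketch, assuming a POSIX shell with sed; the helper names fasta_ids and same_fasta_order are made up for illustration and are not part of diamond. It compares only the IDs (the first word of each header line), not the sequences themselves:

```shell
# Hypothetical helpers, not part of diamond.

# Print the ID of every FASTA record: header lines only,
# leading '>' stripped, everything after the first whitespace dropped.
fasta_ids() {
    sed -n 's/^>\([^[:space:]]*\).*/\1/p' "$1"
}

# Succeed (exit status 0) if both files list the same IDs in the same order.
same_fasta_order() {
    [ "$(fasta_ids "$1")" = "$(fasta_ids "$2")" ]
}
```

It could be used as `same_fasta_order old.fasta new.fasta && echo "order unchanged"`, where both file names are placeholders.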
This is what I have ready: https://github.com/bbuchfink/diamond/wiki/5.-Advanced-topics#database-format-versions
I also plan to deprecate the .dmnd format in favor of BLAST databases, since taxonomy is now also supported. I don't plan to remove support for the format from the code.
Thanks a lot for the prompt answer. What is the oldest BLAST DB version that is supported?
v4 should be OK; I have never heard of earlier versions still being used.
Cool - then I will try to implement blast DB support in the Galaxy tools.
I tried (using diamond 2.1.16 and BLAST 2.17.0) to create a custom BLAST DB using:
makeblastdb -in db.fasta -dbtype prot -parse_seqids -taxid_map taxid
Both
diamond blastp --db database --query 'protein.fasta' --no-self-hits --taxonlist 5524211
and
diamond blastp --db database --query 'protein.fasta' --no-self-hits --outfmt '6' qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore scovhsp sskingdoms skingdoms sphylums cigar
result in Error: std::exception
Any idea what I'm doing wrong?
If I try with swissprot (obtained with update_blastdb.pl --decompress swissprot)
diamond blastp --db swissprot --query '/tmp/tmp6ui6ez_w/files/0/6/1/dataset_06195bff-b8d2-4dae-a8a6-a6ebf68f2681.dat' --no-self-hits --outfmt '6' qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore scovhsp sskingdoms skingdoms sphylums cigar
gives me:
Error: Taxonomy rank information (nodes.dmp) is missing in search path (/tmp/tmp6ui6ez_w/job_working_directory/000/2/working). Download and extract this file in the database directory: https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.zip
The files used are these: https://github.com/galaxyproject/tools-iuc/blob/main/tools/diamond/test-data/db.fasta https://github.com/galaxyproject/tools-iuc/blob/main/tools/diamond/test-data/protein.fasta
Can you also provide your taxid map so I can try to reproduce this?
The second error can be fixed by doing as the message says. This information is not contained in the binary files provided by NCBI, so it needs an extra download.
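For scripted setups, the download step from the error message could look roughly like the following. This is a sketch under assumptions: the helper names has_nodes_dmp and fetch_nodes_dmp are made up, curl and unzip are assumed to be installed, and only nodes.dmp is extracted because that is the only file the error message names.

```shell
# Hypothetical helpers, not part of diamond or BLAST.

# Succeed if nodes.dmp is present in the given database directory.
has_nodes_dmp() {
    [ -f "$1/nodes.dmp" ]
}

# Download the NCBI taxdump archive named in the error message and
# extract only nodes.dmp into the database directory (needs curl and unzip).
fetch_nodes_dmp() {
    url=https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.zip
    tmp=$(mktemp -d) || return 1
    curl -fsSL -o "$tmp/new_taxdump.zip" "$url" &&
        unzip -o "$tmp/new_taxdump.zip" nodes.dmp -d "$1"
    status=$?
    rm -rf "$tmp"
    return "$status"
}
```

A call such as `has_nodes_dmp /path/to/blastdb || fetch_nodes_dmp /path/to/blastdb` (the path is a placeholder) would then only download when the file is missing.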
Just found out that the first problem seems to be that "database" is a symlink to the BLAST DB. If I use the full path instead, I get the error about the missing nodes.dmp (which I now know how to fix, thanks for this).
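As a workaround until symlinked paths work, the link can be resolved before it is passed to diamond. A minimal sketch, assuming GNU coreutils readlink and assuming the symlink target is exactly the path that diamond should receive; resolve_db_path is a made-up helper name:

```shell
# Hypothetical helper, not part of diamond.
# Print the absolute path with all symlinks resolved
# (readlink -f as provided by GNU coreutils).
resolve_db_path() {
    readlink -f "$1"
}
```

The search from above would then become something like `diamond blastp --db "$(resolve_db_path database)" --query protein.fasta --no-self-hits`.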
Just in case, in reply to "Can you also provide your taxid map so I can try to reproduce this?":
https://gist.github.com/bernt-matthias/a239161e5783c45ba85213bcf8d0c5b1
Do you have any suggestions for building small versions of the NCBI taxdb files (.bti / .btd)? I have a short list of taxids and was hoping to construct minimal test data from it.
Got this from ChatGPT: https://gist.github.com/bbuchfink/66df4adf1e3642d3412a51396aaa0d71
Thanks. I experimented with ChatGPT without much luck. Then I wrote an email to NCBI and got an answer:
The docs are here: https://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/lxr/source/src/objtools/blast/seqdb_reader/isam_files.txt
They also shared small example files, which I could share here as well.