Error after making Diamond db of IMG/VR dataset
Hello, and thanks for developing Diamond!
I am preparing a Diamond database from IMG/VR data. I have already created the names and nodes files, as well as the taxid.map file.
I have attached the name and node files for your reference. The taxid.map file is quite large, but I have included the head of it in the links below: names.dmp.txt nodes.dmp.txt head_id2tax.txt
The makedb command I ran:
diamond makedb --in ../IMGVR_all_proteins.faa --db IMGVR_Taxoned --taxonmap example/taxdump/taxid-h.map --taxonnodes example/taxdump/nodes.dmp --taxonnames example/taxdump/names.dmp -p 40
The end of Diamond's stdout:
Loading taxonomy names... [0.021s]
Loaded taxonomy names for 0 taxon ids.
Loading taxonomy mapping file... [355.924s]
Joining accession mapping... [82.928s]
Writing taxon id list... [2.292s]
Building taxonomy nodes... [11.996s]
2147258865 taxonomy nodes processed.
Number of nodes assigned to rank:
no rank 2147251888
superkingdom 6
kingdom 10
subkingdom 0
superphylum 0
phylum 17
subphylum 0
superclass 39
class 0
subclass 0
infraclass 0
cohort 0
subcohort 0
superorder 0
order 64
suborder 0
infraorder 0
parvorder 0
superfamily 0
family 206
subfamily 0
tribe 0
subtribe 0
genus 2116
subgenus 0
section 0
subsection 0
series 0
species group 0
species subgroup 0
species 4519
subspecies 0
varietas 0
forma 0
strain 0
biotype 0
clade 0
forma specialis 0
genotype 0
isolate 0
morph 0
pathogroup 0
serogroup 0
serotype 0
subvariety 0
Closing the input file... [0s]
Closing the database file... [0.015s]
Database sequences 220799163
Database letters 49459660621
Accessions in database 220799163
Entries in accession to taxid file 216984561
Database accessions mapped to taxid 0
Database sequences mapped to taxid 0
Database hash 58748cfe915c91e69a43a88c27aa3e8b
Total time 2155s
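For anyone checking their own run, the counts in this output can be pulled out programmatically. The following is a minimal Python sketch, assuming the summary lines look like the ones pasted above ("label, then a number"); the function names `parse_summary` and `find_zero_counts` are hypothetical helpers, not part of Diamond.

```python
import re

# A few lines copied from the makedb output above, used as sample input.
SAMPLE = """\
Loaded taxonomy names for 0 taxon ids.
Database sequences 220799163
Accessions in database 220799163
Entries in accession to taxid file 216984561
Database accessions mapped to taxid 0
Database sequences mapped to taxid 0
"""

# Counts that indicate a taxonomy problem when they are zero.
WATCHED = (
    "Taxonomy names loaded",
    "Database accessions mapped to taxid",
    "Database sequences mapped to taxid",
)


def parse_summary(log_text):
    """Collect 'label number' lines from a makedb log into a dict."""
    stats = {}
    for raw in log_text.splitlines():
        line = raw.strip()
        # Generic summary line: free-text label followed by an integer.
        m = re.match(r"^(.+?)\s+(\d+)$", line)
        if m:
            stats[m.group(1)] = int(m.group(2))
        # The names line puts the number mid-sentence, so handle it separately.
        m = re.match(r"^Loaded taxonomy names for (\d+) taxon ids\.$", line)
        if m:
            stats["Taxonomy names loaded"] = int(m.group(1))
    return stats


def find_zero_counts(stats):
    """Return the watched labels whose count is zero."""
    return [key for key in WATCHED if stats.get(key) == 0]


if __name__ == "__main__":
    for label in find_zero_counts(parse_summary(SAMPLE)):
        print("zero count:", label)
```

On the output above, all three watched counts are zero even though the mapping file has 216984561 entries, which points at the accessions not being matched rather than at an empty mapping file.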
I think the issue could be that I have a large number of sequences for each taxonomy id, and it seems that Diamond has identified (or indexed) only one of them. For example:
accession.version taxid
IMGVR_UViG_638276111_000001|638276111|638297712 541518477
IMGVR_UViG_638276111_000001|638276111|638297713 541518477
IMGVR_UViG_638276111_000001|638276111|638297714 541518477
IMGVR_UViG_638276111_000001|638276111|638297715 541518477
IMGVR_UViG_638276111_000001|638276111|638297716 541518477
IMGVR_UViG_638276111_000001|638276111|638297717 541518477
IMGVR_UViG_638276111_000001|638276111|638297718 541518477
IMGVR_UViG_638276111_000001|638276111|638297719 541518477
IMGVR_UViG_638276111_000001|638276111|638297720 541518477
IMGVR_UViG_638276111_000001|638276111|638297721 541518477
IMGVR_UViG_638276111_000001|638276111|638297722 541518477
IMGVR_UViG_638276111_000001|638276111|638297723 541518477
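One way to narrow this down is to check whether the accessions in the FASTA headers literally appear in the first column of the mapping file. Below is a rough Python sketch of that check. It assumes the mapping file has the two-column layout shown above and compares the first whitespace-delimited token of each FASTA header by exact string equality. Diamond's own header parsing may differ from this, so a full match here does not guarantee that makedb will map the sequences. The helper names are hypothetical.

```python
def fasta_ids(lines):
    """Yield the first whitespace-delimited token of each FASTA header line."""
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                yield fields[0]


def map_accessions(lines):
    """Yield the first column of the mapping file, skipping the header row."""
    for i, line in enumerate(lines):
        fields = line.split()
        if not fields:
            continue
        if i == 0 and fields[0] == "accession.version":
            continue
        yield fields[0]


def match_report(fasta_lines, map_lines):
    """Return (number of FASTA ids, number found in the mapping file)."""
    mapped = set(map_accessions(map_lines))
    total = 0
    hits = 0
    for acc in fasta_ids(fasta_lines):
        total += 1
        if acc in mapped:
            hits += 1
    return total, hits


if __name__ == "__main__":
    # Small in-memory sample; the last FASTA id is made up to show a miss.
    sample_map = [
        "accession.version taxid",
        "IMGVR_UViG_638276111_000001|638276111|638297712 541518477",
        "IMGVR_UViG_638276111_000001|638276111|638297713 541518477",
    ]
    sample_fasta = [
        ">IMGVR_UViG_638276111_000001|638276111|638297712",
        "MKT",
        ">IMGVR_UViG_638276111_000001|638276111|638297713",
        "MSA",
        ">not_in_map_example",
        "MLL",
    ]
    total, hits = match_report(sample_fasta, sample_map)
    print(f"{hits} of {total} FASTA ids found in the mapping file")
```

Because the sketch holds every accession in a set, it is better run on the head of the mapping file and the matching head of the FASTA file than on all 216984561 entries.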
Hope there is some way to fix it! I would greatly appreciate your assistance if you have any suggestions or solutions.
Update: I've explained the reason and solution here, in case it might solve someone else's issue.
Good luck, NP
Sorry for this unfortunate issue. This behavior was implemented to handle old NCBI headers. I think at least a warning message should be given in these cases.
The latest release now prints a warning message about this when you run makedb.