gtdb_to_taxdump
KeyError: 'Cannot find GCA003697015.1 accession in names.dmp'
Hi @nick-youngblut,
When I try to build the GTDB database using r207, I get:
gtdb_to_diamond.py -o gtdb gtdb_proteins_aa_reps_r207.tar.gz taxdump/names.dmp taxdump/nodes.dmp
2023-08-23 13:53:35,547 - Read nodes.dmp file: taxdump/nodes.dmp
2023-08-23 13:53:35,813 - File written: gtdb/nodes.dmp
2023-08-23 13:53:35,813 - Reading dumpfile: taxdump/names.dmp
2023-08-23 13:53:37,103 - File written: gtdb/names.dmp
2023-08-23 13:53:37,103 - No. of accession<=>taxID pairs: 398700
2023-08-23 13:53:37,104 - Extracting tarball: gtdb_proteins_aa_reps_r207.tar.gz
2023-08-23 13:53:37,104 - Extracting to: gtdb_to_diamond_TMP
2023-08-23 14:13:05,126 - No. of .faa(.gz) files: 65703
2023-08-23 14:13:05,150 - Creating accession2taxid table...
Traceback (most recent call last):
File "/usr/users/bheimbu/mambaforge/bin/gtdb_to_diamond.py", line 79, in accession2taxid
taxID = names_dmp[accession]
KeyError: 'GCA003697015.1'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/users/bheimbu/mambaforge/bin/gtdb_to_diamond.py", line 146, in <module>
main(args)
File "/usr/users/bheimbu/mambaforge/bin/gtdb_to_diamond.py", line 135, in main
accession2taxid(names_dmp, faa_files, args.outdir)
File "/usr/users/bheimbu/mambaforge/bin/gtdb_to_diamond.py", line 82, in accession2taxid
raise KeyError(msg.format(accession))
KeyError: 'Cannot find GCA003697015.1 accession in names.dmp'
I've downloaded names.dmp and nodes.dmp from here and gtdb_proteins_aa_reps_r207.tar.gz from here.
Any help is highly appreciated,
Bastian
I've changed the KeyError to a warning, which should show whether many accessions fail to overlap between the tarball and the dmp files, or whether it is just GCA003697015.1. Run the command again and see how many warnings you get.
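To count the warnings rather than scroll through them, a small helper along these lines could be used. It is only a sketch: it assumes the script's output was redirected to a file, and that the warning lines have the form "WARNING: Cannot find <accession> accession in names.dmp". The function name missing_accessions and the log filename are hypothetical.

```python
import re

# Assumed warning format: "... - WARNING: Cannot find <accession> accession in names.dmp"
WARN_RE = re.compile(r"WARNING: Cannot find (\S+) accession in names\.dmp")


def missing_accessions(log_lines):
    """Return the accessions named in 'Cannot find ...' warnings, in log order."""
    found = []
    for line in log_lines:
        match = WARN_RE.search(line)
        if match:
            found.append(match.group(1))
    return found


if __name__ == "__main__":
    import os

    log_path = "gtdb_to_diamond.log"  # hypothetical filename
    if os.path.exists(log_path):
        with open(log_path) as handle:
            missing = missing_accessions(handle)
        print(f"{len(missing)} accessions not found; first few: {missing[:5]}")
```

Comparing the count against the number of .faa(.gz) files reported in the log shows whether the problem affects a few genomes or all of them.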
After updating gtdb_to_diamond.py, I get the following error:
gtdb_to_diamond.py -o gtdb_vers207 gtdb_proteins_aa_reps_r207.tar.gz taxdump/names.dmp taxdump/nodes.dmp
2023-08-24 09:28:58,061 - Read nodes.dmp file: taxdump/nodes.dmp
2023-08-24 09:28:58,616 - File written: gtdb_vers207/nodes.dmp
2023-08-24 09:28:58,616 - Reading dumpfile: taxdump/names.dmp
2023-08-24 09:29:01,492 - File written: gtdb_vers207/names.dmp
2023-08-24 09:29:01,492 - No. of accession<=>taxID pairs: 398700
2023-08-24 09:29:01,493 - Extracting tarball: gtdb_proteins_aa_reps_r207.tar.gz
2023-08-24 09:29:01,493 - Extracting to: gtdb_to_diamond_TMP
2023-08-24 10:06:43,630 - No. of .faa(.gz) files: 65703
2023-08-24 10:06:43,675 - Creating accession2taxid table...
2023-08-24 10:06:43,676 - WARNING: Cannot find GCA003697015.1 accession in names.dmp
Traceback (most recent call last):
File "/usr/users/bheimbu/mambaforge/bin/gtdb_to_diamond.py", line 146, in <module>
main(args)
File "/usr/users/bheimbu/mambaforge/bin/gtdb_to_diamond.py", line 135, in main
accession2taxid(names_dmp, faa_files, args.outdir)
File "/usr/users/bheimbu/mambaforge/bin/gtdb_to_diamond.py", line 84, in accession2taxid
line = [acc_base, accession, str(taxID), '']
UnboundLocalError: local variable 'taxID' referenced before assignment
Cheers,
Bastian
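The UnboundLocalError in the traceback above is what happens when the failed lookup only logs a warning and execution then falls through to the line that uses taxID, which was never assigned for that accession. A minimal sketch of a lookup that warns and then skips the accession is shown here. It is not the script's actual code: the function name accession2taxid_rows is hypothetical, names_dmp is assumed to be a dict mapping accession to taxID, and deriving acc_base by dropping the version suffix is an assumption.

```python
import logging


def accession2taxid_rows(names_dmp, accessions):
    """Build accession2taxid rows, warning about and skipping unknown accessions.

    names_dmp is assumed to be a dict of accession -> taxID.
    Returns (rows, missing) so the caller can see how many lookups failed.
    """
    rows = []
    missing = []
    for accession in accessions:
        try:
            taxID = names_dmp[accession]
        except KeyError:
            logging.warning("Cannot find %s accession in names.dmp", accession)
            missing.append(accession)
            continue  # skip this accession so taxID is never used unassigned
        # Assumption: acc_base is the accession without its ".N" version suffix
        acc_base = accession.rsplit(".", 1)[0]
        rows.append([acc_base, accession, str(taxID), ""])
    return rows, missing
```

Returning the list of missing accessions alongside the rows makes it easy to report a single summary count at the end instead of one warning per genome.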
I've changed your code (here is the adjusted python script) and it now runs. However, every accession number in accession2taxid.tsv is assigned Not found; that is, gtdb_to_diamond.py reports a warning for every accession number, e.g. Cannot find GCA001315985.1 accession in names.dmp. So there must be something wrong with the nodes.dmp file, right?
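Since every accession fails, and the traceback shows the lookup going through names_dmp, a systematic mismatch between the accession format in names.dmp and the one derived from the .faa file names seems worth ruling out. The accessions in the warnings (e.g. GCA001315985.1) lack the underscore that NCBI assembly accessions normally carry (GCA_001315985.1). The sketch below compares the two sets before and after a simple normalization. It is a diagnostic under assumptions, not part of gtdb_to_taxdump: the function names are hypothetical, and which prefixes or separators actually appear in your names.dmp needs to be checked against the file itself.

```python
def normalize(accession):
    """Reduce an accession to a comparable form.

    Assumed variants: an optional 'RS_' or 'GB_' prefix and an optional
    underscore after GCA/GCF, e.g. 'GB_GCA_003697015.1' -> 'GCA003697015.1'.
    """
    accession = accession.strip()
    for prefix in ("RS_", "GB_"):
        if accession.startswith(prefix):
            accession = accession[len(prefix):]
    return accession.replace("_", "").upper()


def overlap_report(dmp_keys, faa_accessions):
    """Count how many tarball accessions match names.dmp keys, exactly and normalized."""
    dmp_exact = set(dmp_keys)
    dmp_normalized = {normalize(key) for key in dmp_exact}
    faa_accessions = list(faa_accessions)
    return {
        "total": len(faa_accessions),
        "exact": sum(acc in dmp_exact for acc in faa_accessions),
        "normalized": sum(normalize(acc) in dmp_normalized for acc in faa_accessions),
    }
```

If "exact" is zero but "normalized" is close to "total", the two files use different accession formats; if both are near zero, the names.dmp and the tarball may not describe the same set of genomes.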