KEMET
add_taxonomy_from_gtdb-tk.py - help!
I am trying to run this script, but it keeps returning: "The genomes.instruction file has been updated with 0 genome(s) taxonomy indications, using '.fasta' extension". Could you please tell me if there is anything I can do to fix it?
Hello!
To reply properly I'd need a little more information, such as:
- which specific command line did you use?
- what is the format of your input file(s)? For example, did you use a compatible version of gtdb-tk and/or the GTDB database?
Best, Matteo
Hi Matteo,
I installed KEMET on a UNIX system through conda and ran the script add_taxonomy_from_gtdb-tk.py. I ran my genomes through the "classify microbes with GTDB-Tk-v2 3.2" workflow available on Kbase. The output files from that were used to run the gtdb to ncbi majority vote script, which gave me a .tsv file containing the ID, the GTDB classification and the NCBI classification. I made sure that the sample/ID names are the same in the .tsv file and in the genomes.instruction file before running the add taxonomy script.
Hope this helps
Thank you!
Thanks for the extra details!
I've only tested the script with input obtained from the gtdb-tk command line, so a difference could arise from that.
The same goes for the gtdb-to-ncbi script, which depends on a specific version of the GTDB database. So far, the add_taxonomy_from_gtdb-tk.py script has worked with the 2022 "GTDB R07-RS207" release and the 2022 NCBI taxonomy.
I can't rule out that major changes in taxonomy have happened since then (I remember some changes regarding Firmicutes to Bacillota, maybe?). This would require fixing the correspondence from NCBI to KEGG BRITE taxonomy.
Otherwise, my suspicion would be the file extension of your genome/MAG files (whether .fasta, .fa or .fna), since the script in question requires it and it is specified through the -f argument when running it.
Best regards, Matteo
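To double-check the extension point above, here is a minimal sketch that counts the extensions actually present in a genome folder, so the value passed to -f can be compared against them. The folder name "genomes" and both helper functions are hypothetical and not part of KEMET.

```python
from collections import Counter
from pathlib import Path


def count_extensions(filenames):
    """Count how many file names end with each extension (lower-cased)."""
    return Counter(Path(name).suffix.lower() for name in filenames)


def list_genome_files(folder):
    """Return the names of regular files in a folder (hypothetical helper)."""
    return sorted(p.name for p in Path(folder).iterdir() if p.is_file())


if __name__ == "__main__":
    # "genomes" is a placeholder folder name; replace it with your own.
    folder = Path("genomes")
    if folder.is_dir():
        counts = count_extensions(list_genome_files(folder))
        for ext, n in counts.most_common():
            print(f"{ext or '(no extension)'}\t{n}")
```

If more than one extension shows up, or the only one differs from the value given to -f, that alone could explain a run that updates 0 genomes.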
Hi Matteo,
Thank you!
The file extensions and names match in the genomes.instruction file and the output file from GTDB. I downloaded the metadata files for r207, ran the gtdb to ncbi script, and used its output file to run add_taxonomy, and it worked. However, when I ran kemet.py I ran into an error:
File "kemet.py", line 781, in taxonomy_filter
for line in v[i_start+1:]:
UnboundLocalError: local variable 'i_start' referenced before assignment
Could you kindly guide me with this error?
Hi Matteo, Thank you! The file extensions and names match in the genomes.instruction file and the output file from GTDB. I downloaded the metadata files for r207 and ran the gtdb to ncbi script and used the output file from that to run the add_taxonomy and it worked.
Nice to know! Could you specify what you did precisely? This could serve as a temporary fix until I modify a few things 🙃
I've now seen that KEGG BRITE was updated to reflect the changes in the NCBI taxonomy, as expected, so it will take a couple of checks to bring the add_taxonomy script up to date.
However, when i ran the kemet.py code i ran into an error File "kemet.py", line 781, in taxonomy_filter for line in v[i_start+1:]: UnboundLocalError: local variable 'i_start' referenced before assignment
Could you kindly guide me with this error?
Do you have the KEGG BRITE file br08601.keg in your working folder? This should be downloaded automatically when setting the working folder via the set_kemet_working-directory.py script.
If it is missing, it needs to be there. If it is present, I'll need to check whether that file is still formatted the way it was in 2022.
Best regards, Matteo
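For readers hitting the same traceback, the following sketch shows how this kind of UnboundLocalError typically arises. It is not the code from kemet.py; the function names and toy data are made up for illustration. The only thing taken from the traceback is the pattern of slicing with `i_start + 1` after a search that may never assign `i_start`.

```python
def lines_after_marker(lines, marker):
    """Return the lines following the first line that contains `marker`.

    Fragile pattern: `i_start` is only assigned if the marker is found.
    """
    for i, line in enumerate(lines):
        if marker in line:
            i_start = i
            break
    # Raises UnboundLocalError if no line ever matched the marker.
    return lines[i_start + 1:]


def lines_after_marker_safe(lines, marker):
    """Same lookup, but with an explicit error when the marker is absent."""
    i_start = None
    for i, line in enumerate(lines):
        if marker in line:
            i_start = i
            break
    if i_start is None:
        raise ValueError(f"marker {marker!r} not found in input")
    return lines[i_start + 1:]


# Toy data only: a renamed taxon means the old name is never found.
toy = ["A Bacteria", "B Bacillota", "C Bacillus"]
print(lines_after_marker_safe(toy, "Bacillota"))
```

Under this reading, the error would mean that a name being searched for never appears in the searched text, which fits both a missing or reformatted br08601.keg and a renamed taxon.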
I'm also having a very similar issue. For some reason it won't match the genome names in genomes.instruction to the gtdb to ncbi output, even though the names are identical apart from the file extension (.fna). I've spent a while trying to debug the script, but I can't find a solution or any reason for the problem. It just runs, and if I hadn't added some print statements showing whether or not each name matched, I'd simply see nothing new written to the genomes.instruction file. It is clearly not matching the names correctly, even though they are identical. I can't figure it out.
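One way to test whether the names are really identical is to compare them outside the script. The sketch below is a generic diagnostic, not part of KEMET: it assumes you have already pulled the ID column out of each file into a list of strings (the file layouts are deliberately not assumed here), and both function names are hypothetical.

```python
def normalize(name, extension=".fna"):
    """Strip surrounding whitespace and a trailing extension from a name."""
    name = name.strip()
    if extension and name.endswith(extension):
        name = name[: -len(extension)]
    return name


def compare_ids(instruction_ids, taxonomy_ids, extension=".fna"):
    """Return (matched, only_in_instruction, only_in_taxonomy) as sorted lists."""
    left = {normalize(n, extension) for n in instruction_ids}
    right = {normalize(n, extension) for n in taxonomy_ids}
    return sorted(left & right), sorted(left - right), sorted(right - left)


# Toy names only; repr() exposes hidden characters such as "\r" or trailing spaces.
raw_instruction = ["bin1.fna\r", "bin2.fna "]
raw_taxonomy = ["bin1", "bin3"]
print([repr(n) for n in raw_instruction])
print(compare_ids(raw_instruction, raw_taxonomy))
```

If names only match after normalization, the raw files probably differ in something invisible, such as Windows line endings, trailing spaces, or the extension being present in one file but not the other.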