GTDB_Kraken
GTDB_Kraken copied to clipboard
Downloading GTDB database from NCBI using efetch OR using --sequence-dir option : doesn't work
Dear @lskatz and @hcdenbakker
I am trying to download the GTDB database and building it using Kraken2.
When I use the following code as mentioned on Github-
perl scripts/gtdbToTaxonomy.pl --infile data/gtdb.2018-12-10.tsv
I get the following error-
/flash/BourguignonU/DB/GTDB_Kraken/scripts/gtdbToTaxonomy.pl Loading GCF_000218425, d__Bacteria;p__Firmi...chnospiraceae;g__Dorea;s__Dorea scindens /flash/BourguignonU/DB/GTDB_Kraken/scripts/gtdbToTaxonomy.pl finding it (GCF_000218425)...from NCBI using esearch (GCF_000218425)... ------------- EXCEPTION: Bio::Root::Exception ------------- MSG: The sequence does not appear to be FASTA format (lacks a descriptor line '>') STACK: Error::throw STACK: Bio::Root::Root::throw /home/j/jigyasa-arora/perl5/lib/perl5/Bio/Root/Root.pm:449 STACK: Bio::SeqIO::fasta::next_seq /home/j/jigyasa-arora/perl5/lib/perl5/Bio/SeqIO/fasta.pm:137 STACK: main::main /flash/BourguignonU/DB/GTDB_Kraken/scripts/gtdbToTaxonomy.pl:127 STACK: /flash/BourguignonU/DB/GTDB_Kraken/scripts/gtdbToTaxonomy.pl:27 -----------------------------------------------------------
In order to make it work, I have to manually remove the sequence IDs that give this error. 436 of them do so.
Additionally, If I use GTDB genome fasta files already downloaded and renamed according to the help section, I get the same error but with a different genome file. It stops randomly. The first time, it generated 3000 genomes and gave an error, the second time it after 50 genomes in /library/ file.
code-
perl scripts/gtdbToTaxonomy.pl --sequence-dir /flash/BourguignonU/DB/gtdb_to_diamond/genomesfiles --infile data/gtdb.2018-12-10.tsv
After removing 436 sequence IDs, the first method seems to be working so far. Any suggestions?
Hi there. I am currently experiencing the same issue as above. The only workaround is to remove the sequence IDs that give the error. But this is painful as an error occurs every 20 or 30.
Any workaround? Many thanks in advance
Hello, I also experienced problems while using the script.
I changed the tool from entrez tools to "datasets", it can be downloaded from here:
https://www.ncbi.nlm.nih.gov/datasets/docs/v2/download-and-install
Don't forget to chmod +x datasets
!
I added my $NCBI_API_KEY="ABCDEFG";
in the main method of the script.
And this is the important change:
#system("esearch -db assembly -query $assemblyId | elink -target nuccore | efetch -format fasta > $filename.tmp");
system("./datasets --api-key $NCBI_API_KEY download genome accession $assemblyId && unzip -p ncbi_dataset.zip *.fna > $filename.tmp && rm ncbi_dataset.zip");
It works much better than eutils, I believe.
I also changed all logmsg
and logmsgBg
to logmsgLn
for better readability of the output.