GTDB_Kraken icon indicating copy to clipboard operation
GTDB_Kraken copied to clipboard

Downloading GTDB database from NCBI using efetch OR using --sequence-dir option : doesn't work

Open Jigyasa3 opened this issue 4 years ago • 2 comments

Dear @lskatz and @hcdenbakker

I am trying to download the GTDB database and building it using Kraken2.

When I use the following code as mentioned on Github- perl scripts/gtdbToTaxonomy.pl --infile data/gtdb.2018-12-10.tsv I get the following error- /flash/BourguignonU/DB/GTDB_Kraken/scripts/gtdbToTaxonomy.pl Loading GCF_000218425, d__Bacteria;p__Firmi...chnospiraceae;g__Dorea;s__Dorea scindens /flash/BourguignonU/DB/GTDB_Kraken/scripts/gtdbToTaxonomy.pl finding it (GCF_000218425)...from NCBI using esearch (GCF_000218425)... ------------- EXCEPTION: Bio::Root::Exception ------------- MSG: The sequence does not appear to be FASTA format (lacks a descriptor line '>') STACK: Error::throw STACK: Bio::Root::Root::throw /home/j/jigyasa-arora/perl5/lib/perl5/Bio/Root/Root.pm:449 STACK: Bio::SeqIO::fasta::next_seq /home/j/jigyasa-arora/perl5/lib/perl5/Bio/SeqIO/fasta.pm:137 STACK: main::main /flash/BourguignonU/DB/GTDB_Kraken/scripts/gtdbToTaxonomy.pl:127 STACK: /flash/BourguignonU/DB/GTDB_Kraken/scripts/gtdbToTaxonomy.pl:27 -----------------------------------------------------------

In order to make it work, I have to manually remove the sequence IDs that give this error. 436 of them do so.

Additionally, If I use GTDB genome fasta files already downloaded and renamed according to the help section, I get the same error but with a different genome file. It stops randomly. The first time, it generated 3000 genomes and gave an error, the second time it after 50 genomes in /library/ file.

code- perl scripts/gtdbToTaxonomy.pl --sequence-dir /flash/BourguignonU/DB/gtdb_to_diamond/genomesfiles --infile data/gtdb.2018-12-10.tsv

After removing 436 sequence IDs, the first method seems to be working so far. Any suggestions?

Jigyasa3 avatar Jan 13 '21 12:01 Jigyasa3

Hi there. I am currently experiencing the same issue as above. The only workaround is to remove the sequence IDs that give the error. But this is painful as an error occurs every 20 or 30.

Any workaround? Many thanks in advance

PeterCx avatar Apr 26 '21 13:04 PeterCx

Hello, I also experienced problems while using the script.

I changed the tool from entrez tools to "datasets", it can be downloaded from here: https://www.ncbi.nlm.nih.gov/datasets/docs/v2/download-and-install

Don't forget to chmod +x datasets!

I added my $NCBI_API_KEY="ABCDEFG"; in the main method of the script.

And this is the important change:

#system("esearch -db assembly -query $assemblyId | elink -target nuccore | efetch -format fasta > $filename.tmp");
system("./datasets --api-key $NCBI_API_KEY download genome accession $assemblyId && unzip -p ncbi_dataset.zip *.fna > $filename.tmp && rm ncbi_dataset.zip");

It works much better than eutils, I believe.

I also changed all logmsg and logmsgBg to logmsgLn for better readability of the output.

davidbio avatar Jun 27 '24 14:06 davidbio