krakenuniq
GTDB Build Error
Hi,
I'm trying to build a custom database from GTDB, using the data at the following URL: https://data.gtdb.ecogenomic.org/releases/release202/202.0/genomic_files_reps/ Specifically, I am using this file: gtdb_genomes_reps_r202.tar.gz
After extracting the archive, there are approximately 155 GB of genome FASTA files. Because I kept running out of memory, I am using the following parameters: --kmer-len 20 --minimizer-len 10 --jellyfish-hash-size 550000 --work-on-disk --threads 12
Everything was going fine until about 10 hours into the build. Does anyone have an idea what could be wrong? My log file, including the error, follows:
Found jellyfish v1.1.11
Kraken build set to minimize RAM usage.
Found 47894 sequence files (*.{fna,fa,ffn,fasta,fsa}) in the library directory.
Creating k-mer set (step 1 of 6)...
Using jellyfish
terminate called after throwing an instance of 'jellyfish::compacted_hash::ErrorReading'
what(): 'database_100915': File truncated
/home/psundar/.conda/envs/WGS/bin/build_db.sh: line 146: 11741 Aborted (core dumped) $JELLYFISH_BIN merge -o database.jdb.tmp database_*
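The log shows `jellyfish merge` aborting on one of the intermediate `database_*` files with "File truncated". One way to narrow this down is to look at the sizes of those intermediate files and at the free space on the volume that holds them. The following is a minimal Python sketch of such a check. It assumes the `database_*` files sit directly in the database directory, and it treats a zero-byte file as suspicious, which is only a heuristic: a partially written file with a non-zero size would not be flagged. The function names `inspect_intermediates` and `flag_empty` are hypothetical helpers, not part of KrakenUniq or jellyfish.

```python
import shutil
from pathlib import Path


def inspect_intermediates(db_dir, pattern="database_*"):
    """Return (free_bytes, sizes) for the intermediate files in db_dir.

    sizes is a list of (file_name, size_in_bytes) tuples, smallest first.
    The default pattern mirrors the glob passed to `jellyfish merge` in the log.
    """
    db_path = Path(db_dir)
    # Free space on the volume holding the database directory.
    free_bytes = shutil.disk_usage(db_path).free
    sizes = sorted(
        ((p.name, p.stat().st_size) for p in db_path.glob(pattern) if p.is_file()),
        key=lambda item: (item[1], item[0]),
    )
    return free_bytes, sizes


def flag_empty(sizes):
    """Heuristic only: report files whose size is zero bytes."""
    return [name for name, size in sizes if size == 0]


# Example usage (the directory name is an assumption):
#   free, sizes = inspect_intermediates("my_gtdb_db")
#   print(f"free space: {free / 1e9:.1f} GB")
#   print("smallest files:", sizes[:5])
#   print("empty files:", flag_empty(sizes))
```

If the volume is nearly full or some files are empty, that points toward a resource limit during the build rather than a problem with the input genomes, though this check alone cannot confirm the cause.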
I'd also suggest using a longer k-mer (25) and minimizer (14 or more). I don't think the shorter k-mer will save you much memory (maybe a little), and it will give you many false positives when you later run Kraken. That said, it looks like you probably just ran out of memory. I'm not sure what that jellyfish error implies, but memory is the most likely cause. That's a huge database you're trying to build. You might split it into two or more parts and try it that way.
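To make the suggestion of splitting the library concrete, here is a minimal Python sketch that divides a set of genome files into batches of roughly equal total size using a greedy largest-first assignment. The file extensions are taken from the build log above. The function names `library_sizes` and `split_by_size` are hypothetical, and how each batch is then built or queried is outside the scope of this sketch.

```python
import heapq
from pathlib import Path

# Extensions reported by the build log: *.{fna,fa,ffn,fasta,fsa}
FASTA_SUFFIXES = (".fna", ".fa", ".ffn", ".fasta", ".fsa")


def library_sizes(library_dir, suffixes=FASTA_SUFFIXES):
    """Map each sequence file under library_dir to its size in bytes."""
    return {
        str(p): p.stat().st_size
        for p in Path(library_dir).rglob("*")
        if p.is_file() and p.suffix in suffixes
    }


def split_by_size(sizes, n_parts):
    """Greedily assign files to n_parts batches, largest file first.

    Each file goes to the batch with the smallest running total, which keeps
    the batches roughly balanced. Returns a list of n_parts lists of names.
    """
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    # Min-heap of (running_total_bytes, batch_index).
    heap = [(0, i) for i in range(n_parts)]
    heapq.heapify(heap)
    parts = [[] for _ in range(n_parts)]
    for name, size in sorted(sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        total, idx = heapq.heappop(heap)
        parts[idx].append(name)
        heapq.heappush(heap, (total + size, idx))
    return parts


# Example usage (the directory name is an assumption):
#   batches = split_by_size(library_sizes("library"), n_parts=2)
#   for i, batch in enumerate(batches):
#       print(f"batch {i}: {len(batch)} files")
```

Balancing by file size rather than file count is a design choice here: genome FASTA files vary widely in size, so equal counts would not necessarily give equal memory pressure per build.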