bowtie2
Building index of NCBI's Refseq bacterial genomes
Hi there,
I'm trying to build a huge index of NCBI's RefSeq bacterial genomes, which is about 97 GB (in fna.gz format). I'm working on an HPC with 512 GB RAM, but the build always dies with an "out-of-memory" error. Is it possible to split the compressed FASTA file into smaller chunks, index them separately, and then concatenate the resulting index files at the end? Or is there another solution (use more RAM)?
Cheers Bastian
Hello,
There are a few options available to you:
1. bowtie2-build has a --packed mode that should reduce the memory footprint but is slower than the standard build.
2. Split the FASTA, build indexes with the resulting files, and run separate alignments against each index. N.B. indexes cannot be merged.
3. Use a node with more memory.
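For option 2, here is a rough sketch of how the splitting could be done in Python. It is only an illustration: the function names `split_fasta` and `build_commands`, the chunk file naming, and the per-chunk base budget are assumptions made for this example and are not part of bowtie2. It cuts only at record boundaries so no genome is split across chunks, and it only prepares the `bowtie2-build` command lines rather than running them.

```python
import gzip
from pathlib import Path


def split_fasta(fasta_path, out_dir, max_bases_per_chunk):
    """Split a (possibly gzipped) FASTA into chunks without breaking records.

    Returns the list of chunk paths that were written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    opener = gzip.open if str(fasta_path).endswith(".gz") else open

    chunk_paths = []
    out = None
    bases_in_chunk = 0

    with opener(fasta_path, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                # Only switch files at a header, once the current chunk
                # has reached its base budget.
                if out is None or bases_in_chunk >= max_bases_per_chunk:
                    if out is not None:
                        out.close()
                    path = out_dir / f"chunk_{len(chunk_paths):04d}.fna"
                    chunk_paths.append(path)
                    out = open(path, "w")
                    bases_in_chunk = 0
            elif out is None:
                continue  # skip anything before the first header
            else:
                bases_in_chunk += len(line.strip())
            out.write(line)

    if out is not None:
        out.close()
    return chunk_paths


def build_commands(chunk_paths, packed=False):
    """Prepare one bowtie2-build command (as an argv list) per chunk."""
    commands = []
    for path in chunk_paths:
        cmd = ["bowtie2-build"]
        if packed:
            cmd.append("--packed")
        # The index basename is the chunk path without its extension.
        cmd += [str(path), str(path.with_suffix(""))]
        commands.append(cmd)
    return commands
```

Each prepared command can then be passed to `subprocess.run(cmd, check=True)`, or submitted as a separate job on the cluster. Reads would have to be aligned against every resulting index separately, since the indexes cannot be merged.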
Thanks for your reply. I'll try to use --packed and see how it goes.
Cheers Bastian
Hello, I guess there may be many genomes in the NCBI collection that are very similar or even identical. How does bowtie perform read assignment in this case? Does it randomly assign reads to one sequence from the pool of identical sequences, or does it distribute the reads equally across all identical sequences? Thank you. I know that in an ideal scenario it is good to dereplicate the genomes first.
Hello,
bowtie2 will choose the alignment with the highest alignment score. If several alignments tie for the highest score, it will choose one of them at random. I hope this helps.
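To make the rule concrete, here is a toy sketch in Python of "best score wins, ties broken at random". It is not bowtie2's actual code; the function name `pick_alignment` and the (reference, score) tuples are made up for illustration.

```python
import random


def pick_alignment(candidates, rng=random):
    """Return one (reference, score) pair with the highest score.

    If several candidates share the highest score, one is picked at random.
    """
    best = max(score for _, score in candidates)
    tied = [c for c in candidates if c[1] == best]
    return rng.choice(tied)


# A read that matches two identical genomes equally well and a third one worse.
hits = [("genome_A", 0), ("genome_B", 0), ("genome_C", -6)]
chosen = pick_alignment(hits, random.Random(1))
```

Under this rule each read with tied best hits is reported against only one of the tied references, so over many such reads the assignments are spread across the identical sequences by chance rather than split evenly by design.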