Building index of NCBI's Refseq bacterial genomes

Open bheimbu opened this issue 1 year ago • 4 comments

Hi there,

I'm trying to build a huge index of NCBI's RefSeq bacterial genomes, which is about 97 GB (in fna.gz format). I'm working on an HPC with 512 GB RAM, but the build always dies with an "out-of-memory" error. Is it possible to split the compressed FASTA file into smaller chunks, index them separately, and then concatenate the resulting index files at the end? Or is there another solution (use more RAM)?

Cheers Bastian

bheimbu avatar Apr 18 '23 09:04 bheimbu

Hello,

There are a few options available to you:

1. bowtie2-build has a --packed mode that should reduce the memory footprint but is slower than the standard build.
2. Split the FASTA, build an index from each of the resulting files, and run separate alignments against each index. N.B. indexes cannot be merged.
3. Use a node with more memory.
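For the splitting option, a minimal Python sketch is shown here. It assumes the input is a plain or gzip-compressed FASTA file; the function name `split_fasta`, the chunk file naming, and the `max_bases` budget are illustrative choices, not part of bowtie2. Chunks are only cut at record boundaries, so no sequence is split across files.

```python
import gzip
from pathlib import Path


def split_fasta(src, out_dir, max_bases, prefix="chunk"):
    """Split a FASTA file into chunks of roughly max_bases bases each.

    A new chunk is started only at a header line, so records are never
    split across files and a chunk may exceed max_bases by up to one record.
    Returns the list of chunk paths that were written.
    """
    src = Path(src)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Read gzip-compressed input transparently (e.g. *.fna.gz).
    opener = gzip.open if src.suffix == ".gz" else open

    paths = []
    out = None
    bases = 0
    with opener(src, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                # Open a new chunk once the current one has used its budget.
                if out is None or bases >= max_bases:
                    if out is not None:
                        out.close()
                    path = out_dir / f"{prefix}_{len(paths):03d}.fna"
                    paths.append(path)
                    out = open(path, "w")
                    bases = 0
            elif out is None:
                # Ignore any text that appears before the first header.
                continue
            else:
                bases += len(line.strip())
            out.write(line)
    if out is not None:
        out.close()
    return paths
```

Each resulting file (for example chunk_000.fna) could then be indexed on its own with bowtie2-build, and reads aligned against each index separately. A sensible value for `max_bases` depends on the memory available and would need to be found by trial.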

ch4rr0 avatar Apr 18 '23 14:04 ch4rr0

Thanks for your reply. I'll try --packed and see how it goes.

Cheers Bastian

bheimbu avatar Apr 20 '23 11:04 bheimbu

Hello, I guess the NCBI collection may contain many genomes that are very similar or possibly even identical. How does bowtie perform read assignment in this case? Does it randomly assign each read to one sequence from the pool of identical sequences, or does it distribute the reads equally across all identical sequences? Thank you. I know that in an ideal scenario it is good to dereplicate the genomes first.

JSSaini avatar Feb 21 '24 08:02 JSSaini

Hello,

bowtie2 will choose the alignment with the highest alignment score. If several alignments tie for the highest score, it will choose one of them at random. I hope this helps.
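As a toy illustration of that rule (not bowtie2's actual implementation), the selection can be modelled in Python as follows. The function `pick_alignment` and the (reference name, score) tuples are hypothetical, introduced only for this sketch.

```python
import random


def pick_alignment(candidates, rng=None):
    """Return one (reference, score) pair with the highest score.

    Ties for the highest score are broken uniformly at random, which
    models the behaviour described above in a simplified way.
    """
    rng = rng or random.Random()
    best = max(score for _, score in candidates)
    tied = [c for c in candidates if c[1] == best]
    return rng.choice(tied)


# Hypothetical example: two identical genomes give the same best score.
candidates = [("genome_A", 0), ("genome_B", 0), ("genome_C", -12)]
print(pick_alignment(candidates))
```

In this simplified model each read lands on exactly one of the tied references, so over many reads the counts spread across the identical sequences rather than being duplicated on all of them.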

ch4rr0 avatar Feb 27 '24 17:02 ch4rr0