bowtie2 Building index of NCBI's Refseq bacterial genomes

Open bheimbu opened this issue 2 years ago • 4 comments

Hi there,

I'm trying to build a huge index of NCBI's Refseq bacterial genomes, which is about 97 GB (in fna.gz format). I'm working on an HPC with 512 GB RAM, but the build always dies with an "out-of-memory" error. Is it possible to split the compressed FASTA file into smaller chunks, index them separately, and then concatenate the resulting index files at the end? Or is there another solution (use more RAM)?

Cheers Bastian

Apr 18 '23 09:04 bheimbu

Hello,

There are a few options available to you:

1. bowtie2-build has a --packed mode that should reduce the memory footprint but is slower than the standard build.
2. Split the FASTA, build indexes with the resulting files, and run separate alignments against each index. N.B. indexes cannot be merged.
3. Use a node with more memory.
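For illustration, option 2 could be sketched roughly as follows. This is only a minimal sketch, not something from the bowtie2 distribution: `split_fasta`, `build_commands`, the `chunk_NNNN.fna` file names, and the `max_bases` threshold are all hypothetical, and the script only prepares the `bowtie2-build` command lines rather than running them.

```python
import gzip
from pathlib import Path


def split_fasta(src, out_dir, max_bases):
    """Split a gzipped FASTA into chunk files of roughly max_bases each.

    Records are never split across chunks, so a chunk may exceed
    max_bases when a record pushes it over the limit.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunks = []
    out = None
    bases = 0
    with gzip.open(src, "rt") as fh:
        for line in fh:
            if line.startswith(">"):
                # Only switch files at a record boundary, once the chunk is full.
                if out is None or bases >= max_bases:
                    if out is not None:
                        out.close()
                    path = out_dir / f"chunk_{len(chunks):04d}.fna"  # hypothetical naming
                    chunks.append(path)
                    out = open(path, "w")
                    bases = 0
            elif out is not None:
                bases += len(line.strip())
            if out is not None:
                out.write(line)
    if out is not None:
        out.close()
    return chunks


def build_commands(chunks, packed=True):
    """Return one bowtie2-build command (as an argument list) per chunk."""
    cmds = []
    for path in chunks:
        cmd = ["bowtie2-build"]
        if packed:
            cmd.append("--packed")
        # The index basename is the chunk path without its .fna suffix.
        cmd += [str(path), str(path.with_suffix(""))]
        cmds.append(cmd)
    return cmds
```

Each command list can be passed to `subprocess.run` or submitted as a separate cluster job. Because every alignment run only sees its own chunk, the per-index results would still need to be compared afterwards (for example by alignment score) to find a read's best hit overall.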

Apr 18 '23 14:04 ch4rr0

Thanks for your reply. I'll try --packed and see how it goes.

Cheers Bastian

Apr 20 '23 11:04 bheimbu

Hello, I guess there may be many genomes in the NCBI collection that are very similar or even identical. How does bowtie assign reads in this case? Does it randomly assign each read to one sequence from the pool of identical sequences, or does it distribute the reads equally across all identical sequences? Thank you. I know that ideally the genomes should be dereplicated first.

Feb 21 '24 08:02 JSSaini

Hello,

bowtie2 will choose the alignment with the highest alignment score. If several alignments share that score, it will choose one of them at random. I hope this helps.
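As a toy illustration of that rule (not bowtie2's actual code), the selection can be modelled like this. The function `pick_alignment`, the genome names, and the scores are made up for the example.

```python
import random
from collections import Counter


def pick_alignment(candidates, rng):
    """Return the reference with the highest score, breaking ties at random.

    candidates is a list of (reference_name, alignment_score) pairs.
    """
    best = max(score for _, score in candidates)
    ties = [name for name, score in candidates if score == best]
    return rng.choice(ties)


# Hypothetical read that matches two identical genomes equally well
# and a third, more divergent genome less well.
candidates = [("genome_A", -2), ("genome_B", -2), ("genome_C", -14)]

rng = random.Random(0)
counts = Counter(pick_alignment(candidates, rng) for _ in range(10_000))
print(counts)
```

In this model a single read goes to exactly one of the tied references, while over many such reads the tied references each receive about the same share and the lower-scoring reference receives none.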

Feb 27 '24 17:02 ch4rr0
