UMICollapse

Java heap space

Open karlkashofer opened this issue 3 years ago • 4 comments

Running umicollapse on 200mio paired-end reads (400 reads total) runs out of Java heap space, even with -Xmx96G. Is that normal?

karlkashofer, Feb 08 '22 08:02

It should not fail with only 400 reads. Have you tried setting -Xms to a larger value? That is the initial heap size. What is the exact command you are running? Paired-end mode takes up more memory, but it shouldn't run out of memory for only 400 reads.

Daniel-Liu-c0deb0t, Feb 10 '22 22:02

Sorry, I meant 200mio paired-end reads, which is 400mio reads total.

karlkashofer, Apr 03 '22 12:04
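For a sense of scale, here is a rough back-of-the-envelope calculation using the numbers from this thread. It assumes, purely for illustration, that every read had to be resident in the heap at the same time; `heap_bytes_per_read` is a hypothetical helper, not part of UMICollapse.

```python
def heap_bytes_per_read(heap_gib, total_reads):
    """Average heap budget per read if all reads were in memory at once.

    The JVM interprets the G suffix in -Xmx as GiB (1024**3 bytes).
    """
    return heap_gib * 1024**3 / total_reads


# -Xmx96G and 400 million reads, as reported above.
budget = heap_bytes_per_read(96, 400_000_000)
print(f"{budget:.0f} bytes per read")
```

This gives a budget of roughly 258 bytes per read. For comparison, a 150 bp read (an assumed length; the thread does not state it) already carries 150 base characters and 150 quality characters before the read name, UMI, or any per-object overhead is counted.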

If you are using paired-end mode (--paired), it takes a lot of memory. This is because it has to make sure pairs of reads stay together during the deduplication process, which involves storing a lot of reads in memory. Potential workarounds are splitting the 200 million paired-end reads into smaller files and deduplicating each file separately, or not using paired-end mode (but then there might be pairs of reads where only one read of the pair is removed).

Daniel-Liu-c0deb0t, Apr 04 '22 15:04
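The splitting workaround can be pictured with a small toy sketch. It is not UMICollapse code: the tuple layout `(name, contig, pos, umi)` and the function `partition_by_contig` are assumptions made for illustration. The sketch groups reads by contig so that each group could be deduplicated on its own, and it reports read names whose two mates land on different contigs, because splitting by contig would separate those pairs.

```python
from collections import defaultdict


def partition_by_contig(reads):
    """Group (name, contig, pos, umi) tuples by contig.

    Returns (partitions, split_pairs):
      partitions  - dict mapping contig -> list of reads on that contig
      split_pairs - set of read names seen on more than one contig
    """
    partitions = defaultdict(list)
    contigs_by_name = defaultdict(set)
    for read in reads:
        name, contig, _pos, _umi = read
        partitions[contig].append(read)
        contigs_by_name[name].add(contig)
    split_pairs = {n for n, c in contigs_by_name.items() if len(c) > 1}
    return dict(partitions), split_pairs


# Pair "a" maps entirely to chr1; pair "b" has one mate on chr1, one on chr2.
reads = [
    ("a", "chr1", 100, "AAC"),
    ("a", "chr1", 250, "AAC"),
    ("b", "chr1", 100, "AAC"),
    ("b", "chr2", 50, "AAC"),
]
partitions, split_pairs = partition_by_contig(reads)
```

On real data the split would more likely be done on the sorted, indexed BAM file with a tool such as samtools rather than in Python. The toy version only shows the bookkeeping question that any split raises: what to do with pairs that straddle two pieces.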

Yes, I use --paired, as this is Illumina NovaSeq data from Agilent XT libraries (dual index and dual UMI). I don't really understand why --paired needs so much memory. In your paper you state "the reads at each unique alignment location are independently deduplicated based on the UMI sequences.", so my understanding is that it only needs to keep all reads at a single position in memory. I deduplicate WGS data, where there is hardly a position with more than 100 reads, so I really don't understand why it would require > 80GB of memory.

Thanks for your work, btw! :)

karlkashofer, Apr 05 '22 05:04
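The expectation in the last comment, that only the reads at one position need to be held at a time, can be illustrated with a toy streaming deduplicator. This is a minimal sketch under stated assumptions and not UMICollapse's implementation: reads are `(name, pos, umi)` tuples already sorted by position, duplicates are exact UMI matches at the same position, and the function `dedup_sorted` is hypothetical. In the paired variant, the decision made for the first mate of a pair is remembered by read name until the second mate shows up, which is the extra state that keeping pairs together requires in this model.

```python
def dedup_sorted(reads, paired=False):
    """Toy dedup over (name, pos, umi) tuples sorted by pos.

    Returns (kept, peak_pending): the kept reads in input order, and the
    largest number of read names remembered while waiting for a mate.
    """
    kept = []
    current_pos = None
    umis_here = set()   # UMIs already seen at the current position
    pending = {}        # read name -> keep/drop decision from the first mate
    peak_pending = 0
    for name, pos, umi in reads:
        if pos != current_pos:
            # New position: per-position state can be discarded.
            current_pos, umis_here = pos, set()
        if paired and name in pending:
            # Second mate: reuse the decision made for the first mate.
            if pending.pop(name):
                kept.append((name, pos, umi))
            continue
        keep = umi not in umis_here
        umis_here.add(umi)
        if keep:
            kept.append((name, pos, umi))
        if paired:
            pending[name] = keep
            peak_pending = max(peak_pending, len(pending))
    return kept, peak_pending


# Pairs "a" and "b" are duplicates at position 100; their mates map apart.
reads = [("a", 100, "AAC"), ("b", 100, "AAC"), ("a", 250, "AAC"), ("b", 300, "AAC")]
kept_single, _ = dedup_sorted(reads)
kept_paired, peak = dedup_sorted(reads, paired=True)
```

Without pairing, the mate of "b" at position 300 survives even though its partner at position 100 was removed, which is the half-removed pair mentioned earlier in the thread. With pairing, both reads of "b" are dropped, at the cost of the `pending` table. In this toy model that table only grows with the number of pairs whose second mate has not yet been seen, so its size depends on how far apart mates sit in the sorted input. Whether UMICollapse could be restructured along these lines is a question for the author.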