
subset large datasets for dataset-wide statistics

ChristopherBarrington opened this issue on Jan 5, 2023

Ran into a problem with DamageProfiler failing on two particularly large datasets, giving the following error:

DamageProfiler v0.4.9
Exception in thread "main" java.lang.OutOfMemoryError: Requested array size exceeds VM limit
        at org.apache.commons.math3.util.ResizableDoubleArray.expand(ResizableDoubleArray.java:697)
        at org.apache.commons.math3.util.ResizableDoubleArray.addElement(ResizableDoubleArray.java:442)
        at org.apache.commons.math3.stat.descriptive.DescriptiveStatistics.addValue(DescriptiveStatistics.java:171)
        at IO.OutputGenerator.computeSummaryMetrics(OutputGenerator.java:542)
        at calculations.StartCalculations.generateOutput(StartCalculations.java:357)
        at calculations.StartCalculations.start(StartCalculations.java:302)
        at RunDamageProfiler.main(RunDamageProfiler.java:37)

The input bam files are 44 and 53 GB, with 1,329,979,427 and 1,516,157,394 records respectively. I tried increasing the memory given to Java and using a newer version of Java (outside the eager 2.3.3 singularity image), but nothing worked. In hindsight this is presumably because DescriptiveStatistics buffers every value in a single Java double[], and Java arrays cannot exceed roughly 2^31 elements, so with ~1.5 billion records the array expansion fails regardless of how much heap is available.

After opening an issue on the DamageProfiler GitHub repository, I tried subsampling the bam and then running DamageProfiler on the subset. This worked: I reran just this process on the two subsampled datasets and then used -resume to continue the pipeline.

The following is what I would put into the .command.sh if I were to encounter this problem again.

# Keep a random 5% of reads (--subsample requires samtools >= 1.13)
samtools view --subsample-seed 0 --subsample 0.05 --threads 32 --bam --output dataset_rmdup.subset.bam dataset_rmdup.bam
samtools index dataset_rmdup.subset.bam

# Run DamageProfiler on the subset only
damageprofiler -Xmx250g -i dataset_rmdup.subset.bam -r canFam3.fa -l 100 -t 15 -o . -yaxis_damageplot 0.30

# Rename the DamageProfiler output directory so its name matches what the
# pipeline expects when resuming
mv dataset_rmdup.subset dataset_rmdup
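
With the outputs renamed in place, the rest of the pipeline can pick up from the cached results. For completeness, an illustrative resume invocation (the samplesheet path here is hypothetical):

nextflow run nf-core/eager -r 2.3.3 -profile singularity -resume --input input.tsv --fasta canFam3.fa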

Maybe something a bit cleverer and more generic than this could be considered, so that profiling processes don't consume excessive time/memory/CPU on extremely deeply sequenced datasets, where they are more likely to crash or be killed.
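
For illustration, a minimal sketch of what such a guard could look like, assuming a hypothetical MAX_READS cap (the variable names and cap value are made up, and --subsample requires samtools >= 1.13):

MAX_READS=100000000   # assumed cap on records fed to DamageProfiler; tune as needed
BAM=dataset_rmdup.bam

# Count records in the bam (a full pass; idxstats would be faster on an indexed bam)
TOTAL=$(samtools view -c --threads 8 "$BAM")

if [ "$TOTAL" -gt "$MAX_READS" ]; then
    # Fraction of reads to keep so the subset lands near the cap
    FRAC=$(awk -v t="$TOTAL" -v m="$MAX_READS" 'BEGIN { printf "%.4f", m / t }')
    samtools view --subsample-seed 0 --subsample "$FRAC" --threads 8 \
        --bam --output "${BAM%.bam}.subset.bam" "$BAM"
    samtools index "${BAM%.bam}.subset.bam"
    BAM="${BAM%.bam}.subset.bam"
fi

# Run DamageProfiler on $BAM as usual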

https://github.com/nf-core/eager/blob/bb32ae3b0110b9a26b791f73a5324828d849271a/main.nf#L2105
