CRISPResso2

high memory usage killing job


Describe the bug
Hi, I am trying to run CRISPRessoPooled on a fastq file with 50 million reads, and the job keeps getting killed for high memory usage. I gave the job 96 GB of memory, and it appears to have made it only about halfway through the file.

I am running CRISPResso version 2.0.42 with the following command:

CRISPRessoPooled -r1 output/fastq/trimmed/gene1_input_Donor2_R1_trimmed.fastq -r2 output/fastq/trimmed/gene1_input_Donor2_R2_trimmed.fastq -f amplicon_file.txt --min_bp_quality_or_N 30 --quantification_window_size 1 --output_folder output/crispresso/ --name gene1_input_Donor2

The process fails with the following error:

Exiting because a job execution failed. Look above for error message
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=37726020.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

I have attached the full run log: gene1_input_Donor2_crispresso.txt

jfreimer · Nov 06 '21

Hi @jfreimer,

Thanks for using CRISPResso2.

From your log, it appears that your first region has 8M unique sequences. In standard amplicon sequencing we don't expect to see this many unique sequences, so we cache the alignment and other per-read information; when the same read sequence shows up again we don't have to redo the alignment (the slow part). In your case, however, the cache is ineffective because each unique sequence is seen only about 3 times (25M reads, 8M unique), and memory blows up holding all of those cached alignments.
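
To make that concrete, here is a minimal sketch of the caching pattern in Python; the names (aln_cache, align, process_read) are illustrative, not CRISPResso2's actual internals:

```python
# Illustrative sketch of a per-sequence alignment cache; not
# CRISPResso2's real code, just the pattern described above.
aln_cache = {}  # maps read sequence -> alignment result

def align(seq: str, amplicon: str):
    # Stand-in for the expensive pairwise alignment (the slow part).
    return (seq, amplicon)

def process_read(seq: str, amplicon: str):
    # The slow path runs once per unique sequence; repeats hit the cache.
    if seq not in aln_cache:
        aln_cache[seq] = align(seq, amplicon)
    return aln_cache[seq]

# With ~25M reads but ~8M unique sequences, each entry is reused only
# ~3 times, so the dict grows to ~8M entries and memory balloons.
```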

You may try the following:

  1. If your reads contain unique sequence (e.g. barcodes or UMIs), try trimming it off and moving it into the fastq read id so that more reads can be cached (see the first sketch after this list).
  2. Break your original fastq up into smaller files (~10M reads each) and run each of them through CRISPRessoPooled separately, then use CRISPRessoAggregate to combine all the runs (see the second sketch after this list).
  3. Upgrade to a more recent version of CRISPResso; there may be some memory-efficiency gains from the move to Python 3 in version 2.2.
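
For option 1, here is a hedged sketch that moves a leading UMI into the read id, assuming the UMI is the first 8 bp of R1; UMI_LEN, the file names, and the header convention are all illustrative, and dedicated tools such as umi_tools extract do the same job:

```python
# A hedged sketch of option 1, assuming the UMI is the first 8 bp of R1.
# UMI_LEN, file names, and the header convention are illustrative; adapt
# them to your library design.
UMI_LEN = 8

def move_umi_to_header(fastq_in: str, fastq_out: str) -> None:
    with open(fastq_in) as fin, open(fastq_out, "w") as fout:
        while True:
            header = fin.readline().rstrip("\n")
            if not header:
                break  # end of file
            seq = fin.readline().rstrip("\n")
            plus = fin.readline().rstrip("\n")
            qual = fin.readline().rstrip("\n")
            umi = seq[:UMI_LEN]
            # Append the UMI to the read id so that identical inserts
            # collapse to a single cached sequence.
            name, _, rest = header.partition(" ")
            new_header = f"{name}_{umi}" + (f" {rest}" if rest else "")
            fout.write(f"{new_header}\n{seq[UMI_LEN:]}\n{plus}\n{qual[UMI_LEN:]}\n")

move_umi_to_header("gene1_input_Donor2_R1_trimmed.fastq",
                   "gene1_input_Donor2_R1_umi_in_header.fastq")
```

For paired-end data you would also need to apply the same renamed ids to R2 so the mates still pair up.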
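
For option 2, here is a minimal sketch that splits a fastq into chunks of roughly 10M reads (each fastq record is exactly 4 lines); READS_PER_CHUNK and the output naming are illustrative. Split R1 and R2 identically, run your CRISPRessoPooled command on each chunk, then combine the runs with CRISPRessoAggregate:

```python
# A minimal sketch of option 2: split a fastq into ~10M-read chunks.
# READS_PER_CHUNK and the naming scheme are illustrative.
READS_PER_CHUNK = 10_000_000

def split_fastq(fastq_in: str, prefix: str) -> None:
    out, reads, chunk = None, 0, 0
    with open(fastq_in) as fin:
        for i, line in enumerate(fin):
            # Start a new chunk at each record boundary once the current
            # chunk has READS_PER_CHUNK reads in it.
            if i % 4 == 0 and reads % READS_PER_CHUNK == 0:
                if out:
                    out.close()
                chunk += 1
                out = open(f"{prefix}.part{chunk}.fastq", "w")
            out.write(line)
            if i % 4 == 3:
                reads += 1  # a full 4-line record has been written
    if out:
        out.close()

split_fastq("gene1_input_Donor2_R1_trimmed.fastq", "gene1_R1")
split_fastq("gene1_input_Donor2_R2_trimmed.fastq", "gene1_R2")
```

On the command line, running split -l 40000000 on each fastq accomplishes the same chunking.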

I'm interested in the use case that generates so many unique reads; it may be that this isn't the right tool for analyzing your data. Feel free to email me at [email protected] if you'd rather not discuss it here. If this is amplicon sequencing data, I'd be interested in looking at ways to make it work. We could probably add a 'low memory' mode that doesn't cache alignments, so it would take longer but wouldn't run out of memory.

kclem · Nov 06 '21

We're closing this issue because it hasn't been updated recently. If the problem still exists, please reopen the issue and we'll look into it!

kclem · Apr 13 '23