modkit icon indicating copy to clipboard operation
modkit copied to clipboard

High memory usage of modkit extract full

Open Shians opened this issue 1 year ago • 8 comments

I have run dorado with m5C,inosine_m6A,pseU and want to extract the modifications into a table. However when I run modkit extract full it keeps being killed for being out of memory. I've bumped memory allocation to 256GB for a 7.8GB BAM file and it's still being killed. I am wondering if there's an obvious answer for why this happens, does modkit store the whole modification table in memory before writing it out?

Shians avatar Nov 27 '24 00:11 Shians

Hello @Shians,

The command should stream the output to either standard out or a file depending on what you've specified. The program tends to be write-bound, however, so what happens is unwritten records are buffered in memory. If you've run a model with C, A, and U modification calls (and A-mods are 3-class), you'll end up accumulating many records for each read so the buffered data can be quite large (as you've seen). There are some methods in the documentation for how to reduce the memory usage. I also generally recommend, when possible, to only extract the reads that you need to operate on. For example grab the reads for a one or a few transcripts at a time.

ArtRand avatar Nov 27 '24 16:11 ArtRand

Thanks for the quick response, that's helpful to know. I think I can get around this by operating on one chromosome at a time. Based on the memory usage tips in the documentation, I am trying to understand what the relevant options are.

  1. --queue-size refers to the number of reads in memory at a time, I don't think this helps me, my understanding is that it's the data from this reads that's overloading a separate write-queue?
  2. --ignore-index not entirely sure what the implication of a serial scan is.
  3. --interval-size this is the interval size used to retrieve reads? I can't see this argument in 0.4.1.

I think ideally the write buffer should be limited in size and block processes when they try to insert while full.

Shians avatar Dec 03 '24 04:12 Shians

Hello @Shians,

I think ideally the write buffer should be limited in size and block processes when they try to insert while full.

--queue-size is what you want, it does exactly this.

--ignore-index not entirely sure what the implication of a serial scan is.

The parallelism in Modkit will operate on sections of the genome in parallel (--interval-size sections). When you pass the --ignore-index flag it disables this behavior, it will stream the modBAM from the start to the end.

--interval-size this is the interval size used to retrieve reads? I can't see this argument in 0.4.1.

You need to use the "long help" (modkit extract calls --help) to see this option.

ArtRand avatar Dec 11 '24 16:12 ArtRand

Excuse my ignorance, but it doesn't "seem" possible that --queue-size is limiting the output buffer, because the default value is set at only 10,000. How does memory usage blow out to over 100GB in that instance? Surely each entry in the queue cannot be over 10MB in size. I am assuming each entry is just a row in the output table.

Shians avatar Jan 08 '25 00:01 Shians

Hello @Shians,

As per the help string:

  -q, --queue-size <QUEUE_SIZE>
          Number of reads that can be in memory at a time. Increasing this value
          will increase thread usage, at the cost of memory usage

          [default: 10000]

The items in the queue are the blocks of information for a read. So if reads are extremely long and have modification calls at nearly every position, the memory usage could expand rapidly. If the queue is full, however, it will block the program until the writer can empty the queue.

For what it's worth, I'm actively working on this routine and trying to both reduce the memory usage and improve the ergonomics of the output tables.

ArtRand avatar Jan 08 '25 14:01 ArtRand

Hello @ArtRand , I've stumbled on this issue while trying to extract calls from my sequencing for subsequent statistical analysis.

I can report that running the extract calls with queue size of 1, I'm running out of memory on my 32gb workstation (genome size is ca 4 megabases (bacterial genome)).

dist_modkit_v0.4.3/modkit extract calls --mapped --reference ../reference.fasta --bgzf -t 1 -q 1 --pass-only sorted.aligned.bam - > mod_calls.pass.tsv.gz

SchwarzMarek avatar Mar 17 '25 14:03 SchwarzMarek

Hello @SchwarzMarek,

Could you try adding --ignore-index like this?

time_format="#command: %C\nelapsed\tdata_area\ttotal_data\tmean_cpu\n%e\t%t\t%M\t%P\n"bash
/usr/bin/time -o ${outdir}/test_timing.txt -f "${time_format}" \
  dist_modkit_v0.4.3/modkit extract calls \
    --ignore-index \
    --mapped --reference ../reference.fasta \
    --bgzf -t 1 -q 1 --pass-only sorted.aligned.bam - > mod_calls.pass.tsv.gz

ArtRand avatar Mar 17 '25 15:03 ArtRand

@ArtRand Thank you for the advice, the --ignore-index flag fixes my issues.

SchwarzMarek avatar Mar 18 '25 08:03 SchwarzMarek