
High memory use when using Python and threads

Open cjw85 opened this issue 2 years ago • 10 comments

The program align.py uses mappy to align reads in Python using multiple worker threads. After loading the index, memory usage jumps up quickly to >20 GB and then continues to climb steadily through 40 GB and beyond.

This issue was first discovered in bonito and isolated to mappy. The data flow in the example mirrors that in bonito but reduced to using only Python stdlib functionality.

mappy: v2.24
pysam: v0.18 (just for optionally reading fastq inputs)
python: v3.8.6

Run the program, creating query sequences from the index on the fly:

python align.py GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta.mmi --threads 48

or using a directory containing *.fastq* files:

python align.py GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta.mmi --fastq_dir FAQ32498 --threads 48

The inputs I am using are available in the AWS S3 bucket at:

s3://ont-research/misc/mappy-mem/FAQ32498.tar
s3://ont-research/misc/mappy-mem/GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta.mmi

I've not fully ascertained if using lots of threads exacerbates the problem or simply makes the symptom apparent more quickly.
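
For reference, the data flow boils down to something like the following (a minimal sketch with hypothetical names, not the actual align.py; the preset and file names are placeholder assumptions):

import threading
from concurrent.futures import ThreadPoolExecutor

import mappy

# Shared index; each worker thread keeps one persistent ThreadBuffer and
# passes it to every aligner.map() call.
aligner = mappy.Aligner("GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta.mmi", preset="map-ont")
tls = threading.local()

def align_one(name, seq):
    if not hasattr(tls, "buf"):
        tls.buf = mappy.ThreadBuffer()  # created once per worker thread
    return name, list(aligner.map(seq, buf=tls.buf))

with ThreadPoolExecutor(max_workers=48) as pool:
    reads = mappy.fastx_read("reads.fastq")  # yields (name, seq, qual) tuples
    for name, hits in pool.map(lambda r: align_one(r[0], r[1]), reads):
        pass  # consume results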

cjw85 avatar Jan 05 '22 17:01 cjw85

@lh3 I might be able to spare some time to dig through the Cython (though @marcus1487 is more of a Cython person than me). valgrind gave me quite a bit of noise when I quickly ran it yesterday.

cjw85 avatar Jan 06 '22 10:01 cjw85

I've looked at this a little today. If I modify the program to not reuse the ThreadBuffer for each call to aligner.map() I don't observe such egregious memory use.

@lh3 Am I correct in thinking the minimap2 program does not use persistent mm_tbuf_t buffers for its entire lifetime? I'm starting to think this isn't a leak as such, but rather an expansion of a buffer within the mm_tbuf_t as pathological reads/alignments are processed, with the buffer not being shrunk afterwards?
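
For concreteness, the two patterns I'm comparing look roughly like this (a sketch, not the actual align.py; `aligner` is a mappy.Aligner and `seqs` is any iterable of sequences):

import mappy

# Pattern that shows the growth: one ThreadBuffer per worker, reused for every call.
def align_reused(aligner, seqs):
    buf = mappy.ThreadBuffer()  # one buffer per worker thread
    for seq in seqs:
        yield list(aligner.map(seq, buf=buf))  # buffer persists and grows across calls

# Pattern that avoids it: pass no buffer, so mappy creates a fresh one per call.
def align_fresh(aligner, seqs):
    for seq in seqs:
        yield list(aligner.map(seq))  # a temporary buffer is created internally and freed afterwards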

cjw85 avatar Jan 06 '22 18:01 cjw85

Sorry that I don't use python threads and I don't know how python threads handle global and thread-local memory. Anyway, a ThreadBuffer only grows and never shrinks, until it gets destroyed. It is intended to be used through the life span of a thread. Minimap2 allocates one ThreadBuffer inside a newly spawned thread and uses the same buffer for multiple reads the thread processes. Minimap2 deallocates the buffer towards the end of the thread.

lh3 avatar Jan 06 '22 19:01 lh3

Anyway, a ThreadBuffer only grows and never shrinks

Actually a ThreadBuffer may shrink. The following block means that if the size of the buffer is larger than opt->cap_kalloc (defaults to 1 GB in v2.24) or the largest memory block is over 256 MB, the thread buffer is reallocated.

https://github.com/lh3/minimap2/blob/06fedaadd0f88074bd68527e6e34634ffe21273e/map.c#L367-L378

lh3 avatar Jan 06 '22 19:01 lh3

After a bit of prodding from both myself and @jts, I'm fairly well convinced that the high memory use I've observed is simply an accumulation in the size of the thread buffer, and nothing untoward in Python or Cython. I do occasionally see sizeable (i.e. ~1 GB) deallocations.

If the example Python program is changed to periodically use a new ThreadBuffer in each thread, or not pass one to aligner.map() calls, memory use is more controlled. Both of us have also observed that when kalloc is disabled the example Python program does not have excessive memory use.
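
A sketch of the "new ThreadBuffer periodically" variant (the renewal interval of 10,000 reads here is arbitrary):

import mappy

def align_with_renewal(aligner, seqs, renew_every=10000):
    # Replace the per-thread buffer every `renew_every` reads so the memory
    # held by kalloc is released back periodically.
    buf = mappy.ThreadBuffer()
    for i, seq in enumerate(seqs, 1):
        yield list(aligner.map(seq, buf=buf))
        if i % renew_every == 0:
            buf = mappy.ThreadBuffer()  # old buffer is freed once garbage-collected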

The part that I am still perplexed by is why this happens in the Python program but not in minimap2 when applied to the same dataset. My theory is that it simply comes down to how work is distributed across the thread pools in the two cases, and therefore how often the allocation cap is hit and the thread buffer reset.

cjw85 avatar Jan 07 '22 20:01 cjw85

After studying things more, I'm relatively well satisfied that in a sense this is the intended behaviour of the code and not a bug per se. (I will change the title of this issue to reflect this.)

I have datasets where, aligning HG002 reads to GRCh38 using the minimap2 program (not mappy), I see runaway memory usage up to around 65 GB on top of a baseline of around 20 GB. I see there are various other issues reporting similar behaviour.

(using minimap2 v2.27: `minimap2 -t 64 -a -x map-ont grch38.fastq.gz reads.fastq.gz`)


If I disable the use of kalloc I see much more stable memory usage, and no loss in performance. This raises the question: when does the use of kalloc outperform vanilla malloc in minimap2?

cjw85 avatar Mar 14 '24 17:03 cjw85

Malloc performance is system-dependent. When minimap2 was developed in 2018, kalloc was giving a considerable performance improvement on our server over glibc (CentOS 6), musl and rpmalloc, and a minor improvement over tcmalloc and jemalloc. Similarly for bwa-mem, some users and I could observe a large performance increase with tcmalloc, but some other users didn't see this.

Minimap2 does frequent heap allocation per read and across threads, and allocators are usually sensitive to this pattern. It is safer to enable kalloc for consistent performance across systems. One thing I may try is to reset kalloc much more frequently, for example once per million query bases. The resetting logic is currently implemented here:

https://github.com/lh3/minimap2/blob/9b0ff2418c298f5b5d0df12b137896e5c3fb0ef4/map.c#L362-L373

Resetting for every read would look like:

/* destroy and re-create the per-thread kalloc pool, freeing all memory it holds */
if (b->km) {
    km_destroy(b->km);
    b->km = km_init();
}

lh3 avatar Mar 14 '24 17:03 lh3

Looking at the source code, I realized another way to control the kalloc resetting frequency is to set --cap-kalloc. It defaults to 1 GB, which is partly why the memory in your run peaked at ~1 GB per thread. You may set a smaller --cap-kalloc and see what happens.

lh3 avatar Mar 14 '24 17:03 lh3

For what it's worth, I'm using:

gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0

so nothing blazingly new, but not terribly crusty either. Maybe I'll spend my evening going into the weeds of glibc changes.

I've got a few experiments running, including setting --cap-kalloc smaller. I'd like to be comfortably below 32 GB as a baseline; the vanilla malloc test shows that's certainly possible (for this dataset at least).

By the way, I noticed that the HAVE_KALLOC define appears to apply to only some parts of the code; I don't know whether that was intentional or not.

cjw85 avatar Mar 14 '24 18:03 cjw85

Setting --cap-kalloc 100m --cap-sw-mem 50m (no particular reason for those choices, other than being smaller than the defaults) does provide more controlled memory usage as intended. The performance isn't noticeably worse so far with these settings.
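
For the record, that is the command from my earlier comment with the two caps added:

minimap2 -t 64 -a -x map-ont --cap-kalloc 100m --cap-sw-mem 50m grch38.fastq.gz reads.fastq.gz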

Tomorrow I may look at the Python code to see if it can be made to expose these options.

cjw85 avatar Mar 14 '24 22:03 cjw85