
MMapDirectory sometimes "leaks" 1000s of maps

Open mikemccand opened this issue 4 months ago • 49 comments

[I'm opening this for discussion ... I'm not sure we have to fix anything here, but I at least want to document the situation so if other Lucene users hit max maps limit, we can quickly explain why]

At Amazon product search, we've seen our indexer processes sometimes trip the OS hard limit (64 K, though modern Linuxes (Lini?) have increased to 1 MiB) of number of memory-mapped segments, killing the indexing process. Looking at the maps (cat /proc/<pid>/maps) it's clear we are leaking maps for _N.si files, e.g. the exact same _7.si file will be mapped 76 times, and same for _8.si and all other open segments.
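To see which files are mapped repeatedly, it can help to group the lines of the maps file by file name instead of only counting them. A minimal sketch, assuming the usual Linux /proc/<pid>/maps column layout (address, perms, offset, dev, inode, pathname); the class and method names are made up for illustration and are not part of Lucene:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MapCounter {

  /** Counts mappings per ".si" file name, given the lines of a /proc/<pid>/maps file. */
  public static Map<String, Integer> countSiMappings(List<String> mapsLines) {
    Map<String, Integer> counts = new TreeMap<>();
    for (String line : mapsLines) {
      // Columns: address perms offset dev inode [pathname]; limit 6 keeps paths with spaces whole
      String[] cols = line.trim().split("\\s+", 6);
      if (cols.length < 6) {
        continue; // anonymous mapping, no backing file
      }
      String path = cols[5];
      String deleted = " (deleted)";
      if (path.endsWith(deleted)) {
        // the kernel appends this suffix when the mapped file was unlinked
        path = path.substring(0, path.length() - deleted.length());
      }
      if (path.endsWith(".si")) {
        String name = path.substring(path.lastIndexOf('/') + 1);
        counts.merge(name, 1, Integer::sum);
      }
    }
    return counts;
  }

  public static void main(String[] args) throws IOException {
    // "self" inspects this JVM; pass a pid to inspect another process
    String pid = args.length > 0 ? args[0] : "self";
    List<String> lines = Files.readAllLines(Path.of("/proc/" + pid + "/maps"));
    countSiMappings(lines).forEach((name, n) -> System.out.println(n + "\t" + name));
  }
}
```

Running it against the pid of the indexer prints one line per .si file with its mapping count, which makes a file that is mapped many times stand out.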

I have a small test case that reproduces this leak (I'll attach it shortly), on 9.12.x, with Java 21 or 24. It does not reproduce on 10.x because we've changed _N.si files to open with IOContext.READONCE which turns off the Arena pooling (sets confined=true). To repro, you need an index that has at least one segment. Then you need to hold open a DirectoryReader (which creates an Arena for each segment's open files). Then, periodically, read the latest segments_N file, which opens and closes each segment's _N.si file, and those _N.si maps will be added to the still-open Arenas, not unmap'd until you close your reader or a segment is merged away.

This is not serious for us (Amazon product search) -- we have a workaround for 9.12.x (use NIOFSDirectory), or we can upgrade to 10.x. Maybe setting -Dorg.apache.lucene.store.MMapDirectory.sharedArenaMaxPermits to a smallish value (defaults to 1024) would work too, not sure.

Context: we added this cool Arena pooling to Lucene to amortize the sometimes highish JDK cost of unmap (which deopts top frames of all running threads to check that they are not accessing the virtual address space about to be unmap'd), which in turn was discovered by an upstream benchmark (thank you!!). https://github.com/apache/lucene/pull/13555 and https://github.com/apache/lucene/issues/13325 have more context. JDK-8335480, delivered in JDK 24, tries to reduce the JDK deopt cost of unmap.

Our usage was somewhat expert (periodically reading the latest commit point (segments_N file) while indexing), and the leak is fixed in 10.x. But there is maybe still some open risk if an app uses Lucene APIs to open other files, e.g. maybe the app does lots of deletes against old segments, and maps/unmaps those deletion files, and those "leak"? Or maybe these paths that reproduce the "leak" are so expert that users won't hit them in practice in 9.x / 10.x? Or maybe we should decrease the default max maps in a single arena from 1024?

Version and environment details

JDK 21 and 24, Lucene 9.12.x

mikemccand avatar Aug 14 '25 11:08 mikemccand

This test case repros the leak on 9.12.x:

import java.io.IOException;
import java.nio.file.Paths;
import java.util.concurrent.atomic.AtomicLong;
import java.util.Random;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.NoMergePolicy;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.SegmentInfos;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.store.MMapDirectory;
import org.apache.lucene.tests.util.LineFileDocs;

// export LUCENE_VERSION=9.12.3; java -fullversion; ./gradlew jar; javac -verbose /home/mike/TestSILeak.java -cp lucene/core/build/libs/lucene-core-$LUCENE_VERSION-SNAPSHOT.jar:lucene/test-framework/build/libs/lucene-test-framework-$LUCENE_VERSION-SNAPSHOT.jar; java -cp /home/mike:lucene/core/build/libs/lucene-core-$LUCENE_VERSION-SNAPSHOT.jar:lucene/test-framework/build/libs/lucene-test-framework-$LUCENE_VERSION-SNAPSHOT.jar:lucene/codecs/build/libs/lucene-codecs-$LUCENE_VERSION-SNAPSHOT.jar:/home/mike/.gradle/wrapper/dists/gradle-8.14-bin/38aieal9i53h9rfe7vjup95b9/gradle-8.14/lib/junit-4.13.2.jar:/home/mike/.gradle/wrapper/dists/gradle-8.14-bin/38aieal9i53h9rfe7vjup95b9/gradle-8.14/lib/hamcrest-core-1.3.jar:/home/mike/.gradle/caches/modules-2/files-2.1/com.carrotsearch.randomizedtesting/randomizedtesting-runner/2.8.1/55ffe691e90d31ab916746516654b5701e532d6f/randomizedtesting-runner-2.8.1.jar TestSILeak indextmp

public class TestSILeak {

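  public static void checkMapFile() throws IOException {
    Path filePath = Path.of("/proc/" + ProcessHandle.current().pid() + "/maps");
    long totLineCount;
    long siLineCount;

    // Files.lines keeps a file handle open until the stream is closed,
    // so close each stream explicitly instead of waiting for GC
    try (var lines = Files.lines(filePath)) {
      totLineCount = lines.count();
    }
    try (var lines = Files.lines(filePath)) {
      siLineCount = lines.filter(line -> line.contains(".si")).count();
    }

    System.out.println("MAP COUNT (from " + filePath + "): " + totLineCount + " [" + siLineCount + " si files]");
  }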

  public static void main(String[] args) throws IOException, InterruptedException {

    if (args.length != 1) {
      System.out.println("Usage: java TestSILeak <index dir>");
      System.exit(-1);
    }
    MMapDirectory dir = new MMapDirectory(Paths.get(args[0]));
    System.out.println("map file: /proc/" + ProcessHandle.current().pid() + "/maps");

    IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
    iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
    iwc.setUseCompoundFile(false);
    // so we are sure we get the N flushed segments:
    iwc.setMergePolicy(NoMergePolicy.INSTANCE);

    Random r = new Random(42 * 17);
    LineFileDocs lfd = new LineFileDocs(r);

    try (IndexWriter writer = new IndexWriter(dir, iwc)) {
      for(int iter=0;iter<10;iter++) {
        for(int i=0;i<1000;i++) {
          writer.addDocument(lfd.nextDoc());
        }
        writer.commit();
      }
    }

    // hold a reader open
    DirectoryReader reader = DirectoryReader.open(dir);
    System.out.println("\nREADER: " + reader);

    while (true) {
      // each open will add the N _seg.si files as new maps, until we hit the OS limit:
      SegmentInfos sis = SegmentInfos.readLatestCommit(dir);
      //System.out.println("SIS: " + sis);
      checkMapFile();
      Thread.sleep(1000);
    }
  }
}

Check out 9.12.x branch and run that long command-line comment, tweaking the paths properly to your env, and it produces output something like this:

openjdk full version "21.0.7+6"
Starting a Gradle Daemon, 4 busy and 4 incompatible Daemons could not be reused, use --status for details
BUILD SUCCESSFUL in 6s
WARNING: A restricted method in java.lang.foreign.Linker has been called
WARNING: java.lang.foreign.Linker::downcallHandle has been called by the unnamed module
WARNING: Use --enable-native-access=ALL-UNNAMED to avoid a warning for this module

Aug 14, 2025 8:05:57 AM org.apache.lucene.store.MemorySegmentIndexInputProvider <init>
INFO: Using MemorySegmentIndexInput and native madvise support with Java 21 or later; to disable start with -Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false
map file: /proc/206135/maps
Aug 14, 2025 8:05:57 AM org.apache.lucene.internal.vectorization.VectorizationProvider lookup
WARNING: Java vector incubator module is not readable. For optimal vector performance, pass '--add-modules jdk.incubator.vector' to enable Vector API.

READER: StandardDirectoryReader(segments_z:106 _p(9.12.3):C1000:[diagnostics={os=Linux, java.vendor=Arch Linux, os.arch=amd64, os.version=6.15.2-arch1-1, lucene.version=9.12.3, source=flush, timestamp=1755173157857, java.runtime.version=21.0.7+6}]:[attributes={Lucene90StoredFieldsFormat.mode=BEST_SPEED}] :id=1\
1pns8d06il0choqhaknlj9aj _q(9.12.3):C1000:[diagnostics={os=Linux, java.vendor=Arch Linux, os.arch=amd64, os.version=6.15.2-arch1-1, lucene.version=9.12.3, source=flush, timestamp=1755173158149, java.runtime.version=21.0.7+6}]:[attributes={Lucene90StoredFieldsFormat.mode=BEST_SPEED}] :id=11pns8d06il0choqhaknlj9\
an _r(9.12.3):C1000:[diagnostics={os=Linux, java.vendor=Arch Linux, os.arch=amd64, os.version=6.15.2-arch1-1, lucene.version=9.12.3, source=flush, timestamp=1755173158408, java.runtime.version=21.0.7+6}]:[attributes={Lucene90StoredFieldsFormat.mode=BEST_SPEED}] :id=11pns8d06il0choqhaknlj9ar _s(9.12.3):C1000:[d\
iagnostics={os=Linux, java.vendor=Arch Linux, os.arch=amd64, os.version=6.15.2-arch1-1, lucene.version=9.12.3, source=flush, timestamp=1755173158693, java.runtime.version=21.0.7+6}]:[attributes={Lucene90StoredFieldsFormat.mode=BEST_SPEED}] :id=11pns8d06il0choqhaknlj9av _t(9.12.3):C1000:[diagnostics={os=Linux, \
java.vendor=Arch Linux, os.arch=amd64, os.version=6.15.2-arch1-1, lucene.version=9.12.3, source=flush, timestamp=1755173158985, java.runtime.version=21.0.7+6}]:[attributes={Lucene90StoredFieldsFormat.mode=BEST_SPEED}] :id=11pns8d06il0choqhaknlj9az _u(9.12.3):C1000:[diagnostics={os=Linux, java.vendor=Arch Linux\
, os.arch=amd64, os.version=6.15.2-arch1-1, lucene.version=9.12.3, source=flush, timestamp=1755173159271, java.runtime.version=21.0.7+6}]:[attributes={Lucene90StoredFieldsFormat.mode=BEST_SPEED}] :id=11pns8d06il0choqhaknlj9b3 _v(9.12.3):C1000:[diagnostics={os=Linux, java.vendor=Arch Linux, os.arch=amd64, os.ve\
rsion=6.15.2-arch1-1, lucene.version=9.12.3, source=flush, timestamp=1755173159574, java.runtime.version=21.0.7+6}]:[attributes={Lucene90StoredFieldsFormat.mode=BEST_SPEED}] :id=11pns8d06il0choqhaknlj9b7 _w(9.12.3):C1000:[diagnostics={os=Linux, java.vendor=Arch Linux, os.arch=amd64, os.version=6.15.2-arch1-1, \
lucene.version=9.12.3, source=flush, timestamp=1755173159887, java.runtime.version=21.0.7+6}]:[attributes={Lucene90StoredFieldsFormat.mode=BEST_SPEED}] :id=11pns8d06il0choqhaknlj9bb _x(9.12.3):C1000:[diagnostics={os=Linux, java.vendor=Arch Linux, os.arch=amd64, os.version=6.15.2-arch1-1, lucene.version=9.12.3,\
 source=flush, timestamp=1755173160211, java.runtime.version=21.0.7+6}]:[attributes={Lucene90StoredFieldsFormat.mode=BEST_SPEED}] :id=11pns8d06il0choqhaknlj9bf _y(9.12.3):C1000:[diagnostics={os=Linux, java.vendor=Arch Linux, os.arch=amd64, os.version=6.15.2-arch1-1, lucene.version=9.12.3, source=flush, timesta\
mp=1755173160536, java.runtime.version=21.0.7+6}]:[attributes={Lucene90StoredFieldsFormat.mode=BEST_SPEED}] :id=11pns8d06il0choqhaknlj9bj)
MAP COUNT (from /proc/206135/maps): 512 [10 si files]
MAP COUNT (from /proc/206135/maps): 521 [20 si files]
MAP COUNT (from /proc/206135/maps): 531 [30 si files]
MAP COUNT (from /proc/206135/maps): 541 [40 si files]
MAP COUNT (from /proc/206135/maps): 551 [50 si files]
MAP COUNT (from /proc/206135/maps): 561 [60 si files]
MAP COUNT (from /proc/206135/maps): 571 [70 si files]
MAP COUNT (from /proc/206135/maps): 581 [80 si files]
MAP COUNT (from /proc/206135/maps): 591 [90 si files]
MAP COUNT (from /proc/206135/maps): 601 [100 si files]
MAP COUNT (from /proc/206135/maps): 611 [110 si files]
MAP COUNT (from /proc/206135/maps): 621 [120 si files]
MAP COUNT (from /proc/206135/maps): 631 [130 si files]
MAP COUNT (from /proc/206135/maps): 641 [140 si files]
MAP COUNT (from /proc/206135/maps): 651 [150 si files]
MAP COUNT (from /proc/206135/maps): 661 [160 si files]
MAP COUNT (from /proc/206135/maps): 671 [170 si files]
MAP COUNT (from /proc/206135/maps): 681 [180 si files]
MAP COUNT (from /proc/206135/maps): 691 [190 si files]
MAP COUNT (from /proc/206135/maps): 701 [200 si files]
MAP COUNT (from /proc/206135/maps): 711 [210 si files]
MAP COUNT (from /proc/206135/maps): 721 [220 si files]
MAP COUNT (from /proc/206135/maps): 731 [230 si files]
MAP COUNT (from /proc/206135/maps): 741 [240 si files]
MAP COUNT (from /proc/206135/maps): 751 [250 si files]
MAP COUNT (from /proc/206135/maps): 761 [260 si files]
MAP COUNT (from /proc/206135/maps): 771 [270 si files]
MAP COUNT (from /proc/206135/maps): 781 [280 si files]
MAP COUNT (from /proc/206135/maps): 791 [290 si files]
MAP COUNT (from /proc/206135/maps): 801 [300 si files]
MAP COUNT (from /proc/206135/maps): 811 [310 si files]
MAP COUNT (from /proc/206135/maps): 821 [320 si files]
MAP COUNT (from /proc/206135/maps): 831 [330 si files]
MAP COUNT (from /proc/206135/maps): 841 [340 si files]
MAP COUNT (from /proc/206135/maps): 851 [350 si files]
MAP COUNT (from /proc/206135/maps): 861 [360 si files]
MAP COUNT (from /proc/206135/maps): 871 [370 si files]
MAP COUNT (from /proc/206135/maps): 881 [380 si files]
MAP COUNT (from /proc/206135/maps): 891 [390 si files]
MAP COUNT (from /proc/206135/maps): 901 [400 si files]

I'm not sure why my gradle leaks daemons like a sieve.

mikemccand avatar Aug 14 '25 12:08 mikemccand

Assuming JDK 21+ and Lucene 9.12.x: I have not looked in detail, but SegmentInfos.readLatestCommit uses IOContext.READONCE, which means that we should use a confined Arena, and that arena should be closed, which would in turn release the mapping. Clearly something is wrong, but it seems more like a bug in 9.x, since the usage in readLatestCommit should result in a confined arena, rather than the more complex shared arena code path.

ChrisHegarty avatar Aug 14 '25 13:08 ChrisHegarty

Can we also look into #15054? It looks related, and maybe this explains OpenSearch's problem in their RemoteDirectory. It is unclear what they are doing, but they also leak maps. As they don't explain what they're doing I gave up, but it looks related to some extent. I know it's a different Amazon business, but the complaints look similar.

I'd strongly recommend lowering the 1024 default. This has proven problematic in many places. IMHO, 64 is much better for pooling.

uschindler avatar Aug 14 '25 14:08 uschindler

We should check if there's some general leak in the shared arenas in 9.x.

uschindler avatar Aug 14 '25 14:08 uschindler

I'm not sure why my gradle leaks daemons like a sieve.

Hahaha. 🤓🐰

uschindler avatar Aug 14 '25 14:08 uschindler

I'd strongly recommend lowering the 1024 default. This has proven problematic in many places. IMHO, 64 is much better for pooling.

Agreed.

ChrisHegarty avatar Aug 14 '25 15:08 ChrisHegarty

In 9.12.x I see the index input being opened with IOContext.READ, from SegmentInfos.parseSegmentInfos, which means that it will not use a confined arena. :-(

	at org.apache.lucene.store.MemorySegmentIndexInputProvider.openInput(MemorySegmentIndexInputProvider.java:71)
	at org.apache.lucene.store.MemorySegmentIndexInputProvider.openInput(MemorySegmentIndexInputProvider.java:33)
	at org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:394)
	at org.apache.lucene.store.Directory.openChecksumInput(Directory.java:156)
	at org.apache.lucene.codecs.lucene99.Lucene99SegmentInfoFormat.read(Lucene99SegmentInfoFormat.java:94)
	at org.apache.lucene.index.SegmentInfos.parseSegmentInfos(SegmentInfos.java:419)
	at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:376)
	at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:312)
	at org.apache.lucene.index.SegmentInfos$1.doBody(SegmentInfos.java:554)
	at org.apache.lucene.index.SegmentInfos$1.doBody(SegmentInfos.java:551)
	at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:828)
	at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:778)
	at org.apache.lucene.index.SegmentInfos.readLatestCommit(SegmentInfos.java:556)
	at org.apache.lucene.index.SegmentInfos.readLatestCommit(SegmentInfos.java:540)
	at TestSILeak.main(TestSILeak.java:69)
...

ChrisHegarty avatar Aug 14 '25 16:08 ChrisHegarty

Maybe this also explains why so many people have problems with 9.x, especially when they reopen indexes all the time. There are many issues open about this. The OpenSearch problem may be related to it, but it could also be caused by the fact that they are already close to the limit of maps when the issue appears.

uschindler avatar Aug 14 '25 16:08 uschindler

As a quick workaround for 9.x one could use:

mmapDir.setGroupingFunction(fn -> fn.endsWith(".si") ? Optional.empty() : MMapDirectory.GROUP_BY_SEGMENT.apply(fn));

Uwe
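To make the shape of that grouping function concrete, here is a small stand-alone model. It is only a sketch: BY_SEGMENT is a hypothetical, simplified stand-in for MMapDirectory.GROUP_BY_SEGMENT (the real function's key format may differ), and the assumption is that returning Optional.empty() means "do not pool this file into a per-segment arena", as the workaround implies.

```java
import java.util.Optional;
import java.util.function.Function;

public class GroupingSketch {

  /** Hypothetical simplified grouping: "_7_Lucene99_0.doc" and "_7.fdt" both map to "_7". */
  public static final Function<String, Optional<String>> BY_SEGMENT = fn -> {
    if (!fn.startsWith("_")) {
      return Optional.empty(); // e.g. segments_N is not a per-segment file
    }
    int end = 1;
    while (end < fn.length() && fn.charAt(end) != '_' && fn.charAt(end) != '.') {
      end++;
    }
    return Optional.of(fn.substring(0, end));
  };

  /** Same shape as the workaround: .si files get no group, everything else is grouped. */
  public static final Function<String, Optional<String>> SKIP_SI =
      fn -> fn.endsWith(".si") ? Optional.empty() : BY_SEGMENT.apply(fn);

  public static void main(String[] args) {
    for (String fn : new String[] {"_7.si", "_7.fdt", "_7_Lucene99_0.doc", "segments_3"}) {
      System.out.println(fn + " -> " + SKIP_SI.apply(fn));
    }
  }
}
```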

uschindler avatar Aug 14 '25 16:08 uschindler

@ChrisHegarty: I was reading the code of the RefCountedArena. I still don't understand how this works with the lower and upper 16 bits. I still have the feeling that there is some bug somewhere. I also did not find any test for that code.

I am wondering about people who told me that they used maxPermits=1 and still run out of mappings... (see #15054).

uschindler avatar Aug 14 '25 17:08 uschindler

@ChrisHegarty: I was reading the code of the RefCountedArena. I still don't understand how this works with the lower and upper 16 bits. I still have the feeling that there is some bug somewhere. I also did not find any test for that code.

I am wondering about people who told me that they used maxPermits=1 and still run out of mappings... (see #15054).

We can also discuss this in the other issue. I am still staring at the code and don't get how it works with decrementing. What's the difference between acquires (upper 16 bits) and counts (lower 16 bits), when it actually only decrements the lower 16 bits? Sorry, maybe it's too hot with 35 degrees in Bremen...

uschindler avatar Aug 14 '25 17:08 uschindler

I'm not sure why my gradle leaks daemons like a sieve.

Hahaha. 🤓🐰

This is probably because you're using different vms or gradle options (jvm options). I can't say. You (we) have three options - run without the daemon (Robert's choice), lower daemon expiration timeout, write our own build system from scratch. I'm recently leaning towards (3)...

dweiss avatar Aug 15 '25 06:08 dweiss

I'm not sure why my gradle leaks daemons like a sieve.

Hahaha. 🤓🐰

This is probably because you're using different vms or gradle options (jvm options). I can't say. You (we) have three options - run without the daemon (Robert's choice), lower daemon expiration timeout, write our own build system from scratch. I'm recently leaning towards (3)...

Let's start a project called "Policeman Build Slave" which is named non-inclusively because your computer has to do what you want and not start any arbitrary daemons. Policeman Build Slave would be written in pure Java with its own bytecode-compiled task language and forbids any bullshit by default with integrated linters everywhere.

uschindler avatar Aug 15 '25 06:08 uschindler

We can also discuss this in the other issue. I am still staring at the code and don't get how it works with decrementing. What's the difference between acquires (upper 16 bits) and counts (lower 16 bits), when it actually only decrements the lower 16 bits? Sorry, maybe it's too hot with 35 degrees in Bremen...

it's a balmy 28 here at the Baltic seaside but I also don't get the code in RefCountedSharedArena.

dweiss avatar Aug 15 '25 06:08 dweiss

I think we should at least write a unit test just for the ref counted shared arena (it's easily possible in main). The current tests only test mmap dir and check that it does not break, but we do not have any test in main that creates an arena and tests it. A test would also explain how it should work. It may also explain why you need 2 separate reference counters.

uschindler avatar Aug 15 '25 06:08 uschindler

Ok, I think I get it - the state counts acquire/release permits in both the upper and the lower part of state variable; the distinction is made to differentiate the closed vs nothing-to-release message. Not sure if it's needed or if it could be made simpler.
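For readers following along, the idea can be modelled in a few lines. This is a simplified sketch written from the description in this thread, not a copy of Lucene's RefCountedSharedArena: the class name, the constants and the exact close rule are assumptions. The upper 16 bits hold the acquires still allowed (they only ever go down), the lower 16 bits hold the current reference count, and a state of 0 means closed.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Simplified model: permit-limited ref count packed into one int (assumed layout). */
public class PackedRefCount {
  static final int CLOSED = 0;
  static final int REMAINING_UNIT = 1 << 16;           // one permit in the upper 16 bits
  static final int ACQUIRE_DELTA = REMAINING_UNIT - 1; // subtracting it: remaining - 1, count + 1

  private final AtomicInteger state;
  private final Runnable onClose;

  public PackedRefCount(int maxPermits, Runnable onClose) {
    if (maxPermits < 1 || maxPermits > 0x7FFF) {
      throw new IllegalArgumentException("maxPermits out of range: " + maxPermits);
    }
    this.state = new AtomicInteger(maxPermits << 16);
    this.onClose = onClose;
  }

  /** Returns false if closed or if all permits were handed out already. */
  public boolean acquire() {
    while (true) {
      int value = state.get();
      if (value == CLOSED || (value >>> 16) == 0) {
        return false;
      }
      if (state.compareAndSet(value, value - ACQUIRE_DELTA)) {
        return true;
      }
    }
  }

  /** Drops one reference; the last release closes, even if permits remain. */
  public void release() {
    while (true) {
      int value = state.get();
      int count = value & 0xFFFF;
      if (count == 0) {
        // the two halves let us tell "already closed" apart from "never acquired"
        throw new IllegalStateException(value == CLOSED ? "closed" : "nothing to release");
      }
      int newValue = count == 1 ? CLOSED : value - 1;
      if (state.compareAndSet(value, newValue)) {
        if (newValue == CLOSED) {
          onClose.run();
        }
        return;
      }
    }
  }

  public int remaining() { return state.get() >>> 16; }
  public int count() { return state.get() & 0xFFFF; }
  public boolean isClosed() { return state.get() == CLOSED; }
}
```

In this model only the lower half is decremented on release because the upper half is a budget, not a count: it never grows back, which is what bounds the number of maps a long-lived arena can accumulate.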

dweiss avatar Aug 15 '25 07:08 dweiss

Amazingly, ChatGPT also got the gist of it. Wild. https://chatgpt.com/c/689edb07-79c8-8326-bfe2-6e9b28f5a452

dweiss avatar Aug 15 '25 07:08 dweiss

The code seems correct to me (and I ran a simple test to verify). It's more likely that the reference to a RefCountedArena object itself is lost somewhere. This can be debugged with some trickery - add a static Cleaner that tracks gc on the allocated Arenas and makes sure their state count is zeroed/released when arena is gc'ed. If it's not zeroed - it means ref counting failed somewhere. It's a bit convoluted but works - I have code like this used somewhere. I can't work on it today - will be travelling - but if needed, let me know.
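A minimal sketch of that Cleaner trick, using only the JDK. TrackedResource is a hypothetical stand-in for a ref-counted arena and the names are invented here; the linked PR may do this differently. The important detail is that the cleanup action captures only the counter and the name, never the tracked object itself, otherwise the object could never become unreachable.

```java
import java.lang.ref.Cleaner;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicInteger;

public class LeakTracker {
  private static final Cleaner CLEANER = Cleaner.create();

  /** Resources that were cleaned while references were still outstanding. */
  public static final List<String> LEAKS = new CopyOnWriteArrayList<>();

  /** Hypothetical stand-in for a ref-counted arena. */
  public static final class TrackedResource {
    private final AtomicInteger refCount = new AtomicInteger();
    private final Cleaner.Cleanable cleanable;

    public TrackedResource(String name) {
      AtomicInteger counter = this.refCount; // capture the counter, not "this"
      this.cleanable = CLEANER.register(this, () -> {
        int outstanding = counter.get();
        if (outstanding != 0) {
          LEAKS.add(name + " (outstanding=" + outstanding + ")");
        }
      });
    }

    public void acquire() { refCount.incrementAndGet(); }

    public void release() { refCount.decrementAndGet(); }

    /** Runs the cleanup action now, at most once; normally the GC triggers it. */
    public void cleanNow() { cleanable.clean(); }
  }
}
```

In a real run the action fires some time after the object is garbage collected, so a test would typically null out its references, call System.gc() a few times and then inspect LEAKS; cleanNow() exists only to exercise the check deterministically.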

dweiss avatar Aug 15 '25 07:08 dweiss

I'm in awe. I asked chatgpt to recreate what I have written in the past and it did a flawless job, actually. Wow.

Anyway. https://github.com/apache/lucene/pull/15077 has what I had in mind - I wonder if it'll be able to track the problem.

dweiss avatar Aug 15 '25 07:08 dweiss

The tests seem to pass with the ref counting trick for me. Maybe Mike's repro will fail though - something worth trying.

dweiss avatar Aug 15 '25 07:08 dweiss

Oh, sorry. I checked this on main, not 9x. Maybe it'll help to diagnose the issue on 9x though!

dweiss avatar Aug 15 '25 07:08 dweiss

The tests seem to pass with the ref counting trick for me. Maybe Mike's repro will fail though - something worth trying. Oh, sorry. I checked this on main, not 9x. Maybe it'll help to diagnose the issue on 9x though!

Actually the reason for the above issue is already known in 9.x, it is because the .si files are not opened with READONCE.

Helping to diagnose leaking shared arenas may help people in #15054, although I don't think the OpenSearch issue is really related to a bug in that reference counting. It is just that they have too many mappings from the beginning, and they also use MMapDirectory for replication (I don't understand why they do this); it's a problem in OpenSearch and not Lucene. The Lucene directory abstractions are optimized to read/write indexes (random access), but mmap is not made for sequential access to copy data over the wire. You can do this with READONCE, but then you have to live with the fact that it needs to be single-threaded.

uschindler avatar Aug 15 '25 08:08 uschindler

The problem with Mike's code is mainly that he calls SegmentInfos.readLatestCommit over and over for the same segments. Of course, when you open a new IndexInput, it creates a new mapping each time (our ref-counted arena is not able to reuse existing mappings).

uschindler avatar Aug 15 '25 08:08 uschindler

The reason I didn't add unit tests for RefCountedSharedArena was that it was in the versioned section when first added. We should add unit tests for it now, at least in main.

(I have reduced connectivity due to travel, please do not wait on me, but I will try to help)

ChrisHegarty avatar Aug 15 '25 10:08 ChrisHegarty

Maybe setting -Dorg.apache.lucene.store.MMapDirectory.sharedArenaMaxPermits to a smallish value (defaults to 1024) would work too, not sure.

I tested -Dorg.apache.lucene.store.MMapDirectory.sharedArenaMaxPermits=1 and it does indeed prevent the leak with my small test case on 9.12.x, JDK 21:

MAP COUNT (from /proc/234634/maps): 508 [0 si files]
MAP COUNT (from /proc/234634/maps): 508 [0 si files]
MAP COUNT (from /proc/234634/maps): 508 [0 si files]
MAP COUNT (from /proc/234634/maps): 508 [0 si files]
MAP COUNT (from /proc/234634/maps): 508 [0 si files]
MAP COUNT (from /proc/234634/maps): 508 [0 si files]
MAP COUNT (from /proc/234634/maps): 508 [0 si files]
MAP COUNT (from /proc/234634/maps): 508 [0 si files]

mikemccand avatar Aug 15 '25 11:08 mikemccand

I am wondering about people who told me that they used maxPermits=1 and still run out of mappings... (see https://github.com/apache/lucene/issues/15054).

Yeah it's odd it didn't help that issue ... it sidesteps the issue in 9.12.x (before @ChrisHegarty's fix to switch _N.si to READONCE so it uses "confined arena").

mikemccand avatar Aug 15 '25 11:08 mikemccand

I'm not sure why my gradle leaks daemons like a sieve.

Hahaha. 🤓🐰

This is probably because you're using different vms or gradle options (jvm options). I can't say. You (we) have three options - run without the daemon (Robert's choice), lower daemon expiration timeout, write our own build system from scratch. I'm recently leaning towards (3)...

LOL -- I love option 3 -- I've done this too many times in the past!

Thanks @dweiss. I do jump around with different JVMs, different Lucene branches, etc. Probably freaks gradle out ...

mikemccand avatar Aug 15 '25 11:08 mikemccand

Amazingly, ChatGPT also got the gist of it. Wild. https://chatgpt.com/c/689edb07-79c8-8326-bfe2-6e9b28f5a452

Curiously ChatGPT won't load this conversation ... just a red error box saying "Unable to load conversation....". Are you sure you published / made it public? Sharing is caring.

mikemccand avatar Aug 15 '25 11:08 mikemccand

Maybe setting -Dorg.apache.lucene.store.MMapDirectory.sharedArenaMaxPermits to a smallish value (defaults to 1024) would work too, not sure.

I tested -Dorg.apache.lucene.store.MMapDirectory.sharedArenaMaxPermits=1 and it does indeed prevent the leak with my small test case on 9.12.x, JDK 21:

MAP COUNT (from /proc/234634/maps): 508 [0 si files]
MAP COUNT (from /proc/234634/maps): 508 [0 si files]
MAP COUNT (from /proc/234634/maps): 508 [0 si files]
MAP COUNT (from /proc/234634/maps): 508 [0 si files]
MAP COUNT (from /proc/234634/maps): 508 [0 si files]
MAP COUNT (from /proc/234634/maps): 508 [0 si files]
MAP COUNT (from /proc/234634/maps): 508 [0 si files]
MAP COUNT (from /proc/234634/maps): 508 [0 si files]

I assume my non sysprop fix also fixes the issue:

As a quick workaround for 9.x one could use:

mmapDir.setGroupingFunction(fn -> fn.endsWith(".si") ? Optional.empty() : MMapDirectory.GROUP_BY_SEGMENT.apply(fn));

Uwe

uschindler avatar Aug 15 '25 11:08 uschindler

I tested 64 max permits (the fix in https://github.com/apache/lucene/pull/15078) and we leak for a bit and then the leaks stop, yay!

...
MAP COUNT (from /proc/235240/maps): 928 [420 si files]
MAP COUNT (from /proc/235240/maps): 938 [430 si files]
MAP COUNT (from /proc/235240/maps): 948 [440 si files]
MAP COUNT (from /proc/235240/maps): 958 [450 si files]
MAP COUNT (from /proc/235240/maps): 968 [460 si files]
MAP COUNT (from /proc/235240/maps): 978 [470 si files]
MAP COUNT (from /proc/235240/maps): 988 [480 si files]
MAP COUNT (from /proc/235240/maps): 998 [490 si files]
MAP COUNT (from /proc/235240/maps): 1008 [500 si files]
MAP COUNT (from /proc/235240/maps): 1018 [510 si files]
MAP COUNT (from /proc/235240/maps): 1028 [520 si files]
MAP COUNT (from /proc/235240/maps): 1028 [520 si files]
MAP COUNT (from /proc/235240/maps): 1028 [520 si files]
MAP COUNT (from /proc/235240/maps): 1028 [520 si files]
MAP COUNT (from /proc/235240/maps): 1028 [520 si files]
MAP COUNT (from /proc/235240/maps): 1028 [520 si files]
MAP COUNT (from /proc/235240/maps): 1028 [520 si files]
MAP COUNT (from /proc/235240/maps): 1028 [520 si files]
MAP COUNT (from /proc/235240/maps): 1028 [520 si files]
MAP COUNT (from /proc/235240/maps): 1028 [520 si files]
MAP COUNT (from /proc/235240/maps): 1028 [520 si files]

So this bounds the max maps for any held-open index segment to 64. Realistically, any app that keeps loading the segments file is presumably doing so because the index is changing and segments are turning over, and then the arenas will be closed, even with leaking. So the real-world exposure risk seems relatively low, but I'm still curious about https://github.com/apache/lucene/issues/15054.
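The plateau can be reproduced with a toy model. This is a sketch under stated assumptions, not Lucene code: each segment has one pooled arena that hands out at most maxPermits maps, an exhausted arena is replaced by a fresh one, an arena unmaps everything once its reference count returns to zero, and the held-open reader is assumed to keep 12 files open per segment (a made-up number, chosen only because 64 - 12 = 52 matches the 520 .si maps across 10 segments above).

```java
import java.util.HashMap;
import java.util.Map;

public class PermitPlateau {

  /** Toy arena: its maps stay alive until its ref count drops back to zero. */
  static final class ToyArena {
    final String segment;
    int remaining; // acquires still allowed
    int refCount;  // currently open inputs
    int liveMaps;  // maps created here and not yet unmapped

    ToyArena(String segment, int maxPermits) {
      this.segment = segment;
      this.remaining = maxPermits;
    }
  }

  private final int maxPermits;
  private final Map<String, ToyArena> arenas = new HashMap<>();
  private int totalLiveMaps;

  PermitPlateau(int maxPermits) {
    this.maxPermits = maxPermits;
  }

  ToyArena open(String segment) {
    ToyArena arena = arenas.get(segment);
    if (arena == null || arena.remaining == 0) {
      // no arena yet, or permits exhausted: start a fresh one for this segment
      arena = new ToyArena(segment, maxPermits);
      arenas.put(segment, arena);
    }
    arena.remaining--;
    arena.refCount++;
    arena.liveMaps++;
    totalLiveMaps++;
    return arena;
  }

  void close(ToyArena arena) {
    if (--arena.refCount == 0) {
      // last reference gone: the whole arena is unmapped at once
      totalLiveMaps -= arena.liveMaps;
      arena.liveMaps = 0;
      arenas.remove(arena.segment, arena);
    }
  }

  /** Returns the number of lingering .si maps after the given number of polls. */
  public static int simulate(int segments, int filesPerSegment, int maxPermits, int polls) {
    PermitPlateau dir = new PermitPlateau(maxPermits);
    for (int s = 0; s < segments; s++) {
      for (int f = 0; f < filesPerSegment; f++) {
        dir.open("_" + s); // held open by the reader, never closed
      }
    }
    int heldByReader = dir.totalLiveMaps;
    for (int p = 0; p < polls; p++) {
      for (int s = 0; s < segments; s++) {
        dir.close(dir.open("_" + s)); // open and close one .si per segment
      }
    }
    return dir.totalLiveMaps - heldByReader;
  }

  public static void main(String[] args) {
    for (int permits : new int[] {1, 64, 1024}) {
      System.out.println("maxPermits=" + permits + " -> lingering .si maps after 200 polls: "
          + simulate(10, 12, permits, 200));
    }
  }
}
```

Under these assumptions the model gives 0 lingering maps with 1 permit, a plateau of 520 with 64 permits, and steady growth (2000 after 200 polls) with 1024 permits.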

mikemccand avatar Aug 15 '25 11:08 mikemccand