Remove mmap isLoaded check before madvise
The counter guarding the madvise callback is not all that effective in preventing madvise calls on slices since it's not shared with the original input, and this keeps showing up hot when profiling ES benchmarks.
An outright madvise call should be about as expensive as the isLoaded check when things are already in the page cache, and cheaper than that same call plus a call to isLoaded that returns false, simply by avoiding an extra round trip into the kernel.
-> remove the isLoaded guard to speed up these cases
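For illustration, a minimal before/after sketch of the idea. This is not the actual MemorySegmentIndexInput code: the counter name matches the one discussed below, but the threshold value and the madvise helper are placeholders.

```java
import java.lang.foreign.MemorySegment;

// Illustrative sketch only -- not the actual Lucene code.
class PrefetchGuardSketch {
  private static final int HIT_THRESHOLD = 16; // illustrative value only
  private int consecutivePrefetchHitCount;     // per-IndexInput, reset to 0 on clone/slice

  // Before: every prefetch first asks the kernel whether the range is resident.
  void prefetchGuarded(MemorySegment segment, long offset, long length) {
    if (consecutivePrefetchHitCount >= HIT_THRESHOLD) {
      return; // assume the file is mostly cached, skip both syscalls
    }
    if (segment.asSlice(offset, length).isLoaded()) {
      consecutivePrefetchHitCount++; // already resident, count the hit and skip madvise
      return;
    }
    consecutivePrefetchHitCount = 0;
    madviseWillNeed(segment, offset, length);
  }

  // After: advise unconditionally. On a cached range madvise(WILLNEED) costs roughly as
  // much as the residency check; on an uncached range we save the extra kernel round trip.
  void prefetchUnguarded(MemorySegment segment, long offset, long length) {
    madviseWillNeed(segment, offset, length);
  }

  private void madviseWillNeed(MemorySegment segment, long offset, long length) {
    // Placeholder for a native madvise(MADV_WILLNEED) downcall (e.g. via Lucene's NativeAccess).
  }
}
```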
Excerpt from ES profile motivating this:
Checking the page table for residency costs on the order of many thousands of cycles in practice. Plus, the racy nature of such a check reduces its value in memory-constrained scenarios.
The overhead of MS::isLoaded is certainly not a good tradeoff here, as can be seen from the profile that you posted @original-brownbear. With the new code, consecutivePrefetchHitCount doesn't seem effective at all anymore.
Additionally, and orthogonally, I'm less sure of the checks anyway, since they don't check against the actual memory range but rather against the whole segment.
An outright madvise call should be about as expensive as the isLoaded check when things are already in the page cache
The PR where consecutivePrefetchHitCount was introduced had a benchmark that demonstrated an improvement with the current logic: https://github.com/apache/lucene/pull/13381#issuecomment-2118319218. That said, this benchmark had a different access pattern compared to this one, which I guess is terms dictionary lookups performed by TermQuery, so I can believe that it's not always better. I wonder if you are able to create a benchmark that shows an improvement with this change.
The counter guarding the madvise callback is not all that effective in preventing madvise calls on slices since it's not shared with the original input
Thinking out loud, this isLoaded() check only makes sense when prefetch() is called multiple times on the same IndexInput, which is never the case for terms dictionary lookups (as you pointed out). So another way of improving things here may be to disable the isLoaded() check on the first (or first few) calls to prefetch().
Additionally, and orthogonally, I'm less sure of the checks anyway, since they don't check against the actual memory range but rather against the whole segment.
These checks make an assumption that if many ranges of bytes on which prefetch() has been called were already loaded in memory, then there are good chances that most of the segment is loaded in memory.
@jpountz
was introduced had a benchmark that demonstrated an improvement with the current logic
Huh, those results are quite unexpected, I must admit :) When Chris and I reasoned about this, the idea was that madvise ("will need") over a range that is fully cached would be the same cost as the isLoaded check. If that assumption holds, this PR would be a clear winner, but that benchmark seems to suggest otherwise?
Well, you may be right as well that the cost of MS::isLoaded is of a similar order of magnitude as madvise. What the current logic does is that if you get MS::isLoaded to frequently return true (suggesting that a good share of the MS is loaded), then both the call to MS::isLoaded and the call to madvise will be skipped, this is likely where the speedup from the benchmark that I linked is coming from.
What do you think of skipping the isLoaded() check for the first per-IndexInput call to prefetch()? This should help in the case that you identified (single prefetch() call per IndexInput slice instance), while not defeating the current logic to skip calls to MS::isLoaded / madvise for index inputs that appear to mostly be loaded already when doing several prefetch() calls on the same slice.
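For concreteness, a rough sketch of that suggestion, assuming illustrative names and thresholds rather than the actual MemorySegmentIndexInput logic: the first call(s) on an input bypass the residency check and go straight to madvise, while later calls keep the existing throttle.

```java
import java.lang.foreign.MemorySegment;

// Sketch of the suggestion above, not actual Lucene code; names and thresholds are stand-ins.
class FirstCallSkipSketch {
  private static final int WARMUP_CALLS = 1;   // how many initial calls bypass isLoaded()
  private static final int HIT_THRESHOLD = 16; // illustrative throttle threshold

  private int prefetchCalls;
  private int consecutivePrefetchHitCount;

  void prefetch(MemorySegment segment, long offset, long length) {
    if (prefetchCalls++ < WARMUP_CALLS) {
      madviseWillNeed(segment, offset, length); // first call: no residency check at all
      return;
    }
    if (consecutivePrefetchHitCount >= HIT_THRESHOLD) {
      return; // looks mostly cached already, skip both isLoaded() and madvise
    }
    if (segment.asSlice(offset, length).isLoaded()) {
      consecutivePrefetchHitCount++;
      return;
    }
    consecutivePrefetchHitCount = 0;
    madviseWillNeed(segment, offset, length);
  }

  private void madviseWillNeed(MemorySegment segment, long offset, long length) {
    // Placeholder for a native madvise(MADV_WILLNEED) downcall.
  }
}
```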
@jpountz I see. Hmm, I wonder how much that saves us? Seems we just trade an isLoaded for a madvise on systems with enough memory? That said, maybe the madvise is far cheaper because it doesn't necessarily cover the whole segment like the isLoaded check does?
EDIT: sorry, scratch that question, both run against the same code, I misread things. So I'm not sure how much that buys us on the fast path. Fundamentally, what it comes down to on my machine when benchmarking is that if everything's in the cache, then either madvise or isLoaded incurs significant overhead for no gain. Not sure there's a no-tradeoffs way out of this, probably not :) The faster your disk and the more memory you have, the more pronounced the downsides of this logic, and vice versa for the gains :)
Maybe the trick here would be to keep track of the loaded state in a way that works across slices so that we throttle more efficiently?
Seems we just trade an isLoaded for a madvise on systems with enough memory?
This is correct. I made this suggestion because it was similar to your initial proposal: skipping the MS::isLoaded check and doing a madvise call directly instead.
Maybe the trick here would be to keep track of the loaded state in a way that works across slices so that we throttle more efficiently?
I'm good with exploring sharing state across slices as well. I optimized for simplicity with the current implementation, but we may be able to figure out something that works well across slices as well.
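One possible shape of that, sketched with hypothetical names rather than actual Lucene code: have the parent input own the hit counter and hand the same instance to clones/slices, e.g. via a shared AtomicInteger, so the throttle state survives cloning.

```java
import java.lang.foreign.MemorySegment;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of sharing prefetch-hit state across clones/slices; names and threshold are illustrative.
class SharedPrefetchStateSketch {
  private static final int HIT_THRESHOLD = 16; // illustrative

  private final MemorySegment segment;
  private final AtomicInteger sharedHitCount; // shared by the parent input and all clones/slices

  SharedPrefetchStateSketch(MemorySegment segment) {
    this(segment, new AtomicInteger());
  }

  private SharedPrefetchStateSketch(MemorySegment segment, AtomicInteger sharedHitCount) {
    this.segment = segment;
    this.sharedHitCount = sharedHitCount;
  }

  SharedPrefetchStateSketch slice(long offset, long length) {
    // Child sees the parent's counter instead of a fresh one reset to 0.
    return new SharedPrefetchStateSketch(segment.asSlice(offset, length), sharedHitCount);
  }

  void prefetch(long offset, long length) {
    if (sharedHitCount.get() >= HIT_THRESHOLD) {
      return; // the whole file looks mostly cached, skip the syscalls
    }
    if (segment.asSlice(offset, length).isLoaded()) {
      sharedHitCount.incrementAndGet();
    } else {
      sharedHitCount.set(0);
      madviseWillNeed(segment, offset, length);
    }
  }

  private void madviseWillNeed(MemorySegment segment, long offset, long length) {
    // Placeholder for a native madvise(MADV_WILLNEED) downcall.
  }
}
```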
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!
In our use case (Amazon customer-facing product search) our indices are nearly always hot (sometimes not, and that quickly gets exciting), and Lucene's efforts to prefetch (madvise(WILL_NEED) and maybe isLoaded()) seem to hurt ... we see ~3% improvement in query throughput (red-line QPS), at least in offline benchmarks, by forcefully disabling some of the prefetching.
I went digging into the MemorySegment.isLoaded() implementation ... its name sounds so innocuous, like a getter, but it's doing a lot of stuff under that innocuous name. It malloc()s an array (length numPages) to hold the per-page 1 or 0 results returned from the Linux kernel's mincore() API -- an inefficient bitset! (one byte per bit! maybe kernel developers thought userland developers might not understand bits/bytes?) -- and then it's O(numPages) since the Linux kernel is digging through its VM page table to set those bits (hmm: I wonder how this works with huge pages, transparent or not?). And holy smokes is the mincore implementation hairy! Then it's another for loop in isLoaded to check the returned array (per-page 0 or 1) to see if every page is loaded, so the hot case has the worst run time, I think (it runs through the whole loop only to discover that every page is loaded).
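For reference, a hedged sketch of roughly what that boils down to on Linux, written against the java.lang.foreign API rather than the JDK's actual internals: allocate one byte per page, call mincore(2), then scan the whole array. It assumes a mapped, page-aligned segment and skips real errno handling.

```java
import java.lang.foreign.*;
import java.lang.invoke.MethodHandle;

// Rough illustration of the isLoaded() cost model on Linux; not the JDK's actual implementation.
class MincoreSketch {
  private static final Linker LINKER = Linker.nativeLinker();
  // int mincore(void *addr, size_t length, unsigned char *vec); Linux only.
  private static final MethodHandle MINCORE = LINKER.downcallHandle(
      LINKER.defaultLookup().find("mincore").orElseThrow(),
      FunctionDescriptor.of(ValueLayout.JAVA_INT,
          ValueLayout.ADDRESS, ValueLayout.JAVA_LONG, ValueLayout.ADDRESS));

  static boolean isLoaded(MemorySegment mapped, long pageSize) throws Throwable {
    long numPages = (mapped.byteSize() + pageSize - 1) / pageSize;
    try (Arena arena = Arena.ofConfined()) {
      // One byte per page -- the "inefficient bitset" mentioned above.
      MemorySegment vec = arena.allocate(numPages);
      int rc = (int) MINCORE.invokeExact(mapped, mapped.byteSize(), vec);
      if (rc != 0) {
        return false; // simplified: real code would inspect errno
      }
      // O(numPages) scan; the fully-cached case walks the entire array.
      for (long i = 0; i < numPages; i++) {
        if ((vec.get(ValueLayout.JAVA_BYTE, i) & 1) == 0) {
          return false;
        }
      }
      return true;
    }
  }
}
```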
Gemini (Thinking) goes into good detail here and shows the actual Java and C sources. Oh the power and terror of abstraction...
Is this issue about sharing the precached/isLoaded state (MemorySegmentIndexInput.consecutivePrefetchHitCount) from the original IndexInput to sliced/cloned ones? Slicing because of compound files? Cloning because all the iterators that queries make will start by cloning the MemorySegmentIndexInput, and that clone impl makes a new MemorySegmentIndexInput and resets that clone's consecutivePrefetchHitCount to 0. So every query begins again assuming nothing is prefetched yet, especially penalizing workloads that are mostly super fast queries, I guess.
Anyway, I don't have any good ideas on what to do about this. Maybe ideally there would be some simple way for the user to turn on/off the prefetch optos? Or if we could reduce the cost/impact of prefetch when things are already (nearly) hot? Not sure ...