asynchronous I/O + saturating NVMe bandwidth
Description
The intersection of fast NVMe SSDs, which offer high I/O concurrency, and io_uring is building momentum toward asynchronous request processing. PostgreSQL, for example, is the latest system to consider io_uring support.
Our in-house analysis shows that io_uring provides better throughput on NVMe SSDs than FileChannel#read/write and memory-mapped I/O (FileChannel#map). Others have reached the same conclusion.
io_uring is a Linux kernel asynchronous I/O interface that allows applications to queue multiple I/O operations into a ring buffer and submit them all with a single syscall, enabling improved batched I/O over AIO and user-level thread-based approaches. What does this development mean for Lucene?
Talking to Mike recently, it does seem there is room to improve query throughput in Lucene when the index cannot fit in available DRAM. DRAM is a limited resource, and datasets grow faster than DRAM scaling improves. In real-time cases especially, limited DRAM is under pressure from new index updates (the memtable), the query cache, the page cache, and the Java heap (including garbage). Hence, the SSD serves as a DRAM extension, and optimizing SSD throughput is critical for important use cases out there.
Taking full advantage of NVMe/io_uring requires submitting large I/O batches. These batches can come from a single request, but that limits I/O concurrency to queries with popular terms. Alternatively, they can come from multiple requests: gather read requests for index pages from multiple queries, submit them as one batch, and start processing each query as its I/Os complete, in any order. (I am avoiding the details of corner cases and ordering constraints for now.)
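To make the cross-request batching concrete, here is a minimal Java sketch of the pattern (not Lucene code). It uses the JDK's AsynchronousFileChannel purely as a stand-in, since the JDK has no io_uring binding; a real io_uring backend would place the whole batch in the submission ring and issue a single syscall. PageRequest and BatchedReader are hypothetical names.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousFileChannel;
import java.nio.channels.CompletionHandler;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

// One outstanding page read belonging to some in-flight query.
record PageRequest(long offset, int length, Runnable onComplete) {}

public class BatchedReader {
  private final AsynchronousFileChannel channel;

  public BatchedReader(Path indexFile) throws IOException {
    channel = AsynchronousFileChannel.open(indexFile, StandardOpenOption.READ);
  }

  // Gather page reads from many queries and submit them together; each query
  // resumes as its own pages arrive, in whatever order the device completes them.
  public void submitBatch(List<PageRequest> batch) {
    for (PageRequest req : batch) {
      ByteBuffer dst = ByteBuffer.allocate(req.length());
      channel.read(dst, req.offset(), req,
          new CompletionHandler<Integer, PageRequest>() {
            @Override public void completed(Integer bytesRead, PageRequest r) {
              r.onComplete().run(); // hand the page back to the owning query
            }
            @Override public void failed(Throwable exc, PageRequest r) {
              // a real implementation would fail the owning query here
            }
          });
    }
  }
}
```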
In any case, quoting Mike, “Lucene's cross-segment concurrency ("thread per slice" within a single query) helps some with I/O concurrency, but that's still at the whim of the index geometry, which is horribly leaky abstraction e.g., an "optimized" index (single segment) loses all concurrency!”
I believe taking full advantage of asynchronous I/O requires a shift to a new posting format.
So, I suggest a radical shift in the underlying posting format. Give up the log-structured merge (LSM) approach! Give up segments. LSM aims for sequential I/O, which is no longer relevant for NVMe SSDs. Have one monolithic index. Each term initially gets a page for its postings and related data. If the term is popular, progressively assign it chunks of multiple pages. Pages/chunks are chained using forward pointers, stored in an on-heap (or in-memory) data structure similar to skip offsets. (Embrace random SSD I/Os.) A single request can be broken down into many random pages on the SSD, and all of those pages go into one batch for io_uring_submit(). On the write path, similarly, gather all updates in a submit buffer and issue hundreds of random I/Os.
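For concreteness, here is a hypothetical sketch of that on-heap forward-pointer structure; all names and the chunk-growth heuristic are invented for illustration, not a worked-out design. The key property is that once the chain lives on heap, the page reads for a term do not depend on each other and can be submitted as a single batch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: each term owns a chain of SSD pages; the forward
// pointers live on heap (like skip offsets), so a reader can collect all
// page offsets up front and submit the random reads as one batch.
public class PageChainedPostings {
  static final int PAGE_SIZE = 4096;

  // term -> on-heap chain of chunk offsets in the index file
  private final Map<String, List<Long>> chains = new HashMap<>();
  private long nextFreePage = 0;

  // Popular terms progressively get larger chunks (1, 2, 4, ... pages);
  // this doubling heuristic is made up for illustration.
  public long appendChunk(String term) {
    List<Long> chain = chains.computeIfAbsent(term, t -> new ArrayList<>());
    int chunkPages = 1 << Math.min(chain.size(), 6);
    long offset = nextFreePage * PAGE_SIZE;
    nextFreePage += chunkPages;
    chain.add(offset);
    return offset;
  }

  // All chunk offsets for a term: no read depends on a previous read, so
  // these random I/Os can all go into a single io_uring submission.
  public List<Long> chunkOffsets(String term) {
    return chains.getOrDefault(term, List.of());
  }
}
```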
Terms are no longer sorted, so range queries may suffer. Still, the on-heap FST provides a sorted view of terms; they are just not sorted on storage. On the flip side, no CPU is wasted on sorting and merging, which is a reasonable tradeoff given the rapidly shifting underlying hardware. The saved CPU can serve more queries in real-time search (an important use case for social media platforms).
There are many challenges, indeed. What about (crash) consistency? Segments make it somewhat easier to handle failures gracefully. With some effort, one can maintain logs containing enough information to recover the index after a failure. There are fewer index-management operations like merging, which helps offset the overhead of maintaining these additional logs.
What about real-time search? Currently, the near real-time API requires substantial effort to reopen the index using DirectoryReader.openIfChanged(). My analysis shows its overhead is prohibitive for reasonably sized indexes. The new posting format would require less effort to expose the most recent in-memory changes to index searchers (omitting details).
Is there a better way to utilize io_uring and fully leverage NVMe's potential without radical changes to the index format? For example, has anyone tried the new prefetch API coupled with many searcher threads as a way to saturate NVMe? Even then, I believe it does not increase I/O concurrency on the write path. Large I/Os on the write path help, but large I/Os alone do not fully exploit the internal parallelism of modern SSDs.
I just want to start a discussion on this subject after some initial exchanges with Mike. Thoughts?
So, I suggest a radical shift in the underlying posting format. Give up the log-structured merge (LSM) approach! Give up segments. […]
Terms are no longer sorted
You're free to write your own separate search engine, but this doesn't sound like lucene to me.
Lucene can support various I/O engines: mmap, io_uring, pmem, CXL, etc. This way, as storage technology improves, one can reuse all the latest developments in Lucene (e.g., high-dimensionality vectors, faceting) evolved over decades while best exploiting new storage technology.
The question is how much (and what) effort is required for this decoupling to succeed?
But perhaps the required changes are too radical, and a separate effort may be necessary. I guess I just wanted to gather opinions on asynchronous I/O and whether the current architecture fully exploits the potential of NVMe SSDs.
Give up segments.
To expand on Robert's point: segments are so core to Lucene that removing segments from Lucene is going to be a full rewrite anyway. It sounds easier to start a new project with your idea instead of doing a complete rewrite of Lucene, and then see if it can replace Lucene.
I agree with your point that Lucene cannot easily use io_uring as-is, because there are lots of dependencies between reads (you need to first read X somewhere to know where to look for Y next, etc.), which is incompatible with high parallelism.
Maybe start with a more bite-sized usage of async IO in Lucene?
E.g. there have been discussions about approximate KNN algorithms that use a fast, highly quantized first pass, relying on smallish data structures that could be RAM-hot, and a second rescoring phase that needs random access to the original high-precision vectors, which could safely remain RAM-cold. I would think saturating I/O to a PCIe or NVMe SSD during the rescoring phase would be a nice speedup over the naive "read each vector sequentially" approach?
If you have 1000 coarse top vectors to rescore, you can scatter/gather those 1000 requests (async IO), and as the SSD delivers the bytes (in whatever order) you do the per-vector completion (score it (e.g. dot product) and insert into priority queue or so).
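That loop is easy to sketch. Below is a self-contained, heavily simplified Java version, again using AsynchronousFileChannel as a stand-in async engine; the file layout (fixed-size big-endian float32 vectors addressed by docid) and all names are assumptions for illustration. A real version would track docids alongside scores, handle short reads, and use a proper I/O backend.

```java
import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousFileChannel;
import java.nio.channels.CompletionHandler;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.PriorityQueue;
import java.util.concurrent.CountDownLatch;

public class AsyncRescorer {
  // Rescore coarse candidates: scatter reads for all full-precision vectors,
  // score each as its bytes arrive (in any order), keep the k best in a heap.
  public static float[] topScores(Path vectorFile, int dim, int[] candidates,
                                  float[] query, int k) throws Exception {
    int bytesPerVec = dim * Float.BYTES;
    PriorityQueue<Float> top = new PriorityQueue<>(); // min-heap of best scores
    CountDownLatch done = new CountDownLatch(candidates.length);
    try (AsynchronousFileChannel ch =
             AsynchronousFileChannel.open(vectorFile, StandardOpenOption.READ)) {
      for (int docId : candidates) {
        ByteBuffer buf = ByteBuffer.allocate(bytesPerVec);
        // assumes vectors stored as big-endian float32, addressed by docId
        ch.read(buf, (long) docId * bytesPerVec, buf,
            new CompletionHandler<Integer, ByteBuffer>() {
              @Override public void completed(Integer n, ByteBuffer b) {
                b.flip();
                float score = 0f;
                for (float q : query) score += q * b.getFloat(); // dot product
                synchronized (top) { // completions arrive on pool threads
                  top.offer(score);
                  if (top.size() > k) top.poll();
                }
                done.countDown();
              }
              @Override public void failed(Throwable t, ByteBuffer b) {
                done.countDown();
              }
            });
      }
      done.await(); // wait for every scatter/gather completion
    }
    float[] best = new float[top.size()];
    for (int i = best.length - 1; i >= 0; i--) best[i] = top.poll();
    return best; // descending scores; real code would carry docids too
  }
}
```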
Oh yeah the async IO for KNN rescoring was first (that I saw!) mentioned in this issue.
I don't understand why we would need to move away from segments, to tinker more with async IO? Whatever we could do with a single segment, we could also do with the many segments in a single index? Lucene gets all sorts of nice properties thanks to its write-once segments, e.g. transactional semantics, efficient near-real-time segment replication, etc.
We also have made first steps at virtually splitting a single large segment into N slices (intervals?) so that each of those slices can proceed concurrently, putting more parallelism into the underlying IO requests... but there is still work to do on that for queries that have a high per-segment init cost (e.g. PointRangeQuery, which visits the KD tree up front to build a bitset of all matching docids for each range).
Each term initially gets a page for its posting and related data. If the term is popular, progressively assign it chunks of multiple pages. Pages/chunks are chained using forward pointers. These pointers will be stored in an on-heap (or in-memory) data structure similar to skip offsets
This is actually quite similar to how Lucene stores newly created postings in heap, before writing to an on-disk segment. It's not whole page allocation, but smaller slices of shared byte[]s where a term gets bigger and bigger slices as it gets more and more postings (because it's a popular term). But then to write the segment on disk, that whole term's postings are written contiguously.
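For readers who haven't seen that code, here is a toy illustration of the growing-slice idea. The sizes and names are invented; the real logic lives in Lucene's ByteBlockPool and friends, which also handle the forward-pointer rewriting that the comment below only alludes to.

```java
// Toy illustration of growing slices in a shared buffer: a term's postings
// live in chained slices that get larger as the term accumulates postings,
// and the tail of each full slice is overwritten with a forward pointer to
// the next one. Slice sizes here are made up.
public class SlicePool {
  private byte[] buffer = new byte[1 << 20];
  private int used = 0;
  // Each successive slice for a term is larger, up to a cap.
  private static final int[] SLICE_SIZES = {16, 32, 64, 128, 256};

  // Allocate the next slice for a term and return its start offset.
  public int allocSlice(int level) {
    int size = SLICE_SIZES[Math.min(level, SLICE_SIZES.length - 1)];
    if (used + size > buffer.length) {
      buffer = java.util.Arrays.copyOf(buffer, buffer.length * 2);
    }
    int start = used;
    used += size;
    return start;
  }
}
```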
The other problem with io_uring is that it doesn't work in many places in practice. For example, it's still blocked by Docker's default seccomp-bpf profile, and it is still responsible for a huge number of security issues?
https://github.com/apache/lucene/issues/12615#issuecomment-1888419082
I think if you want to try to use it, it would be challenging enough on its own. It probably can't be the default even on Linux, due to the issues around it. But maybe it could be provided in an opt-in way; there just needs to be a fallback for the environments where it can't work.
Meanwhile, there is easier, lower-hanging fruit such as https://lwn.net/Articles/998783/ (I'm on Linux 6.14, but Lucene can't use this feature; the only option is the slow direct-I/O stuff for merges).
I like the idea of trying out async IO (io_uring-based?) for KNN vector search. Mike's idea of rescoring the top-N low-precision quantized vector search results has a lot of natural parallelism. Maybe we could try it with the Better Binary Quantizer that was recently added?
See also this PR for reference: https://github.com/apache/lucene/pull/13586, where @benwtrent explores better parallelizing I/O when fetching vectors for the visited nodes of the HNSW graph. It stalled because it made things a bit slower when data fits in RAM, which suggests we need to look into reducing the cost of IndexInput#prefetch for data that already fits in RAM.
I don't understand why we would need to move away from segments, to tinker more with async IO? […]
It seems I was over-thinking the changes required to support async IO. I was assuming async IO would saturate SSDs and make the CPU the bottleneck; hence one monolithic index, to avoid merging (and other segment-related operations) that consume CPU, leaving all of the CPU for processing queries, since I/O would be so fast that the CPU becomes the new bottleneck in the cold case.
But segments do seem to offer nice properties beyond sequential I/O when writing the index from memory to disk. So keeping segments is a good idea.
There is indeed no reason we cannot have async IO while keeping segments. There are some challenges; for example, from the last time I dug into the guts of Lucene, scoring (including priority-queue operations) is intertwined with disk reads. But those can be overcome.
Lucene has some minimal support for asynchronous I/O via IndexInput#prefetch, which indicates that the given byte range is going to be needed in the near future. It's used to parallelize I/O when fetching stored fields, or to parallelize I/O in the terms dictionary when evaluating a query against multiple terms. Step 0 could be to implement this API using io_uring and compare performance characteristics with the current implementation, which is based on madvise with the MADV_WILLNEED flag.
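For reference, this is the access pattern such a comparison would exercise: announce all byte ranges first, then read. A minimal sketch, assuming a Lucene version that has IndexInput#prefetch; only the helper class is hypothetical.

```java
import java.io.IOException;
import org.apache.lucene.store.IndexInput;

// The prefetch-then-read pattern described above: announce every byte range
// first so the Directory implementation can parallelize the I/O (today via
// madvise/MADV_WILLNEED; hypothetically via an io_uring-backed IndexInput),
// then perform the actual reads, which should mostly hit resident pages.
public class PrefetchPattern {
  static void readRanges(IndexInput in, long[] offsets, int[] lengths)
      throws IOException {
    for (int i = 0; i < offsets.length; i++) {
      in.prefetch(offsets[i], lengths[i]); // non-blocking hint: needed soon
    }
    for (int i = 0; i < offsets.length; i++) {
      byte[] buf = new byte[lengths[i]];
      in.seek(offsets[i]);
      in.readBytes(buf, 0, buf.length); // ideally no page-cache miss by now
    }
  }
}
```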
Yes, that's a good start. So, this will require us to move away from mmap and use io_uring via FFI.