quickwit
More efficient on multi-indexes.
Currently, each index pipeline spawns a lot of threads: blocking actors are each executed in their own single-threaded tokio runtime.
In addition, tantivy's IndexWriter allocates a large buffer to build its index. By running several pipelines in parallel, we end up with an overall memory footprint proportional to the number of indexes (at least as a first approximation).
Finally, the number of file descriptors is also an issue.
For multi-indexes, we want to keep the individual index pipelines, as they proved to be very simple to reason about, but we want to rework the orchestration to remove all of the negative side effects above.
Packager & Merger are special. They have to run in a
The solution will require:
Allowing actors to run on a specific tokio runtime
The difference between blocking and async actors will be even smaller than it was before. This should make it possible to simplify the actor framework code even further. For instance, we will no longer need a JoinHandle.
The ingest queues actor is blocking (as it relies on RocksDB). We need to have it run on its own single-threaded runtime.
Indexer scheduling
While the packager and the merger should naturally yield between messages, the indexer is different.
We want the indexer to yield once the IndexWriter is sent to the packager. Also, we do not want to always wait for the scheduled commit. If there are no more messages associated with an index, we want to stop indexing.
A robust and simple scheduling logic could be as follows:
- Start indexing and record the time `t`.
- The indexer should NOT yield between messages.
- If no more messages are available or if `COMMIT_TIMEOUT` has passed, close the split, send it to the packager, yield, and "go to sleep".
- Attempt to resume indexing at `t + COMMIT_TIMEOUT`.
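
As a rough illustration of the scheduling logic above, here is a minimal sketch using only the Rust standard library. It is not the actual quickwit code: `run_indexing_pass`, `IndexingPass` and `StopReason` are hypothetical names, a `std::sync::mpsc` channel stands in for the indexer's mailbox, `String` stands in for a document, and counting documents stands in for adding them to the IndexWriter.

```rust
use std::sync::mpsc::{Receiver, TryRecvError};
use std::time::{Duration, Instant};

/// Why the indexer stopped working on the current split (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
enum StopReason {
    /// No more messages were available for this index.
    QueueDrained,
    /// COMMIT_TIMEOUT elapsed since indexing started.
    CommitTimeout,
}

/// Outcome of one indexing pass (hypothetical type).
#[derive(Debug)]
struct IndexingPass {
    /// Number of documents consumed during this pass.
    num_docs: usize,
    stop_reason: StopReason,
    /// Earliest instant at which indexing should be attempted again:
    /// t + COMMIT_TIMEOUT.
    resume_at: Instant,
}

/// Runs one pass: consume messages back to back, without yielding, until the
/// queue is empty or the commit timeout has elapsed.
fn run_indexing_pass(docs: &Receiver<String>, commit_timeout: Duration) -> IndexingPass {
    // Record the time t at which indexing starts.
    let start = Instant::now();
    let mut num_docs = 0;
    let stop_reason = loop {
        if start.elapsed() >= commit_timeout {
            break StopReason::CommitTimeout;
        }
        match docs.try_recv() {
            // Stand-in for adding the document to the IndexWriter.
            Ok(_doc) => num_docs += 1,
            // An empty or closed queue both mean there is nothing left to index.
            Err(TryRecvError::Empty) | Err(TryRecvError::Disconnected) => {
                break StopReason::QueueDrained;
            }
        }
    };
    // At this point the real indexer would close the split, send it to the
    // packager, yield, and sleep until `resume_at`.
    IndexingPass {
        num_docs,
        stop_reason,
        resume_at: start + commit_timeout,
    }
}
```

The caller would be responsible for sleeping until `resume_at` before starting the next pass, which is what gives other indexes a chance to run on the same runtime.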
- [ ] Make the tokio runtime that actors run on configurable.
- [ ] Change the scheduling logic of indexers