More efficient on multi-indexes.

Open fulmicoton opened this issue 2 years ago • 0 comments

Currently, each index pipeline spawns a lot of threads: every blocking actor is executed in its own single-threaded tokio runtime.

In addition, tantivy's IndexWriter allocates a large buffer to build its index. By running several pipelines in parallel, we end up with an overall memory footprint proportional to the number of indexes (at least as a first approximation).

Finally, the number of file descriptors is also an issue.

For multi-indexes, we want to keep the individual index pipelines, as they have proved to be very simple to reason about, but we want to change the orchestration to remove all of the negative side effects above.

Packager & Merger are special. They have to run in a

The solution will require:

Allowing actors to run on a specific tokio runtime

The difference between blocking and async actors will be even smaller than it was before. This should make it possible to simplify the actor framework code even further. For instance, we will no longer need a JoinHandle.

The ingest queues actor is blocking (as it relies on rocksdb). It needs to run on its own single-threaded runtime.
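As a rough illustration of why assigning actors to shared runtimes helps, here is a minimal sketch that only models the assignment and the resulting thread count. It does not build real tokio runtimes. `RuntimeKind`, `ActorKind`, `runtime_for` and the thread-count policy are all hypothetical names and assumptions, not existing quickwit code, and treating the indexer, packager and merger as blocking actors is an assumption as well.

```rust
use std::collections::HashSet;

/// Hypothetical runtime kinds (illustrative names, not from the codebase).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum RuntimeKind {
    /// Shared multi-threaded runtime for async actors.
    NonBlocking,
    /// Shared runtime for blocking, CPU-heavy actors.
    Blocking,
    /// Dedicated single-threaded runtime for the ingest queues actor.
    IngestQueues,
}

impl RuntimeKind {
    /// Number of worker threads backing this runtime (assumed policy).
    fn num_threads(self, num_cpus: usize) -> usize {
        match self {
            RuntimeKind::NonBlocking | RuntimeKind::Blocking => num_cpus,
            RuntimeKind::IngestQueues => 1,
        }
    }
}

/// Actors mentioned in this issue, plus a catch-all for async actors.
#[derive(Debug, Clone, Copy)]
enum ActorKind {
    Indexer,
    Packager,
    Merger,
    IngestQueues,
    OtherAsync,
}

/// Picks the runtime an actor should be spawned on.
fn runtime_for(actor: ActorKind) -> RuntimeKind {
    match actor {
        // Assumption: these three are the blocking actors of a pipeline.
        ActorKind::Indexer | ActorKind::Packager | ActorKind::Merger => RuntimeKind::Blocking,
        ActorKind::IngestQueues => RuntimeKind::IngestQueues,
        ActorKind::OtherAsync => RuntimeKind::NonBlocking,
    }
}

/// Thread count when every blocking actor owns a single-threaded runtime:
/// it grows linearly with the number of pipelines.
fn threads_one_runtime_per_blocking_actor(
    num_pipelines: usize,
    blocking_actors_per_pipeline: usize,
) -> usize {
    num_pipelines * blocking_actors_per_pipeline
}

/// Thread count when actors share runtimes: it depends only on which
/// runtimes are in use, not on how many actors are spawned.
fn threads_with_shared_runtimes(actors: &[ActorKind], num_cpus: usize) -> usize {
    let runtimes: HashSet<RuntimeKind> = actors.iter().map(|&a| runtime_for(a)).collect();
    runtimes.into_iter().map(|rt| rt.num_threads(num_cpus)).sum()
}
```

With this model, adding pipelines adds actors but no threads, whereas the per-actor-runtime count keeps growing with the number of indexes.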

Indexer scheduling

While the packager and the merger should naturally yield between messages, the indexer is different.

We want the indexer to yield once the IndexWriter is sent to the packager. Also, we do not want to always wait for the scheduled commit. If there are no more messages associated with an index, we want to stop indexing.

A robust and simple scheduling logic could be as follows:

Start indexing and record the time t.
The indexer should NOT yield in between messages.
If no more messages are available or if COMMIT_TIMEOUT has elapsed, close the split and send it to the packager, yield and "go to sleep".
Attempt to resume indexing at t + COMMIT_TIMEOUT.
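The steps above can be sketched as a single "indexing pass" over the indexer's inbox. This is only a simulation under stated assumptions: time is a logical clock rather than a real one, the `COMMIT_TIMEOUT` value is made up (the issue does not give one), and `run_indexing_pass`, `PassOutcome` and `CloseReason` are hypothetical names.

```rust
use std::collections::VecDeque;
use std::time::Duration;

/// Hypothetical value; the issue names COMMIT_TIMEOUT but gives no number.
const COMMIT_TIMEOUT: Duration = Duration::from_secs(60);

/// Why the indexer closed its current split.
#[derive(Debug, PartialEq, Eq)]
enum CloseReason {
    NoMoreMessages,
    CommitTimeout,
}

/// Result of one indexing pass.
#[derive(Debug, PartialEq, Eq)]
struct PassOutcome {
    num_docs_indexed: usize,
    reason: CloseReason,
    /// Time at which the indexer should attempt to resume: t + COMMIT_TIMEOUT.
    resume_at: Duration,
}

/// Runs one pass starting at logical time `start` (the `t` of the steps above).
/// `cost_per_doc` is the simulated time spent indexing one message.
fn run_indexing_pass(
    inbox: &mut VecDeque<String>,
    start: Duration,
    cost_per_doc: Duration,
) -> PassOutcome {
    let deadline = start + COMMIT_TIMEOUT;
    let mut now = start;
    let mut num_docs_indexed = 0;
    // The loop never yields between messages: it only stops on timeout
    // or when the inbox is drained.
    let reason = loop {
        if now >= deadline {
            break CloseReason::CommitTimeout;
        }
        match inbox.pop_front() {
            Some(_doc) => {
                num_docs_indexed += 1;
                now += cost_per_doc;
            }
            None => break CloseReason::NoMoreMessages,
        }
    };
    // At this point the real indexer would close the split, send it to the
    // packager, yield, and sleep until `resume_at`.
    PassOutcome {
        num_docs_indexed,
        reason,
        resume_at: deadline,
    }
}
```

An idle index gives up its slot as soon as its inbox is empty, while a busy one is still forced to yield after `COMMIT_TIMEOUT`, which is what lets many indexers share a small number of threads.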

[ ] Make the tokio runtime that the actors run on configurable.
[ ] Change the scheduling logic of indexers

May 24 '22 12:05 fulmicoton
