Merge siloes and filtering
This ticket is actually about two similar problems:
Merge filtering
Right now, a MergePlanner is instantiated for each indexing pipeline. It is initialized with the list of existing splits as given by the metastore and updates itself with newly created splits.
If several pipelines targeting the same index exist, whether on the same node or spread across several nodes, their merge planners will spawn conflicting merge operations. These conflicts end up as failures on publish, which respawn the entire indexing pipeline.
In the future, the plan is to solve this problem by centralizing merge planning on the node responsible for the control plane.
However, we will probably need a workaround to allow some clients to distribute their indexing using #1794.
Merge siloes
We will soon introduce index partitioning, a feature in which indexers produce many splits that attempt to isolate tags. Merging splits from different tags would defeat the purpose.
We need a way to tell the merge policy to consider the different partitions independently.
Solution
We want to extract the merge pipeline out of the regular indexing pipeline. After the extraction, we should have:
- one merge planner per index.
- one supervised merge pipeline per index (downloader, executor, packager, uploader, publisher)
- one indexing pipeline per source
If there are several sources on the same node indexing into the same index, their respective publishers will route published splits to the same merge pipeline (the one associated with the index).
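As a sketch of this routing, assuming a hypothetical per-index registry (Quickwit's actual actor mailboxes differ, this only illustrates routing by index id):

```rust
use std::collections::HashMap;
use std::sync::mpsc::Sender;

// Hypothetical message sent by a publisher after a successful publish.
struct NewSplits {
    index_id: String,
    split_ids: Vec<String>,
}

struct MergePlannerRegistry {
    // One merge planner per index, shared by all sources on the node.
    planners: HashMap<String, Sender<NewSplits>>,
}

impl MergePlannerRegistry {
    // Every publisher indexing into the same index ends up sending its
    // published splits to the same merge planner.
    fn route_published_splits(&self, msg: NewSplits) {
        if let Some(planner) = self.planners.get(&msg.index_id) {
            let _ = planner.send(msg);
        }
    }
}
```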
We can assume that the merge planner is simple enough to never die and does not require supervision.
The rest of the pipeline, however, is tricky. The simplest approach is probably to have a supervised pipeline containing the merge executor, packager, uploader, and publisher (no sequencer is needed).
We need a mechanism for the merge planner to learn about failed merges. The merge pipeline supervisor could, for instance, send the merge planner a message to trigger a reload of the information from the SQL database (or we could have some drop guard travel with the splits: the drop guard would lock/unlock them).
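Here is a minimal sketch of the drop-guard variant, with hypothetical types (not Quickwit's actual implementation): each split involved in an in-flight merge holds a guard, and dropping the guard, including when the merge pipeline panics or is killed, makes the split mergeable again.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};

struct SplitLockRegistry {
    locked_split_ids: Arc<Mutex<HashSet<String>>>,
}

struct SplitLockGuard {
    split_id: String,
    locked_split_ids: Arc<Mutex<HashSet<String>>>,
}

impl SplitLockRegistry {
    // Lock a split for the duration of a merge operation.
    fn lock(&self, split_id: String) -> SplitLockGuard {
        self.locked_split_ids.lock().unwrap().insert(split_id.clone());
        SplitLockGuard {
            split_id,
            locked_split_ids: Arc::clone(&self.locked_split_ids),
        }
    }

    // The merge planner skips splits that are currently locked.
    fn is_locked(&self, split_id: &str) -> bool {
        self.locked_split_ids.lock().unwrap().contains(split_id)
    }
}

impl Drop for SplitLockGuard {
    // Unlock automatically, even if the merge pipeline dies mid-merge.
    fn drop(&mut self) {
        self.locked_split_ids.lock().unwrap().remove(&self.split_id);
    }
}
```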
Once this is done, we only need to add a node_id to SplitMetadata and have the merge planner only consider splits with the right node_id (a source_id is not necessary).
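A minimal sketch of that filter, assuming a hypothetical node_id field on the split metadata: each merge planner only considers splits created by its own node, so planners on different nodes cannot plan conflicting merges over the same splits.

```rust
// Hypothetical, simplified view of a split's metadata with the new field.
struct SplitMetadata {
    split_id: String,
    node_id: String,
}

// Keep only the splits created by this node.
fn splits_for_this_node<'a>(
    splits: &'a [SplitMetadata],
    this_node_id: &str,
) -> Vec<&'a SplitMetadata> {
    splits
        .iter()
        .filter(|split| split.node_id == this_node_id)
        .collect()
}
```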
Finally, we want the merge planner to silo the splits per partition_id (as added in #1821) and only consider merging splits with the same partition_id.
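To make the siloing concrete, here is a minimal sketch (hypothetical types, not Quickwit's actual merge policy API): splits are grouped by partition_id before the merge policy runs, so the policy only ever sees one silo at a time and can never propose a cross-partition merge.

```rust
use std::collections::HashMap;

// Hypothetical, simplified view of a split's metadata.
struct SplitMetadata {
    split_id: String,
    partition_id: u64,
}

// Run the merge policy independently on each partition's silo and
// collect the resulting merge operations (lists of split ids).
fn plan_merges_per_partition(
    splits: Vec<SplitMetadata>,
    merge_policy: &dyn Fn(&[SplitMetadata]) -> Vec<Vec<String>>,
) -> Vec<Vec<String>> {
    let mut silos: HashMap<u64, Vec<SplitMetadata>> = HashMap::new();
    for split in splits {
        silos.entry(split.partition_id).or_default().push(split);
    }
    silos
        .values()
        .flat_map(|silo| merge_policy(silo))
        .collect()
}
```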
Assigning to @guilload as I'm going on holiday and this might be blocking.
Merge siloes and filtering is done in #1865.
But we can track the other parts in new issues:
- The simplest approach is probably to have a supervised pipeline containing the merge executor, packager, uploader, and publisher (no sequencer is needed).
- We need a mechanism for the merge planner to learn about failed merges. The merge pipeline supervisor could, for instance, send the merge planner a message to trigger a reload of the information from the SQL database (or we could have some drop guard travel with the splits: the drop guard would lock/unlock them).