quickwit icon indicating copy to clipboard operation
quickwit copied to clipboard

Multiple source for a single index story is broken

Open fulmicoton opened this issue 3 years ago • 0 comments

Currently, we do allow having several sources for a given index. This will come handy for distribution purpose of course, but this is also a necessity in 0.3 as we are adding support for the PushAPI.

Today we make this possible by starting as many complete indexing pipeline as there are sources. This creates problems on the merge side.

Let's imagine a fresh new index with two sources A and B, and their associated indexing pipeline. As we index our first few splits, each MergePlanner (A and B) will only know about splits that were created by the source associated from the pipeline they belong to.

If pipeline A and its MergePlanner is restarted however (after a restart of quickwit or after a fault), the MergePlanner will fetch the list of published split from the metastore at initialization, and will start initiating merges involving split from A and B, conflicting with the work done on Pipeline B.

We have two possible approach here. a) Either we accept merge between sources b) either we don't.

a) intuitively seems more efficient. If we have 10 sources dealing with 10 different kafka partitions, keeping the merge independent means:

  • hurting the number of small segments
  • hurting time-pruning. Unfortunately it hides a lot of complexity. It would suggest separating the indexing pipeline and the merging pipeline. The merging pipeline would be per-index while the indexing pipeline would be per-source. In the distributed case, the merging pipeline would ideally have to be cache-aware. There is not much benefit merging split cross-source as soon as the splits are available. It is more efficient to just wait for enough splits to be locally in cache and merging those. The last level merge on the other, which is not in cache anyway can then benefit greatly to be cross-source.

We will also likely encounter use cases where tag selectivity will be done upstream... Users will route documents to kafka partition in a non-random manner in order to improve tag selectivity. In that case, merging cross shard arguably hurts tag selectivity ( arguably because this is done to the profit of time pruning)

b) is on the other hand simpler. It can be done by adding a source id to the split meta, and only selecting the right splits in the mergeplanner initialization.

I suggest we stick for solution b) for the moment. It might seem suboptimal but it has the merit to be simple and to act just like regular "sharding".

fulmicoton avatar Apr 07 '22 01:04 fulmicoton