vespa icon indicating copy to clipboard operation
vespa copied to clipboard

Add a MultiRangeItem for efficient evaluation of multiple OR-ed range items on a common field (or two)

Open jonmv opened this issue 3 years ago • 8 comments

I'm always frustrated when it takes too long to evaluate an OrItem with lots of RangeItems inside.

This is typically the case when using the result of one query to prune the result set for a second query, to be intersected with the first. When the first query has 100k hits, each resulting in a range to be added o the second query, this becomes expensive:

  1. The resulting query tree has hundreds of thousands of nodes, and serialisation and processing of this in backend is slow.
  2. Doing (on average half of) 100k ORs in sequence, when evaluating the query in the backend, is very costly.

I would like a new MultiRangeItem to be added to the Java search API. This item would restrict the contained ranges to all be against the same field, or possibly a pair of fields (start in one field, end in another, when documents contain intervals, not just points). This, again, would allow for a much more compact representation of the query, and would allow the backend to process it more efficiently. This addresses the first item above.

Additionally, an implementation of this could prune overlapping interval, as well as shape the sorted set of ranges into a search tree; this allows query evaluation in the backend to evaluate only the relevant branches and ranges for each document, in contrast to when the ranges are arranged in a flat, long list.

Finally, it would also allow dedicated code to run in the backend, if we add this at a later time.

jonmv avatar Sep 21 '21 08:09 jonmv

I think if we do multirange query item we should have a range field type which specifies start and end? Maybe you could list a few concrete examples @jonmv ?

jobergum avatar Sep 21 '21 08:09 jobergum

boolean closed = true;
new MultiRangeItem("my_range", closed).addRange(3, 5).addRange(4, 8)... // range type in document, if we add this
new MultiRangeItem("my_start", "my_end", ! closed).addRange(1, 2)... // pair of fields is a range in each document
new MultiRangeItem("my_point", closed).addRange(8, 9)... // single field in each document

new MultiPointItem("my_start", "my_end").addPoint(3).addPoint(4)... // range in document, point in query
new MultiPointItem("my_point").addPoint(3)... // alternative to WeightedSetItem

jonmv avatar Sep 21 '21 08:09 jonmv

To summarise today's discussion:

  1. We would like a dedicated Java API, like the one suggested here, which allows more assumptions on the query term:
  2. we can collapse overlapping intervals;
  3. we can choose a more compact representation; and
  4. we will also be able to build a reverse index in the backend, at search time, for lookup of document terms, making this an efficient filter term.

These are separate tasks that should be addressed in this order.

NB: The fact that the query tree built in Java in experiments actually is more efficient than a flat OR list is at the mercy of the query evaluation strategy gods, who, in turn, rely on std::sort.

jonmv avatar Sep 21 '21 13:09 jonmv

Pondered a bit. The reverse index, in its simplest form, can just be two sorted arrays A, B, containing the range endpoints. Looking up whether a document range (x, y) intersects any range in (A, B) can be done by doing a binary search for a in B and for b in A, and comparing the indices. Specifically, find the smallest j s.t. a ≤ B[j] and the largest i s.t. b ≥ A[i], and it's a match iff i ≥ j. Use strict inequalities for open ranges.

A = [1, 5, 9]
B = [3, 7, 11]
// closed ranges
a, b = 6, 6 ==> i = 1, j = 1 ==> match // (a, b) contained in query range
a, b = 4, 4 ==> i = 0, j = 1 ==> miss  // no intersection
a, b = 4, 8 ==> i = 1, j = 1 ==> match // (a, b) contains query range
a, b = 4, 6 ==> i = 1, j = 1 ==> match // (a, b) overlaps with query range
a, b = 5, 5 ==> i = 1, j = 1 ==> match // (a, b) touches a query range
// open range
a, b = 5, 5 ==> i = 0, j = 1 ==> miss  // (a, b) touches a query range

I suppose it's more cache friendly to store A and B interleaved? This also makes early termination easy, if the two searches diverge. This is possibly the serialisation format we want as well? What do you think, @havardpe?

jonmv avatar Sep 23 '21 06:09 jonmv

MultiRangeItem is too special. We want a MultiTermItem where all term nodes are of the same type. I also think that we should add an operation type to it, OR/AND. Serialisation format should be simple and extendable. num_terms, term_type, operation_type,[terms]

baldersheim avatar Sep 23 '21 07:09 baldersheim

We need something more than that to express ranges in the query against ranges in the document in a clean way. Either a new field type, or range items that specify a pair of fields.

jonmv avatar Sep 23 '21 08:09 jonmv

Clarification: the proposed API is OK, but serialisation should be more general.

jonmv avatar Sep 23 '21 08:09 jonmv

So the serialisation format needs to hold

  1. operation type (AND, OR for now)
  2. number of terms (positive integer)
  3. term blueprint for translating each following term to something complete, e.g., type: ranges, from: <index name>, to: <index name>, start_limit: inclusive, end_limit: exclusive, and then ranges indicates a sub-routine that parses each following term as a pair of numbers—one for each given index—using these common properties.
  4. the actual list of terms, e.g., 1 2 3 4 5 6 for three ranges terms.

That is, we want more than just the term type; we should also put anything common to all contained terms up on top.

I'm not sure what other types of terms we would care about—perhaps weightedset could use this format as well? Multiple words?

jonmv avatar Mar 18 '22 14:03 jonmv