
LUCENE-10425: speed up IndexSortSortedNumericDocValuesRangeQuery#BoundedDocSetIdIterator construction using BKD binary search

Open wjp719 opened this issue 3 years ago • 11 comments

First, we add a common function that lets callers binary search the BKD point tree.

Generally, for log data indexed with an index sort in ascending order on the @timestamp field, when we want to run a count aggregation that counts documents in many time intervals, we can use binary search to find the min/max docId of each time interval, and the doc count = max docId - min docId + 1.

Also in this PR, we speed up IndexSortSortedNumericDocValuesRangeQuery#BoundedDocSetIdIterator construction when the field has a BKD index and the index is sorted on it in ascending order.
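
To make the counting idea concrete, here is a minimal self-contained sketch (plain Java over a sorted array that stands in for the sorted @timestamp values; the real code searches the BKD tree / doc values instead, and the class and method names here are purely illustrative):

public class SortedCountSketch {

  /** Returns the first index whose value is >= target, or values.length if there is none. */
  static int firstAtLeast(long[] values, long target) {
    int lo = 0, hi = values.length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (values[mid] < target) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    return lo;
  }

  /** Number of docs whose timestamp falls in [lower, upper). */
  static int count(long[] timestamps, long lower, long upper) {
    int minDoc = firstAtLeast(timestamps, lower);           // first doc inside the interval
    int maxDocExclusive = firstAtLeast(timestamps, upper);  // first doc past the interval
    return maxDocExclusive - minDoc;                        // same as (max docId - min docId + 1) with inclusive bounds
  }

  public static void main(String[] args) {
    long[] timestamps = {100, 200, 200, 300, 400, 500};     // docIds are the array indices
    System.out.println(count(timestamps, 200, 400));        // prints 3 (docIds 1, 2, 3)
  }
}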

wjp719 avatar Feb 17 '22 12:02 wjp719

This looks very similar to the implementation of Weight#count on PointRangeQuery and should only perform marginally faster? It's unclear to me whether this PR buys us much.

jpountz avatar Feb 23 '22 13:02 jpountz

This looks very similar to the implementation of Weight#count on PointRangeQuery and should only perform marginally faster? It's unclear to me whether this PR buys us much.

Hi @jpountz, I refactored the code. Now, if the conditions are met, I use the BKD binary search to find the min/max docId and create the IndexSortSortedNumericDocValuesRangeQuery.BoundedDocSetIdIterator when creating the Scorer, instead of binary searching over doc values to find the min/max docId.

As we know, doc values can only advance forward, but binary search may need to walk back to read the value of the middle doc, so it may need to create a new SortedNumericDocValues instance and advance from the first doc many times, which consumes more CPU and IO.
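
For context, a rough sketch of that cost (this is not the actual Lucene code, it assumes every doc has a value, and the method names are just illustrative; it only shows why each backward probe forces a fresh iterator):

import java.io.IOException;
import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.SortedNumericDocValues;

public class DocValuesBinarySearchSketch {

  /** Reads the first value of a doc; a brand-new iterator per probe because docID may be behind the last probe. */
  private static long valueAt(LeafReader reader, String field, int docID) throws IOException {
    SortedNumericDocValues dv = DocValues.getSortedNumeric(reader, field);
    if (dv.advanceExact(docID) == false) {
      return Long.MIN_VALUE; // simplification: treat a missing value as very small
    }
    return dv.nextValue();
  }

  /** Returns the first docID whose value is >= lower, or -1 if there is none. */
  static int firstDocAtLeast(LeafReader reader, String field, long lower) throws IOException {
    int lo = 0, hi = reader.maxDoc() - 1, result = -1;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      if (valueAt(reader, field, mid) >= lower) { // every probe re-reads doc values from scratch
        result = mid;
        hi = mid - 1;
      } else {
        lo = mid + 1;
      }
    }
    return result;
  }
}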

I also added a variable allDocExist in BoundedDocSetIdIterator that records whether every doc between the min/max doc exists, so that the BoundedDocSetIdIterator#advance() method can skip the delegate.advance() call that checks whether the doc exists.
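
Roughly, the advance() path then looks like this (a sketch that follows the names used in this description; the actual code in the PR may differ in details):

import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;

final class BoundedDocSetIdIteratorSketch extends DocIdSetIterator {
  private final int firstDoc;        // first doc whose value is >= the lower bound
  private final int lastDoc;         // last doc whose value is <= the upper bound
  private final boolean allDocExist; // every doc in [firstDoc, lastDoc] has a matching value
  private final DocIdSetIterator delegate;
  private int doc = -1;

  BoundedDocSetIdIteratorSketch(int firstDoc, int lastDoc, boolean allDocExist, DocIdSetIterator delegate) {
    this.firstDoc = firstDoc;
    this.lastDoc = lastDoc;
    this.allDocExist = allDocExist;
    this.delegate = delegate;
  }

  @Override
  public int docID() {
    return doc;
  }

  @Override
  public int nextDoc() throws IOException {
    return advance(doc + 1);
  }

  @Override
  public int advance(int target) throws IOException {
    if (target < firstDoc) {
      target = firstDoc;
    }
    // When every doc in the window matches, skip the delegate.advance() existence check.
    int candidate = allDocExist ? target : delegate.advance(target);
    if (candidate > lastDoc) {
      return doc = NO_MORE_DOCS;
    }
    return doc = candidate;
  }

  @Override
  public long cost() {
    return lastDoc - firstDoc + 1;
  }
}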

Benchmark results

I also compared the performance of this PR against the main branch:

Dataset

I used two datasets: the small one is the httpLog dataset with about 200 million docs; the big one is our application log with 1.4 billion docs.

Query

The query is a boolean query with a range clause and a term clause. For the small dataset, the query is:

"query": {
    "bool": {
      "must": [
        {
          "range": {
            "@timestamp": {
             "gte": "1998-06-08T05:00:01Z",
              "lt": "1998-06-15T00:00:00Z"
            }
          }
        },
        {
          "match": {
            "status": "200"
          }
        }
      ]
    }
  }

Results

  1. With the es rally tool (it runs the query many times, so the disk data is cached), comparing performance on the httpLog small dataset:
|                                                        Metric |          Task |    Baseline |   Contender |     Diff |   Unit |
|                                                Min Throughput | 200s-in-range |    9.54473 |     10.1261 |  0.58137 |  ops/s |
|                                               Mean Throughput | 200s-in-range |    9.58063 |       10.14 |  0.55941 |  ops/s |
|                                             Median Throughput | 200s-in-range |     9.5815 |     10.1405 |  0.55902 |  ops/s |
|                                                Max Throughput | 200s-in-range |    9.61395 |     10.1524 |  0.53847 |  ops/s |
|                                       50th percentile latency | 200s-in-range |    40581.6 |     37628.2 | -2953.41 |     ms |
|                                       90th percentile latency | 200s-in-range |    43334.5 |     40271.4 | -3063.11 |     ms |
|                                       99th percentile latency | 200s-in-range |    43949.5 |     40868.3 |  -3081.2 |     ms |
|                                      100th percentile latency | 200s-in-range |    44016.2 |     40933.7 | -3082.54 |     ms |
|                                  50th percentile service time | 200s-in-range |    98.6711 |     96.5836 |  -2.0875 |     ms |
|                                  90th percentile service time | 200s-in-range |    100.634 |      97.772 | -2.86207 |     ms |
|                                  99th percentile service time | 200s-in-range |    121.701 |     99.2353 | -22.4661 |     ms |
|                                 100th percentile service time | 200s-in-range |    127.223 |     99.4463 | -27.7768 |     ms |
  2. Manually run once (with all caches cleared):
|            dataset|main branch latency|this PR latency|latency change|
|            httpLog|              267ms|          167ms|          -38%|
|our application log|             2829ms|         1093ms|          -62%|

wjp719 avatar Feb 27 '22 05:02 wjp719

I like this idea but I agree with Adrien that the API change does not look right. Moreover, I don't think we need to add it to PointValues as this is something specific to index sorting and it won't apply in all situations. I think it makes little sense when we use the kd-tree as an effective r-tree, as in range fields.

I think it is possible to add a specific implementation on IndexSortSortedNumericDocValuesRangeQuery that does not require an API change. I came up with this implementation; something close to it should work:

  /**
   * Returns the first document whose packed value is greater than or equal to the provided packed
   * value, or -1 if all packed values are smaller than the provided one.
   */
  public final int nextDoc(PointValues values, byte[] packedValue) throws IOException {
      final int numIndexDimensions = values.getNumIndexDimensions();
      final int bytesPerDim = values.getBytesPerDimension();
      final ByteArrayComparator comparator = ArrayUtil.getUnsignedComparator(bytesPerDim);
      final Predicate<byte[]> biggerThan = testPackedValue -> {
          for (int dim = 0; dim < numIndexDimensions; dim++) {
              final int offset = dim * bytesPerDim;
              if (comparator.compare(packedValue, offset, testPackedValue, offset) <= 0) {
                  return false;
              }
          }
          return true;
      };
      return nextDoc(values.getPointTree(), biggerThan);
  }
    
  private int nextDoc(PointValues.PointTree pointTree, Predicate<byte[]> biggerThan) throws IOException {
      if (biggerThan.test(pointTree.getMaxPackedValue())) {
          // every value in this node is smaller than the provided value; the doc we want is further on
          return -1;
      } else if (pointTree.moveToChild()) {
          // navigate down
          do {
              final int doc = nextDoc(pointTree, biggerThan);
              if (doc != -1) {
                  return doc;
              }
          } while (pointTree.moveToSibling());
          pointTree.moveToParent();
          return -1;
      } else {
          // doc is in this leaf
          final int[] doc = {-1};
          pointTree.visitDocValues(new IntersectVisitor() {
              @Override
              public void visit(int docID) {
                  throw new AssertionError("Invalid call to visit(docID)");
              }

              @Override
              public void visit(int docID, byte[] packedValue) {
                  if (doc[0] == -1 && biggerThan.test(packedValue) == false) {
                      doc[0] = docID;
                  }
              }

              @Override
              public Relation compare(byte[] minPackedValue, byte[] maxPackedValue) {
                  return Relation.CELL_CROSSES_QUERY;
              }
          });
          assert doc[0] != -1;
          return doc[0];
      }
  }
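
For illustration, one hypothetical way to use it to bound the iterator when the segment is sorted ascending on the field (untested; lowerValue, upperValue, values and maxDoc are assumed to be in scope, and the LongPoint encoding and the +1 on the upper bound are assumptions, not part of the helper above):

  // Find the first doc with value >= lowerValue and the first doc with value > upperValue;
  // with an ascending index sort the matching docs are then exactly [firstDoc, lastDoc].
  byte[] lowerPoint = new byte[Long.BYTES];
  byte[] upperPoint = new byte[Long.BYTES];
  LongPoint.encodeDimension(lowerValue, lowerPoint, 0);
  LongPoint.encodeDimension(upperValue + 1, upperPoint, 0); // assumes upperValue < Long.MAX_VALUE
  int firstDoc = nextDoc(values, lowerPoint);  // -1 means no doc falls inside the range
  int upperDoc = nextDoc(values, upperPoint);  // -1 means no doc lies past the range
  int lastDoc = upperDoc == -1 ? maxDoc - 1 : upperDoc - 1; // assumes every doc has a value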

iverase avatar Feb 28 '22 07:02 iverase

@iverase Hi, I moved the BKD binary search into IndexSortSortedNumericDocValuesRangeQuery as you suggested; please help review it, thanks.

wjp719 avatar Feb 28 '22 13:02 wjp719

@iverase I added a random test, please review it again.

wjp719 avatar Mar 01 '22 02:03 wjp719

@iverase I have simplified the dimensions to 1, please review again.

wjp719 avatar Mar 01 '22 09:03 wjp719

One thing I'm not totally happy with is that the change assumes that the field is either not indexed or indexed with LongPoint, while this query would previously also work with IntPoint. Things like this would be easier to do once we know more about both how the field is indexed and how doc values have been created, e.g. via LUCENE-10162.

jpountz avatar Mar 01 '22 13:03 jpountz

@jpountz @iverase ./gradlew check now passes on my local machine, please help run the checks again.

wjp719 avatar Mar 02 '22 02:03 wjp719

One thing I'm not totally happy with is that the change assumes that the field is either not indexed or indexed with LongPoint, while this query would previously also work with IntPoint. Things like this would be easier to do once we know more about both how the field is indexed and how doc values have been created, e.g. via LUCENE-10162.

@jpountz Hi, what else should I do for this PR?

wjp719 avatar Mar 03 '22 02:03 wjp719

@jpountz Hi, please take some time to review this PR again, thanks.

wjp719 avatar Mar 15 '22 15:03 wjp719

@jpountz @iverase Hi, is this PR OK now?

wjp719 avatar Mar 29 '22 02:03 wjp719

Thanks, this looks good to me! Can you add a CHANGES entry with your name under 9.5?

Thanks a lot, I have added the CHANGES entry.

Also, this PR has a limitation: only an index sorted in ascending order can use the BKD binary search to get the min/max docId. The main reason is that the BKD tree currently sorts points by the two dimensions (point value, docId), both in ascending order. If docs are index-sorted in ascending order, then the docIds across all leaf points are monotonically increasing, so we can use the BKD binary search.

In our local work, if docs are index-sorted in descending order, we modify the BKD sorting logic to (point value ascending, docId descending), so that the docIds across all leaf points are monotonically decreasing, and then we can use the BKD binary search again.

So may I open another PR to add an option so that BKD can sort by (point value ascending, docId descending)? Then the BKD binary search can work with both ascending and descending index sorts.
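
To make the two orders concrete, a toy illustration (plain Java, not BKD code; a "point" here is just a (value, docId) pair, while real BKD points are packed byte arrays):

import java.util.Comparator;
import java.util.List;

public class PointSortOrders {
  record Point(long value, int docId) {}

  // Current BKD order: value ascending, then docId ascending. With an ascending
  // index sort, leaf docIds come out monotonically increasing, which the binary
  // search in this PR relies on.
  static final Comparator<Point> CURRENT =
      Comparator.comparingLong(Point::value).thenComparingInt(Point::docId);

  // Proposed optional order for a descending index sort: value ascending, then
  // docId descending, so leaf docIds become monotonically decreasing instead.
  static final Comparator<Point> PROPOSED =
      Comparator.comparingLong(Point::value)
          .thenComparing(Comparator.comparingInt(Point::docId).reversed());

  public static void main(String[] args) {
    List<Point> points = List.of(new Point(10, 7), new Point(10, 3), new Point(12, 9));
    System.out.println(points.stream().sorted(CURRENT).toList());  // docIds in order: 3, 7, 9
    System.out.println(points.stream().sorted(PROPOSED).toList()); // docIds in order: 7, 3, 9
  }
}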

wjp719 avatar Sep 22 '22 01:09 wjp719

I was wondering about descending sorts too! Do we actually need to make this configurable on BKD trees? I would rather not add this option and instead make the binary search logic a bit more complex/inefficient.

jpountz avatar Sep 22 '22 06:09 jpountz

I would rather not add this option and instead make the binary search logic a bit more complex/inefficient.

OK, thanks. When the index is sorted in descending order, I have tried the BKD binary search with the original BKD tree, but when the number of docs sharing the same point value reaches 100 thousand, the BKD binary search takes as long as the doc-values binary search. In my trial, I needed to load all points with the same maximum point value to get the min/max docId; maybe there are other optimizations.

wjp719 avatar Sep 22 '22 07:09 wjp719

I'm (maybe naively) assuming that we could work around this case at the inner node level by skipping inner nodes whose max value is equal to the min value if we have already seen this value before?
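
Something along these lines, maybe (untested; it reuses the PointTree and ByteArrayComparator types from the earlier snippet and assumes we keep track of the last value whose docs we have already visited):

  // Skip condition sketch: an inner node whose min and max packed values are both
  // equal to a value we have already fully visited cannot change the answer, so
  // there is no need to descend into it and visit its leaves.
  private boolean canSkip(PointValues.PointTree node, byte[] lastVisitedValue, ByteArrayComparator comparator)
      throws IOException {
    return lastVisitedValue != null
        && comparator.compare(node.getMinPackedValue(), 0, node.getMaxPackedValue(), 0) == 0
        && comparator.compare(node.getMinPackedValue(), 0, lastVisitedValue, 0) == 0;
  }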

jpountz avatar Sep 22 '22 08:09 jpountz

I'm (maybe naively) assuming that we could work around this case at the inner node level by skipping inner nodes whose max value is equal to the min value if we have already seen this value before?

Sure, the inner nodes can be skipped, but for the boundary values, say the range goes from 1663837201000 to 1663839001000, we need to load every leaf block whose points have value 1663837201000 or 1663839001000. If there are 100 thousand docs with point value 1663837201000 or 1663839001000, we need to load many leaf blocks to get the min/max docId; maybe these blocks cannot be skipped?

This case comes from real data: there are 60 billion docs per day, the timestamp has second granularity, and on average there are 100 thousand docs per second.

wjp719 avatar Sep 22 '22 09:09 wjp719