druid icon indicating copy to clipboard operation
druid copied to clipboard

add new typed in filter

Open clintropolis opened this issue 11 months ago • 0 comments

Description

Adds a new TypedInFilter similar to the work done for equality and range filters in #14542, as a replacement for InDimFilter which deals in native match value types instead of only supporting string sets. This results in a pretty decent performance increase particularly for matching numeric columns, both when using indexes and matchers, since we don't need to convert the values to do the comparison when the filter match value type matches the underlying column type.

The SQL planner uses new TypedInFilter whenever sqlUseBoundAndSelectors is set to false, which is itself by default tied to druid.generic.useDefaultValueForNull=false (also the default). This filter does not support default value mode at all, so if druid.generic.useDefaultValueForNull=true then this filter will not be used even if sqlUseBoundAndSelectors is set to false.

The JSON creator for TypedInFilter:

  @JsonCreator
  public TypedInFilter(
      @JsonProperty("column") String column,
      @JsonProperty("matchValueType") ColumnType matchValueType,
      @JsonProperty("values") @Nullable List<?> values,
      @JsonProperty("sortedValues") @Nullable List<?> sortedValues,
      @JsonProperty("filterTuning") @Nullable FilterTuning filterTuning
  )

accept either values which may or may not be sorted (the constructor checks), and sortedValues which is trusted to be sorted. Nearly all callers should just set values unless presorting the values, and if the values are already sorted then the O(n) scan of them to check for sortedness should be relatively cheap and will automatically move them to sortedValues. The sorting if needed is done lazily, only when doing things like json serialization, computing cache key, or checking if filters are equivalent. The idea is that the values can be sorted by the broker and then be guaranteed to be sorted when serialized to send to the historicals so that they don't have to waste any effort sorting the values.

Unfortunately the SQL planner does serialize these filters when explaining queries, which is the idea behind checking if the values are pre-sorted (and also the same type as the matchValueType parameter) in the constructor, as much of the SQL planning will actually have the values already sorted, the only cases which are not are things like join and lookup rewrites. Hopefully this will be a bit cheaper than building the sorted set that InDimFilter uses.

Speaking of sorted sets, this new filter uses plain java lists instead of a sorted set, and then Collections.binarySearch to locate items within them. Since these value sets are immutable by the time we need to find things in them, the overhead of sorted sets doesn't really bring much to the table, so this doing this binary search instead is slightly faster as well.

IN filters on numeric columns:

"SELECT long2 FROM foo WHERE long2 IN (1, 19, 21, 23, 25, 26, 46)",
Benchmark                        (query)  (rowsPerSegment)  (schema)  (stringEncoding)  (vectorize)  Mode  Cnt    Score    Error  Units
SqlNestedDataBenchmark.querySql       22           5000000  explicit              none        false  avgt    5  316.998 ±  2.726  ms/op
SqlNestedDataBenchmark.querySql       22           5000000  explicit              none        force  avgt    5  318.369 ±  3.914  ms/op
SqlNestedDataBenchmark.querySql       22           5000000      auto              none        false  avgt    5  171.867 ±  2.913  ms/op
SqlNestedDataBenchmark.querySql       22           5000000      auto              none        force  avgt    5  171.676 ±  2.216  ms/op
after:
SqlNestedDataBenchmark.querySql       22           5000000  explicit              none        false  avgt    5  265.012 ±  8.229  ms/op
SqlNestedDataBenchmark.querySql       22           5000000  explicit              none        force  avgt    5  273.842 ±  7.391  ms/op
SqlNestedDataBenchmark.querySql       22           5000000      auto              none        false  avgt    5  171.550 ±  1.633  ms/op
SqlNestedDataBenchmark.querySql       22           5000000      auto              none        force  avgt    5  173.540 ±  2.413  ms/op

"SELECT long2 FROM foo WHERE long2 IN (1, 19, 21, 23, 25, 26, 46) GROUP BY 1",
Benchmark                        (query)  (rowsPerSegment)  (schema)  (stringEncoding)  (vectorize)  Mode  Cnt    Score    Error  Units
SqlNestedDataBenchmark.querySql       24           5000000  explicit              none        false  avgt    5  365.012 ±  6.052  ms/op
SqlNestedDataBenchmark.querySql       24           5000000  explicit              none        force  avgt    5  250.669 ±  3.304  ms/op
SqlNestedDataBenchmark.querySql       24           5000000      auto              none        false  avgt    5  218.247 ±  1.382  ms/op
SqlNestedDataBenchmark.querySql       24           5000000      auto              none        force  avgt    5  162.742 ±  4.829  ms/op
after:
SqlNestedDataBenchmark.querySql       24           5000000  explicit              none        false  avgt    5  307.309 ± 10.928  ms/op
SqlNestedDataBenchmark.querySql       24           5000000  explicit              none        force  avgt    5  237.739 ±  8.574  ms/op
SqlNestedDataBenchmark.querySql       24           5000000      auto              none        false  avgt    5  213.082 ±  3.638  ms/op
SqlNestedDataBenchmark.querySql       24           5000000      auto              none        force  avgt    5  155.719 ±  3.434  ms/op

"SELECT long2 FROM foo WHERE long2 IN (1, 19, 21, 23, 25, 26, 46, 50, 51, 55, 60, 61, 66, 68, 69, 70, 77, 88, 90, 92, 93, 94, 95, 100, 101, 102, 104, 109, 111, 113, 114, 115, 120, 121, 122, 134, 135, 136, 140, 142, 150, 155, 170, 172, 173, 174, 180, 181, 190, 199, 200, 201, 202, 203, 204)",
Benchmark                        (query)  (rowsPerSegment)  (schema)  (stringEncoding)  (vectorize)  Mode  Cnt    Score    Error  Units
SqlNestedDataBenchmark.querySql       48           5000000  explicit              none        false  avgt    5  858.528 ±  6.713  ms/op
SqlNestedDataBenchmark.querySql       48           5000000  explicit              none        force  avgt    5  860.413 ± 10.305  ms/op
SqlNestedDataBenchmark.querySql       48           5000000      auto              none        false  avgt    5  146.786 ±  5.952  ms/op
SqlNestedDataBenchmark.querySql       48           5000000      auto              none        force  avgt    5  154.059 ±  8.155  ms/op
after:
SqlNestedDataBenchmark.querySql       48           5000000  explicit              none        false  avgt    5  260.688 ±  2.862  ms/op
SqlNestedDataBenchmark.querySql       48           5000000  explicit              none        force  avgt    5  256.196 ±  4.310  ms/op
SqlNestedDataBenchmark.querySql       48           5000000      auto              none        false  avgt    5  177.440 ±  1.757  ms/op
SqlNestedDataBenchmark.querySql       48           5000000      auto              none        force  avgt    5  147.445 ±  3.371  ms/op

"SELECT long2 FROM foo WHERE long2 IN (1, 19, 21, 23, 25, 26, 46, 50, 51, 55, 60, 61, 66, 68, 69, 70, 77, 88, 90, 92, 93, 94, 95, 100, 101, 102, 104, 109, 111, 113, 114, 115, 120, 121, 122, 134, 135, 136, 140, 142, 150, 155, 170, 172, 173, 174, 180, 181, 190, 199, 200, 201, 202, 203, 204) GROUP BY 1",
Benchmark                        (query)  (rowsPerSegment)  (schema)  (stringEncoding)  (vectorize)  Mode  Cnt    Score    Error  Units
SqlNestedDataBenchmark.querySql       50           5000000  explicit              none        false  avgt    5  947.775 ± 12.952  ms/op
SqlNestedDataBenchmark.querySql       50           5000000  explicit              none        force  avgt    5  377.426 ± 10.417  ms/op
SqlNestedDataBenchmark.querySql       50           5000000      auto              none        false  avgt    5  220.622 ±  6.089  ms/op
SqlNestedDataBenchmark.querySql       50           5000000      auto              none        force  avgt    5  163.623 ±  1.712  ms/op
after:
SqlNestedDataBenchmark.querySql       50           5000000  explicit              none        false  avgt    5  325.515 ±  7.563  ms/op
SqlNestedDataBenchmark.querySql       50           5000000  explicit              none        force  avgt    5  255.146 ±  4.047  ms/op
SqlNestedDataBenchmark.querySql       50           5000000      auto              none        false  avgt    5  216.967 ±  7.754  ms/op
SqlNestedDataBenchmark.querySql       50           5000000      auto              none        force  avgt    5  157.717 ±  4.723  ms/op

"SELECT long2 FROM foo WHERE double3 IN (1.0, 19.0, 21.0, 23.0, 25.0, 26.0, 46.0, 50.0, 51.0, 55.0, 60.0, 61.0, 66.0, 68.0, 69.0, 70.0, 77.0, 88.0, 90.0, 92.0, 93.0, 94.0, 95.0, 100.0, 101.0, 102.0, 104.0, 109.0, 111.0, 113.0, 114.0, 115.0, 120.0, 121.0, 122.0, 134.0, 135.0, 136.0, 140.0, 142.0, 150.0, 155.0, 170.0, 172.0, 173.0, 174.0, 180.0, 181.0, 190.0, 199.0, 200.0, 201.0, 202.0, 203.0, 204.0)",
Benchmark                        (query)  (rowsPerSegment)  (schema)  (stringEncoding)  (vectorize)  Mode  Cnt    Score    Error  Units
SqlNestedDataBenchmark.querySql       52           5000000  explicit              none        false  avgt    5  773.810 ±  7.200  ms/op
SqlNestedDataBenchmark.querySql       52           5000000  explicit              none        force  avgt    5  772.494 ±  3.224  ms/op
SqlNestedDataBenchmark.querySql       52           5000000      auto              none        false  avgt    5    2.169 ±  0.218  ms/op
SqlNestedDataBenchmark.querySql       52           5000000      auto              none        force  avgt    5    2.205 ±  0.217  ms/op
after:
SqlNestedDataBenchmark.querySql       52           5000000  explicit              none        false  avgt    5  108.259 ±  2.217  ms/op
SqlNestedDataBenchmark.querySql       52           5000000  explicit              none        force  avgt    5  108.179 ±  1.548  ms/op
SqlNestedDataBenchmark.querySql       52           5000000      auto              none        false  avgt    5    2.070 ±  0.144  ms/op
SqlNestedDataBenchmark.querySql       52           5000000      auto              none        force  avgt    5    2.019 ±  0.133  ms/op

"SELECT long2 FROM foo WHERE double3 IN (1.0, 19.0, 21.0, 23.0, 25.0, 26.0, 46.0, 50.0, 51.0, 55.0, 60.0, 61.0, 66.0, 68.0, 69.0, 70.0, 77.0, 88.0, 90.0, 92.0, 93.0, 94.0, 95.0, 100.0, 101.0, 102.0, 104.0, 109.0, 111.0, 113.0, 114.0, 115.0, 120.0, 121.0, 122.0, 134.0, 135.0, 136.0, 140.0, 142.0, 150.0, 155.0, 170.0, 172.0, 173.0, 174.0, 180.0, 181.0, 190.0, 199.0, 200.0, 201.0, 202.0, 203.0, 204.0) GROUP BY 1",
Benchmark                        (query)  (rowsPerSegment)  (schema)  (stringEncoding)  (vectorize)  Mode  Cnt    Score    Error  Units
SqlNestedDataBenchmark.querySql       54           5000000  explicit              none        false  avgt    5  904.097 ±  9.795  ms/op
SqlNestedDataBenchmark.querySql       54           5000000  explicit              none        force  avgt    5  242.745 ±  5.692  ms/op
SqlNestedDataBenchmark.querySql       54           5000000      auto              none        false  avgt    5  128.160 ±  2.043  ms/op
SqlNestedDataBenchmark.querySql       54           5000000      auto              none        force  avgt    5  129.161 ±  5.200  ms/op
after:
SqlNestedDataBenchmark.querySql       54           5000000  explicit              none        false  avgt    5  230.324 ± 12.410  ms/op
SqlNestedDataBenchmark.querySql       54           5000000  explicit              none        force  avgt    5  201.999 ±  4.346  ms/op
SqlNestedDataBenchmark.querySql       54           5000000      auto              none        false  avgt    5  125.256 ± 18.262  ms/op
SqlNestedDataBenchmark.querySql       54           5000000      auto              none        force  avgt    5  123.719 ±  8.906  ms/op

IN filters on string columns:

SELECT * FROM foo WHERE dimSequential IN ('1', '2', '3', '4', '5', '10', '11', '20', '21', '23', '40', '50', '64', '70', '100')

before:
Benchmark              (query)  (rowsPerSegment)  (schema)  (storageType)  (stringEncoding)  (vectorize)  Mode  Cnt    Score    Error  Units
SqlBenchmark.querySql       24           5000000  explicit           mmap              none        false  avgt    5  272.734 ± 18.392  ms/op
SqlBenchmark.querySql       24           5000000  explicit           mmap              none        force  avgt    5  272.081 ± 31.917  ms/op
SqlBenchmark.querySql       24           5000000      auto           mmap              none        false  avgt    5  232.986 ± 14.621  ms/op
SqlBenchmark.querySql       24           5000000      auto           mmap              none        force  avgt    5  229.473 ± 11.148  ms/op
after:
SqlBenchmark.querySql       24           5000000  explicit           mmap              none        false  avgt    5  260.132 ±  5.194  ms/op
SqlBenchmark.querySql       24           5000000  explicit           mmap              none        force  avgt    5  263.946 ±  4.753  ms/op
SqlBenchmark.querySql       24           5000000      auto           mmap              none        false  avgt    5  228.280 ± 12.907  ms/op
SqlBenchmark.querySql       24           5000000      auto           mmap              none        force  avgt    5  228.460 ± 10.062  ms/op


SELECT dimSequential, dimZipf, SUM(sumLongSequential) FROM foo WHERE dimSequential IN ('1', '2', '3', '4', '5', '10', '11', '20', '21', '23', '40', '50', '64', '70', '100') GROUP BY 1, 2

before:
Benchmark              (query)  (rowsPerSegment)  (schema)  (storageType)  (stringEncoding)  (vectorize)  Mode  Cnt    Score    Error  Units
SqlBenchmark.querySql       26           5000000  explicit           mmap              none        false  avgt    5   26.105 ±  2.186  ms/op
SqlBenchmark.querySql       26           5000000  explicit           mmap              none        force  avgt    5   22.348 ±  1.196  ms/op
SqlBenchmark.querySql       26           5000000      auto           mmap              none        false  avgt    5   25.959 ±  0.923  ms/op
SqlBenchmark.querySql       26           5000000      auto           mmap              none        force  avgt    5   22.445 ±  1.647  ms/op
after:
SqlBenchmark.querySql       26           5000000  explicit           mmap              none        false  avgt    5   25.170 ±  0.654  ms/op
SqlBenchmark.querySql       26           5000000  explicit           mmap              none        force  avgt    5   21.514 ±  1.403  ms/op
SqlBenchmark.querySql       26           5000000      auto           mmap              none        false  avgt    5   25.258 ±  0.501  ms/op
SqlBenchmark.querySql       26           5000000      auto           mmap              none        force  avgt    5   21.564 ±  0.824  ms/op

The string results are pretty close since most of the internals are identical, though still shows a slight improvement from using the sorted list instead of sorted set.

Release note

TBD


This PR has:

  • [ ] been self-reviewed.
    • [ ] using the concurrency checklist (Remove this item if the PR doesn't have any relation to concurrency.)
  • [ ] added documentation for new or modified features or behaviors.
  • [ ] a release note entry in the PR description.
  • [ ] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • [ ] added or updated version, license, or notice information in licenses.yaml
  • [ ] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • [ ] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • [ ] added integration tests.
  • [ ] been tested in a test Druid cluster.

clintropolis avatar Mar 05 '24 05:03 clintropolis