druid
druid copied to clipboard
add new typed in filter
Description
Adds a new TypedInFilter
similar to the work done for equality and range filters in #14542, as a replacement for InDimFilter
which deals in native match value types instead of only supporting string sets. This results in a pretty decent performance increase particularly for matching numeric columns, both when using indexes and matchers, since we don't need to convert the values to do the comparison when the filter match value type matches the underlying column type.
The SQL planner uses new TypedInFilter
whenever sqlUseBoundAndSelectors
is set to false, which is itself by default tied to druid.generic.useDefaultValueForNull=false
(also the default). This filter does not support default value mode at all, so if druid.generic.useDefaultValueForNull=true
then this filter will not be used even if sqlUseBoundAndSelectors
is set to false.
The JSON creator for TypedInFilter
:
@JsonCreator
public TypedInFilter(
@JsonProperty("column") String column,
@JsonProperty("matchValueType") ColumnType matchValueType,
@JsonProperty("values") @Nullable List<?> values,
@JsonProperty("sortedValues") @Nullable List<?> sortedValues,
@JsonProperty("filterTuning") @Nullable FilterTuning filterTuning
)
accept either values
which may or may not be sorted (the constructor checks), and sortedValues
which is trusted to be sorted. Nearly all callers should just set values
unless presorting the values, and if the values are already sorted then the O(n) scan of them to check for sortedness should be relatively cheap and will automatically move them to sortedValues
. The sorting if needed is done lazily, only when doing things like json serialization, computing cache key, or checking if filters are equivalent. The idea is that the values can be sorted by the broker and then be guaranteed to be sorted when serialized to send to the historicals so that they don't have to waste any effort sorting the values.
Unfortunately the SQL planner does serialize these filters when explaining queries, which is the idea behind checking if the values are pre-sorted (and also the same type as the matchValueType parameter) in the constructor, as much of the SQL planning will actually have the values already sorted, the only cases which are not are things like join and lookup rewrites. Hopefully this will be a bit cheaper than building the sorted set that InDimFilter
uses.
Speaking of sorted sets, this new filter uses plain java lists instead of a sorted set, and then Collections.binarySearch
to locate items within them. Since these value sets are immutable by the time we need to find things in them, the overhead of sorted sets doesn't really bring much to the table, so this doing this binary search instead is slightly faster as well.
IN filters on numeric columns:
"SELECT long2 FROM foo WHERE long2 IN (1, 19, 21, 23, 25, 26, 46)",
Benchmark (query) (rowsPerSegment) (schema) (stringEncoding) (vectorize) Mode Cnt Score Error Units
SqlNestedDataBenchmark.querySql 22 5000000 explicit none false avgt 5 316.998 ± 2.726 ms/op
SqlNestedDataBenchmark.querySql 22 5000000 explicit none force avgt 5 318.369 ± 3.914 ms/op
SqlNestedDataBenchmark.querySql 22 5000000 auto none false avgt 5 171.867 ± 2.913 ms/op
SqlNestedDataBenchmark.querySql 22 5000000 auto none force avgt 5 171.676 ± 2.216 ms/op
after:
SqlNestedDataBenchmark.querySql 22 5000000 explicit none false avgt 5 265.012 ± 8.229 ms/op
SqlNestedDataBenchmark.querySql 22 5000000 explicit none force avgt 5 273.842 ± 7.391 ms/op
SqlNestedDataBenchmark.querySql 22 5000000 auto none false avgt 5 171.550 ± 1.633 ms/op
SqlNestedDataBenchmark.querySql 22 5000000 auto none force avgt 5 173.540 ± 2.413 ms/op
"SELECT long2 FROM foo WHERE long2 IN (1, 19, 21, 23, 25, 26, 46) GROUP BY 1",
Benchmark (query) (rowsPerSegment) (schema) (stringEncoding) (vectorize) Mode Cnt Score Error Units
SqlNestedDataBenchmark.querySql 24 5000000 explicit none false avgt 5 365.012 ± 6.052 ms/op
SqlNestedDataBenchmark.querySql 24 5000000 explicit none force avgt 5 250.669 ± 3.304 ms/op
SqlNestedDataBenchmark.querySql 24 5000000 auto none false avgt 5 218.247 ± 1.382 ms/op
SqlNestedDataBenchmark.querySql 24 5000000 auto none force avgt 5 162.742 ± 4.829 ms/op
after:
SqlNestedDataBenchmark.querySql 24 5000000 explicit none false avgt 5 307.309 ± 10.928 ms/op
SqlNestedDataBenchmark.querySql 24 5000000 explicit none force avgt 5 237.739 ± 8.574 ms/op
SqlNestedDataBenchmark.querySql 24 5000000 auto none false avgt 5 213.082 ± 3.638 ms/op
SqlNestedDataBenchmark.querySql 24 5000000 auto none force avgt 5 155.719 ± 3.434 ms/op
"SELECT long2 FROM foo WHERE long2 IN (1, 19, 21, 23, 25, 26, 46, 50, 51, 55, 60, 61, 66, 68, 69, 70, 77, 88, 90, 92, 93, 94, 95, 100, 101, 102, 104, 109, 111, 113, 114, 115, 120, 121, 122, 134, 135, 136, 140, 142, 150, 155, 170, 172, 173, 174, 180, 181, 190, 199, 200, 201, 202, 203, 204)",
Benchmark (query) (rowsPerSegment) (schema) (stringEncoding) (vectorize) Mode Cnt Score Error Units
SqlNestedDataBenchmark.querySql 48 5000000 explicit none false avgt 5 858.528 ± 6.713 ms/op
SqlNestedDataBenchmark.querySql 48 5000000 explicit none force avgt 5 860.413 ± 10.305 ms/op
SqlNestedDataBenchmark.querySql 48 5000000 auto none false avgt 5 146.786 ± 5.952 ms/op
SqlNestedDataBenchmark.querySql 48 5000000 auto none force avgt 5 154.059 ± 8.155 ms/op
after:
SqlNestedDataBenchmark.querySql 48 5000000 explicit none false avgt 5 260.688 ± 2.862 ms/op
SqlNestedDataBenchmark.querySql 48 5000000 explicit none force avgt 5 256.196 ± 4.310 ms/op
SqlNestedDataBenchmark.querySql 48 5000000 auto none false avgt 5 177.440 ± 1.757 ms/op
SqlNestedDataBenchmark.querySql 48 5000000 auto none force avgt 5 147.445 ± 3.371 ms/op
"SELECT long2 FROM foo WHERE long2 IN (1, 19, 21, 23, 25, 26, 46, 50, 51, 55, 60, 61, 66, 68, 69, 70, 77, 88, 90, 92, 93, 94, 95, 100, 101, 102, 104, 109, 111, 113, 114, 115, 120, 121, 122, 134, 135, 136, 140, 142, 150, 155, 170, 172, 173, 174, 180, 181, 190, 199, 200, 201, 202, 203, 204) GROUP BY 1",
Benchmark (query) (rowsPerSegment) (schema) (stringEncoding) (vectorize) Mode Cnt Score Error Units
SqlNestedDataBenchmark.querySql 50 5000000 explicit none false avgt 5 947.775 ± 12.952 ms/op
SqlNestedDataBenchmark.querySql 50 5000000 explicit none force avgt 5 377.426 ± 10.417 ms/op
SqlNestedDataBenchmark.querySql 50 5000000 auto none false avgt 5 220.622 ± 6.089 ms/op
SqlNestedDataBenchmark.querySql 50 5000000 auto none force avgt 5 163.623 ± 1.712 ms/op
after:
SqlNestedDataBenchmark.querySql 50 5000000 explicit none false avgt 5 325.515 ± 7.563 ms/op
SqlNestedDataBenchmark.querySql 50 5000000 explicit none force avgt 5 255.146 ± 4.047 ms/op
SqlNestedDataBenchmark.querySql 50 5000000 auto none false avgt 5 216.967 ± 7.754 ms/op
SqlNestedDataBenchmark.querySql 50 5000000 auto none force avgt 5 157.717 ± 4.723 ms/op
"SELECT long2 FROM foo WHERE double3 IN (1.0, 19.0, 21.0, 23.0, 25.0, 26.0, 46.0, 50.0, 51.0, 55.0, 60.0, 61.0, 66.0, 68.0, 69.0, 70.0, 77.0, 88.0, 90.0, 92.0, 93.0, 94.0, 95.0, 100.0, 101.0, 102.0, 104.0, 109.0, 111.0, 113.0, 114.0, 115.0, 120.0, 121.0, 122.0, 134.0, 135.0, 136.0, 140.0, 142.0, 150.0, 155.0, 170.0, 172.0, 173.0, 174.0, 180.0, 181.0, 190.0, 199.0, 200.0, 201.0, 202.0, 203.0, 204.0)",
Benchmark (query) (rowsPerSegment) (schema) (stringEncoding) (vectorize) Mode Cnt Score Error Units
SqlNestedDataBenchmark.querySql 52 5000000 explicit none false avgt 5 773.810 ± 7.200 ms/op
SqlNestedDataBenchmark.querySql 52 5000000 explicit none force avgt 5 772.494 ± 3.224 ms/op
SqlNestedDataBenchmark.querySql 52 5000000 auto none false avgt 5 2.169 ± 0.218 ms/op
SqlNestedDataBenchmark.querySql 52 5000000 auto none force avgt 5 2.205 ± 0.217 ms/op
after:
SqlNestedDataBenchmark.querySql 52 5000000 explicit none false avgt 5 108.259 ± 2.217 ms/op
SqlNestedDataBenchmark.querySql 52 5000000 explicit none force avgt 5 108.179 ± 1.548 ms/op
SqlNestedDataBenchmark.querySql 52 5000000 auto none false avgt 5 2.070 ± 0.144 ms/op
SqlNestedDataBenchmark.querySql 52 5000000 auto none force avgt 5 2.019 ± 0.133 ms/op
"SELECT long2 FROM foo WHERE double3 IN (1.0, 19.0, 21.0, 23.0, 25.0, 26.0, 46.0, 50.0, 51.0, 55.0, 60.0, 61.0, 66.0, 68.0, 69.0, 70.0, 77.0, 88.0, 90.0, 92.0, 93.0, 94.0, 95.0, 100.0, 101.0, 102.0, 104.0, 109.0, 111.0, 113.0, 114.0, 115.0, 120.0, 121.0, 122.0, 134.0, 135.0, 136.0, 140.0, 142.0, 150.0, 155.0, 170.0, 172.0, 173.0, 174.0, 180.0, 181.0, 190.0, 199.0, 200.0, 201.0, 202.0, 203.0, 204.0) GROUP BY 1",
Benchmark (query) (rowsPerSegment) (schema) (stringEncoding) (vectorize) Mode Cnt Score Error Units
SqlNestedDataBenchmark.querySql 54 5000000 explicit none false avgt 5 904.097 ± 9.795 ms/op
SqlNestedDataBenchmark.querySql 54 5000000 explicit none force avgt 5 242.745 ± 5.692 ms/op
SqlNestedDataBenchmark.querySql 54 5000000 auto none false avgt 5 128.160 ± 2.043 ms/op
SqlNestedDataBenchmark.querySql 54 5000000 auto none force avgt 5 129.161 ± 5.200 ms/op
after:
SqlNestedDataBenchmark.querySql 54 5000000 explicit none false avgt 5 230.324 ± 12.410 ms/op
SqlNestedDataBenchmark.querySql 54 5000000 explicit none force avgt 5 201.999 ± 4.346 ms/op
SqlNestedDataBenchmark.querySql 54 5000000 auto none false avgt 5 125.256 ± 18.262 ms/op
SqlNestedDataBenchmark.querySql 54 5000000 auto none force avgt 5 123.719 ± 8.906 ms/op
IN filters on string columns:
SELECT * FROM foo WHERE dimSequential IN ('1', '2', '3', '4', '5', '10', '11', '20', '21', '23', '40', '50', '64', '70', '100')
before:
Benchmark (query) (rowsPerSegment) (schema) (storageType) (stringEncoding) (vectorize) Mode Cnt Score Error Units
SqlBenchmark.querySql 24 5000000 explicit mmap none false avgt 5 272.734 ± 18.392 ms/op
SqlBenchmark.querySql 24 5000000 explicit mmap none force avgt 5 272.081 ± 31.917 ms/op
SqlBenchmark.querySql 24 5000000 auto mmap none false avgt 5 232.986 ± 14.621 ms/op
SqlBenchmark.querySql 24 5000000 auto mmap none force avgt 5 229.473 ± 11.148 ms/op
after:
SqlBenchmark.querySql 24 5000000 explicit mmap none false avgt 5 260.132 ± 5.194 ms/op
SqlBenchmark.querySql 24 5000000 explicit mmap none force avgt 5 263.946 ± 4.753 ms/op
SqlBenchmark.querySql 24 5000000 auto mmap none false avgt 5 228.280 ± 12.907 ms/op
SqlBenchmark.querySql 24 5000000 auto mmap none force avgt 5 228.460 ± 10.062 ms/op
SELECT dimSequential, dimZipf, SUM(sumLongSequential) FROM foo WHERE dimSequential IN ('1', '2', '3', '4', '5', '10', '11', '20', '21', '23', '40', '50', '64', '70', '100') GROUP BY 1, 2
before:
Benchmark (query) (rowsPerSegment) (schema) (storageType) (stringEncoding) (vectorize) Mode Cnt Score Error Units
SqlBenchmark.querySql 26 5000000 explicit mmap none false avgt 5 26.105 ± 2.186 ms/op
SqlBenchmark.querySql 26 5000000 explicit mmap none force avgt 5 22.348 ± 1.196 ms/op
SqlBenchmark.querySql 26 5000000 auto mmap none false avgt 5 25.959 ± 0.923 ms/op
SqlBenchmark.querySql 26 5000000 auto mmap none force avgt 5 22.445 ± 1.647 ms/op
after:
SqlBenchmark.querySql 26 5000000 explicit mmap none false avgt 5 25.170 ± 0.654 ms/op
SqlBenchmark.querySql 26 5000000 explicit mmap none force avgt 5 21.514 ± 1.403 ms/op
SqlBenchmark.querySql 26 5000000 auto mmap none false avgt 5 25.258 ± 0.501 ms/op
SqlBenchmark.querySql 26 5000000 auto mmap none force avgt 5 21.564 ± 0.824 ms/op
The string results are pretty close since most of the internals are identical, though still shows a slight improvement from using the sorted list instead of sorted set.
Release note
TBD
This PR has:
- [ ] been self-reviewed.
- [ ] using the concurrency checklist (Remove this item if the PR doesn't have any relation to concurrency.)
- [ ] added documentation for new or modified features or behaviors.
- [ ] a release note entry in the PR description.
- [ ] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
- [ ] added or updated version, license, or notice information in licenses.yaml
- [ ] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
- [ ] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
- [ ] added integration tests.
- [ ] been tested in a test Druid cluster.