modify EqualityFilter and TypedInFilter behavior for numeric match values on string columns
Description
This PR changes EqualityFilter and TypedInFilter so that numeric match values tested against string columns now use StringComparators.NUMERIC instead of converting the numeric values directly to strings for pure string equality. This makes these filters (the default in SQL compatible mode) behave consistently with 'default' value mode, which uses BoundFilter for numeric comparison of string values.
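As an illustration, here is a minimal sketch of the difference between the old and new comparison paths. This is not the actual Druid matcher code; `matchesNumericValue` is a hypothetical helper, though `StringComparators.NUMERIC` is the real comparator involved:

```java
import org.apache.druid.query.ordering.StringComparators;

public class NumericMatchSketch
{
  // Hypothetical helper; the real logic lives in the value matchers that
  // EqualityFilter and TypedInFilter build for string columns.
  static boolean matchesNumericValue(String columnValue, double matchValue)
  {
    // Old path: stringify the match value and test pure string equality,
    // e.g. String.valueOf(1.0).equals("1") -> "1.0".equals("1") -> false.
    // New path: StringComparators.NUMERIC parses both sides as numbers when
    // possible, so "1" compares equal to "1.0".
    return StringComparators.NUMERIC.compare(columnValue, String.valueOf(matchValue)) == 0;
  }
}
```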
This change is effectively an implicit cast of the STRING values to the numeric match value type, which is consistent with the casts that are eaten in the SQL layer, as well as with "classic" Druid behavior where we do our best with what we are given.
Added tests to cover numeric equality matching. DOUBLE match values in particular would previously fail to match string values, since 1.0 would become '1.0', which does not match '1'.
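A rough sketch of the shape of such a test (class and method names here are illustrative, not the actual added tests):

```java
import org.apache.druid.query.ordering.StringComparators;
import org.junit.Assert;
import org.junit.Test;

public class DoubleMatchValueSketchTest
{
  @Test
  public void testDoubleMatchValueAgainstStringColumnValue()
  {
    // The old failure mode: pure string equality after stringifying 1.0.
    Assert.assertNotEquals("1", String.valueOf(1.0));
    // The new behavior: numeric comparison treats "1" and "1.0" as equal.
    Assert.assertEquals(0, StringComparators.NUMERIC.compare("1", String.valueOf(1.0)));
  }
}
```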
I investigated an alternative: having the SQL planner not eat CAST operators as it currently does, since a query like `... WHERE stringColumn = 1.0` ends up as `... WHERE CAST(stringColumn AS DOUBLE) = 1.0` before we eat the cast. That would make the native layer a lot more explicit; however, this change does basically the same thing and is a lot less disruptive.
It may still be worth investigating not eating casts during SQL planning, but I'll save that for follow-up work.
Release note
Modified the behavior of EqualityFilter and TypedInFilter when matching numeric match values (particularly DOUBLE) against string columns: the strings are now effectively cast so that numeric comparison is used, for more consistent Druid behavior across values of the sqlUseBoundAndSelectors context flag.
This PR has:
- [x] been self-reviewed.
- [x] a release note entry in the PR description.
- [x] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
- [x] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
- [x] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
- [x] been tested in a test Druid cluster.
> It may still be worth investigating not eating casts during SQL planning, but I'll save that for follow-up work.
I think the functions in Calcite which make it easy to discard casts (like RexUtil#isLiteral) could have made it too easy to discard such things :)
It is basically a one-line change to not discard the casts in the SQL layer: https://github.com/apache/druid/blob/master/sql/src/main/java/org/apache/druid/sql/calcite/expression/builtin/CastOperatorConversion.java#L110.
The additional work is all at the native layer, to ensure there is no performance penalty for making the casts explicit (lots of things implicitly perform the cast at the native layer, which is why we eat them in the first place). UNNEST also needs some modification to allow additional virtual columns to be defined while continuing to allow filter pushdown of casts on the unnest column.
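For a sense of what that alternative would mean at the native layer, here is a hedged sketch (the column name "v0" and the overall shape are illustrative; this is not code from the PR): instead of being eaten, the cast would surface as an expression virtual column that the filter references.

```java
import org.apache.druid.math.expr.ExprMacroTable;
import org.apache.druid.query.filter.DimFilter;
import org.apache.druid.query.filter.EqualityFilter;
import org.apache.druid.segment.VirtualColumn;
import org.apache.druid.segment.column.ColumnType;
import org.apache.druid.segment.virtual.ExpressionVirtualColumn;

public class ExplicitCastSketch
{
  public static void main(String[] args)
  {
    // WHERE CAST(stringColumn AS DOUBLE) = 1.0, planned without eating the
    // cast: the cast becomes an explicit expression virtual column...
    VirtualColumn castColumn = new ExpressionVirtualColumn(
        "v0",
        "cast(\"stringColumn\", 'DOUBLE')",
        ColumnType.DOUBLE,
        ExprMacroTable.nil()
    );
    // ...and the equality filter matches the virtual column as a DOUBLE,
    // rather than implicitly casting the string values itself.
    DimFilter filter = new EqualityFilter("v0", ColumnType.DOUBLE, 1.0, null);
  }
}
```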