Slow comparisions to dictionary columns with type coercion
Is your feature request related to a problem or challenge?
In InfluxDB we use Dictionary(Int32, Utf8) columns a lot.
Queries like this (with string constants) work great and are very fast
SELECT ... WHERE column = '1'
Queries like this (note 1 is an integer, not a '1') the query goes very slow
SELECT ... WHERE column = 1
@erratic-pattern and I tracked this down to an issue/ limitation in type coercion:
Reproducer
DataFusion CLI v37.1.0
> create table test as values (arrow_cast('1', 'Dictionary(Int32, Utf8)'));
0 row(s) fetched.
Elapsed 0.010 seconds.
> select arrow_typeof(column1) from test;
+----------------------------+
| arrow_typeof(test.column1) |
+----------------------------+
| Dictionary(Int32, Utf8) |
+----------------------------+
1 row(s) fetched.
Elapsed 0.002 seconds.
> explain SELECT * from test where column1 = 1;
+---------------+---------------------------------------------------+
| plan_type | plan |
+---------------+---------------------------------------------------+
| logical_plan | Filter: CAST(test.column1 AS Utf8) = Utf8("1") |
| | TableScan: test projection=[column1] |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192 |
| | FilterExec: CAST(column1@0 AS Utf8) = 1 |
| | MemoryExec: partitions=1, partition_sizes=[1] |
| | |
+---------------+---------------------------------------------------+
2 row(s) fetched.
Elapsed 0.003 seconds.
I think this shows the core problem:
| logical_plan | Filter: CAST(test.column1 AS Utf8) = Utf8("1") |
It basically shows the column is being converted to a string, rather than the constant being converted to th ecorrect type.
Not only does this mean the column is being un-encoded for the comparsion, it also means that PruningPredicate doesn't work either
Describe the solution you'd like
I would like the query to go fast lol
Specifically, I think the filter should look like this (no cast on the column, and instead the constant type matches)
| logical_plan | Filter: test.column1 = Dictionary(Int32, Utf8("1")) |
Note this is what happens if you compare the dictionary column to a string literal:
> explain SELECT * from test where column1 = '1';
+---------------+-----------------------------------------------------+
| plan_type | plan |
+---------------+-----------------------------------------------------+
| logical_plan | Filter: test.column1 = Dictionary(Int32, Utf8("1")) |
| | TableScan: test projection=[column1] |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192 |
| | FilterExec: column1@0 = 1 |
| | MemoryExec: partitions=1, partition_sizes=[1] |
| | |
+---------------+-----------------------------------------------------+
2 row(s) fetched.
Elapsed 0.002 seconds.
>
Describe alternatives you've considered
We could potentially update the coercion logic to coerce 1 to Dictionary(.. "1") or maybe update the unwrap_comparsion logic
Additional context
No response
I have a PR that fixes this. https://github.com/apache/datafusion/pull/10221 Here is the explain after making the change:
> explain SELECT * from test where column1 = 1;
+---------------+-----------------------------------------------------+
| plan_type | plan |
+---------------+-----------------------------------------------------+
| logical_plan | Filter: test.column1 = Dictionary(Int32, Utf8("1")) |
| | TableScan: test projection=[column1] |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192 |
| | FilterExec: column1@0 = 1 |
| | MemoryExec: partitions=1, partition_sizes=[1] |
| | |
+---------------+-----------------------------------------------------+
2 row(s) fetched.
Elapsed 0.008 seconds.
However it looks like some tests are failing so I am still looking into it.
https://github.com/apache/datafusion/pull/10323 is ready for review and avoids the previously discussed issues with https://github.com/apache/datafusion/pull/10221
Thanks @erratic-pattern -- I hope to look at this tomorrow morning