datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

Slow comparisions to dictionary columns with type coercion

Open alamb opened this issue 1 year ago • 1 comments

Is your feature request related to a problem or challenge?

In InfluxDB we use Dictionary(Int32, Utf8) columns a lot.

Queries like this (with string constants) work great and are very fast

SELECT ... WHERE column = '1'

Queries like this (note 1 is an integer, not a '1') the query goes very slow

SELECT ... WHERE column = 1

@erratic-pattern and I tracked this down to an issue/ limitation in type coercion:

Reproducer

DataFusion CLI v37.1.0
> create table test as values (arrow_cast('1', 'Dictionary(Int32, Utf8)'));
0 row(s) fetched.
Elapsed 0.010 seconds.

> select arrow_typeof(column1) from test;
+----------------------------+
| arrow_typeof(test.column1) |
+----------------------------+
| Dictionary(Int32, Utf8)    |
+----------------------------+
1 row(s) fetched.
Elapsed 0.002 seconds.

> explain SELECT * from test where column1 = 1;
+---------------+---------------------------------------------------+
| plan_type     | plan                                              |
+---------------+---------------------------------------------------+
| logical_plan  | Filter: CAST(test.column1 AS Utf8) = Utf8("1")    |
|               |   TableScan: test projection=[column1]            |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192       |
|               |   FilterExec: CAST(column1@0 AS Utf8) = 1         |
|               |     MemoryExec: partitions=1, partition_sizes=[1] |
|               |                                                   |
+---------------+---------------------------------------------------+
2 row(s) fetched.
Elapsed 0.003 seconds.

I think this shows the core problem:

| logical_plan  | Filter: CAST(test.column1 AS Utf8) = Utf8("1")    |

It basically shows the column is being converted to a string, rather than the constant being converted to th ecorrect type.

Not only does this mean the column is being un-encoded for the comparsion, it also means that PruningPredicate doesn't work either

Describe the solution you'd like

I would like the query to go fast lol

Specifically, I think the filter should look like this (no cast on the column, and instead the constant type matches)

| logical_plan  | Filter: test.column1 = Dictionary(Int32, Utf8("1")) |

Note this is what happens if you compare the dictionary column to a string literal:


> explain SELECT * from test where column1 = '1';
+---------------+-----------------------------------------------------+
| plan_type     | plan                                                |
+---------------+-----------------------------------------------------+
| logical_plan  | Filter: test.column1 = Dictionary(Int32, Utf8("1")) |
|               |   TableScan: test projection=[column1]              |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192         |
|               |   FilterExec: column1@0 = 1                         |
|               |     MemoryExec: partitions=1, partition_sizes=[1]   |
|               |                                                     |
+---------------+-----------------------------------------------------+
2 row(s) fetched.
Elapsed 0.002 seconds.

>

Describe alternatives you've considered

We could potentially update the coercion logic to coerce 1 to Dictionary(.. "1") or maybe update the unwrap_comparsion logic

Additional context

No response

alamb avatar Apr 24 '24 20:04 alamb

I have a PR that fixes this. https://github.com/apache/datafusion/pull/10221 Here is the explain after making the change:

> explain SELECT * from test where column1 = 1;
+---------------+-----------------------------------------------------+
| plan_type     | plan                                                |
+---------------+-----------------------------------------------------+
| logical_plan  | Filter: test.column1 = Dictionary(Int32, Utf8("1")) |
|               |   TableScan: test projection=[column1]              |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192         |
|               |   FilterExec: column1@0 = 1                         |
|               |     MemoryExec: partitions=1, partition_sizes=[1]   |
|               |                                                     |
+---------------+-----------------------------------------------------+
2 row(s) fetched.
Elapsed 0.008 seconds.

However it looks like some tests are failing so I am still looking into it.

erratic-pattern avatar Apr 24 '24 20:04 erratic-pattern

https://github.com/apache/datafusion/pull/10323 is ready for review and avoids the previously discussed issues with https://github.com/apache/datafusion/pull/10221

erratic-pattern avatar Apr 30 '24 22:04 erratic-pattern

Thanks @erratic-pattern -- I hope to look at this tomorrow morning

alamb avatar Apr 30 '24 22:04 alamb