Cannot GROUP BY Binary
Describe the bug
This is part of #3048
I was running the ClickBench benchmark. One of its columns is binary, and the test query set contains a GROUP BY
on that binary column. I got this error:
Internal error: Unsupported data type in hasher: Binary. This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker
To Reproduce
# Download data
wget https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_0.parquet
# Use Datafusion-CLI
➜ datafusion git:(datafusion) ✗ datafusion-cli
DataFusion CLI v10.0.0
# Create External Table
❯ CREATE EXTERNAL TABLE hits STORED AS PARQUET LOCATION 'hits_0.parquet';
0 rows in set. Query took 0.002 seconds.
# This query works
❯ SELECT "URL" FROM hits LIMIT 10;
+--------------------------------------------------------------------------------------------------------------------------------------------------+
| URL |
+--------------------------------------------------------------------------------------------------------------------------------------------------+
| |
| |
| |
| |
| 687474703a2f2f686f6c6f64696c6e696b2e72752f7275737369612f30356a756c32303133266d6f64656c3d30 |
| 687474703a2f2f6166697368612e6d61696c2e72752f636174616c6f672f3331342f776f6d656e2e72752f656e63793d312670616765332f3f6572726f7661742d70696e6e696b69 |
| 687474703a2f2f626f6e707269782e72752f696e6465782e72752f63696e656d612f6172742f3020393836203432342032333320d181d0b5d0b7d0bed0bd |
| 687474703a2f2f626f6e707269782e72752f696e6465782e72752f63696e656d612f6172742f4130303338372c33373937293b2072752926624c |
| 687474703a2f2f746f7572732f456b617465676f726979612532462673723d687474703a2f2f736c6f766172656e697965 |
| |
+--------------------------------------------------------------------------------------------------------------------------------------------------+
10 rows in set. Query took 0.006 seconds.
# This one doesn't work
❯ SELECT "URL" FROM hits GROUP BY "URL" LIMIT 10;
ArrowError(ExternalError(Execution("Internal error: Unsupported data type in hasher: Binary. This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker")))
10 rows in set. Query took 0.006 seconds.
Expected behavior
Additional context
That would be weird.
I think the error is expected: there is a missing match arm for the Binary
data type here. I expect that adding the implementation here will solve it:
https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/src/physical_plan/hash_utils.rs#L606
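To illustrate the idea of a missing match arm, here is a minimal, standalone sketch. It is not DataFusion's actual code: `Column`, `hash_value`, and `hash_column` are hypothetical names that only model how a per-type dispatch in a hasher might look, and how a type with no arm ends up in the "Unsupported data type in hasher" branch.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for a typed column (NOT an Arrow/DataFusion type).
pub enum Column {
    Int64(Vec<i64>),
    Utf8(Vec<String>),
    Binary(Vec<Vec<u8>>),
    Float64(Vec<f64>),
}

/// Hash a single value with the standard library's default hasher.
fn hash_value<T: Hash + ?Sized>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

/// Produce one hash per row, dispatching on the column's data type.
/// A type without a dedicated arm falls through to an error, which is
/// the situation the issue describes for Binary.
pub fn hash_column(col: &Column) -> Result<Vec<u64>, String> {
    match col {
        Column::Int64(values) => Ok(values.iter().map(hash_value).collect()),
        Column::Utf8(values) => Ok(values.iter().map(|s| hash_value(s.as_str())).collect()),
        // The kind of arm that was missing: hash the raw bytes of each row.
        Column::Binary(values) => Ok(values.iter().map(|b| hash_value(b.as_slice())).collect()),
        // Still unsupported in this sketch, to show the error path.
        Column::Float64(_) => Err("Unsupported data type in hasher: Float64".to_string()),
    }
}
```

With the `Binary` arm present, rows holding identical bytes get identical hashes, which is what a hash-based GROUP BY needs in order to place them in the same group. The real fix would have to follow whatever array types and hashing helpers `hash_utils.rs` already uses.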
@Dandandan you're right, I accidentally added the cast (`"URL"::TEXT`) and thought that it worked.
I am currently working on this.