Cannot GROUP BY Binary

Open waitingkuo opened this issue 2 years ago • 4 comments

Describe the bug

This is part of #3048

I was running the ClickBench benchmark. One of its columns is binary, and the test query set includes a GROUP BY on that binary column. I got this error:

Internal error: Unsupported data type in hasher: Binary. This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker

To Reproduce Steps to reproduce the behavior:


# Download data
wget https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_0.parquet

# Use Datafusion-CLI
➜  datafusion git:(datafusion) ✗ datafusion-cli
DataFusion CLI v10.0.0

# Create External Table 
❯ CREATE EXTERNAL TABLE hits STORED AS PARQUET LOCATION 'hits_0.parquet';
0 rows in set. Query took 0.002 seconds.

# This query works
❯ SELECT "URL" FROM hits LIMIT 10;
+--------------------------------------------------------------------------------------------------------------------------------------------------+
| URL                                                                                                                                              |
+--------------------------------------------------------------------------------------------------------------------------------------------------+
|                                                                                                                                                  |
|                                                                                                                                                  |
|                                                                                                                                                  |
|                                                                                                                                                  |
| 687474703a2f2f686f6c6f64696c6e696b2e72752f7275737369612f30356a756c32303133266d6f64656c3d30                                                       |
| 687474703a2f2f6166697368612e6d61696c2e72752f636174616c6f672f3331342f776f6d656e2e72752f656e63793d312670616765332f3f6572726f7661742d70696e6e696b69 |
| 687474703a2f2f626f6e707269782e72752f696e6465782e72752f63696e656d612f6172742f3020393836203432342032333320d181d0b5d0b7d0bed0bd                     |
| 687474703a2f2f626f6e707269782e72752f696e6465782e72752f63696e656d612f6172742f4130303338372c33373937293b2072752926624c                             |
| 687474703a2f2f746f7572732f456b617465676f726979612532462673723d687474703a2f2f736c6f766172656e697965                                               |
|                                                                                                                                                  |
+--------------------------------------------------------------------------------------------------------------------------------------------------+
10 rows in set. Query took 0.006 seconds.

# This one doesn't work
❯ SELECT "URL" FROM hits GROUP BY "URL" LIMIT 10;
ArrowError(ExternalError(Execution("Internal error: Unsupported data type in hasher: Binary. This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker")))

10 rows in set. Query took 0.006 seconds.

waitingkuo Aug 05 '22 22:08

It looks like this only happens on macOS; it works in a Linux environment.

waitingkuo Aug 10 '22 14:08

That would be weird.

I think the error is expected: there is a missing match for the Binary data type in the hasher. I expect that adding the implementation here will solve it: https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/src/physical_plan/hash_utils.rs#L606
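
To illustrate what a "missing match" means here, the following is a minimal std-only sketch, not DataFusion's actual code. The `Column` enum, `hash_one`, and `create_hashes` are hypothetical names invented for this example. The hasher dispatches on the column type, and any type without its own arm falls through to an "unsupported" error like the one reported above. Adding a `Binary` arm that hashes the raw bytes removes the error for that type.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for a typed column; not an Arrow or DataFusion type.
enum Column {
    Int64(Vec<i64>),
    Utf8(Vec<String>),
    Binary(Vec<Vec<u8>>),
    /// Any type the hasher has no arm for.
    Other { type_name: String },
}

/// Hashes a single value with the standard library's default hasher.
fn hash_one<T: Hash + ?Sized>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

/// Returns one hash per row, or an error when the type has no match arm.
fn create_hashes(column: &Column) -> Result<Vec<u64>, String> {
    match column {
        Column::Int64(values) => Ok(values.iter().map(|v| hash_one(v)).collect()),
        Column::Utf8(values) => Ok(values.iter().map(|s| hash_one(s.as_str())).collect()),
        // Without this arm, Binary columns would hit the error branch below.
        Column::Binary(values) => Ok(values.iter().map(|b| hash_one(b.as_slice())).collect()),
        Column::Other { type_name } => {
            Err(format!("Unsupported data type in hasher: {type_name}"))
        }
    }
}
```

In this sketch, equal byte strings produce equal hashes, which is the property a hash-based GROUP BY relies on to place identical keys in the same group.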

Dandandan Aug 10 '22 16:08

@Dandandan you're right, I accidentally added the cast (`"URL"::TEXT`) and thought that it worked.
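
A plausible reading of why the cast hid the problem is that the column was then grouped as text rather than as Binary, although the thread does not confirm this. The std-only sketch below shows the relationship between the two representations: it decodes the hex rendering that the CLI printed for a binary value and reinterprets the bytes as UTF-8 text. `hex_to_bytes` and `binary_to_text` are hypothetical helpers written for this example, not DataFusion functions.

```rust
/// Decodes a hex string (two digits per byte) into raw bytes.
/// Returns None if the length is odd or a digit pair is invalid.
fn hex_to_bytes(hex: &str) -> Option<Vec<u8>> {
    if hex.len() % 2 != 0 {
        return None;
    }
    (0..hex.len())
        .step_by(2)
        .map(|i| u8::from_str_radix(hex.get(i..i + 2)?, 16).ok())
        .collect()
}

/// Mimics a binary-to-text cast: succeeds only for valid UTF-8 bytes.
fn binary_to_text(bytes: Vec<u8>) -> Option<String> {
    String::from_utf8(bytes).ok()
}
```

Applied to the first non-empty value in the SELECT output above, `hex_to_bytes` followed by `binary_to_text` yields a string beginning with `http://holodilnik.ru`, which suggests that the binary column holds UTF-8 encoded URLs.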

waitingkuo Aug 10 '22 16:08

I am currently working on this.

Dandandan Aug 10 '22 17:08