
Comet cannot read decimals with physical type BINARY

Open comphead opened this issue 1 year ago • 6 comments

Describe the bug

A user reported that Comet crashes with the error

Column: [price], Expected: decimal(15,2), Found: BINARY.

when reading a Parquet file.

The Parquet file metadata is

  optional binary price (DECIMAL(15,2));

Spark without Comet reads the data with no issues
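For context, the Parquet DECIMAL logical type can be backed by several physical types (INT32, INT64, FIXED_LEN_BYTE_ARRAY, or BINARY); the file here uses the less common BINARY form. An illustrative schema contrasting the two byte-array encodings (the field names and the 7-byte width for precision 15 are my own, for illustration only):

```
message example {
  optional binary price (DECIMAL(15,2));                    // physical type BINARY -- this issue
  optional fixed_len_byte_array(7) price_fixed (DECIMAL(15,2)); // the more common encoding
}
```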

Steps to reproduce

No response

Expected behavior

Should read the value

Additional context

No response

comphead avatar Jun 13 '24 21:06 comphead

I'll look into this @comphead.

parthchandra avatar Jun 13 '24 22:06 parthchandra

Thanks @parthchandra. The issue is likely in org.apache.comet.parquet.TypeUtil.checkParquetType when deriving the decimal type

comphead avatar Jun 14 '24 00:06 comphead

Update on this: the Spark vectorized reader also throws the same error, so users have to turn off vectorized reading to read such files. It is also nearly impossible to write a binary decimal field (as opposed to a fixed-length byte-array field) using Spark; one has to use the Parquet writer directly or some other project (Avro, for example) to write such fields. Comet, just like the Spark vectorized reader, has no implementation to decode a binary decimal field. It should be possible to implement, but I'm wondering if this is a niche case. @comphead
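For reference, decoding such a value is straightforward with JDK classes. This is a minimal sketch (the class and method names are mine, not Comet's), assuming the BINARY bytes hold the big-endian two's-complement unscaled value as the Parquet specification defines for DECIMAL:

```java
import java.math.BigDecimal;
import java.math.BigInteger;

// Sketch only, not Comet's actual code: a BINARY-backed Parquet DECIMAL
// stores the big-endian two's-complement unscaled value, so decoding is a
// BigInteger conversion plus the scale taken from the logical type.
public class BinaryDecimalDecode {
    static BigDecimal decode(byte[] unscaled, int scale) {
        return new BigDecimal(new BigInteger(unscaled), scale);
    }

    public static void main(String[] args) {
        // Unscaled value 123456 with scale 2 models a DECIMAL(15,2) price of 1234.56
        byte[] raw = BigInteger.valueOf(123456).toByteArray();
        System.out.println(decode(raw, 2)); // prints 1234.56
    }
}
```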

parthchandra avatar Jun 28 '24 22:06 parthchandra

@comphead @parthchandra can we close this issue?

andygrove avatar Sep 19 '24 16:09 andygrove

Well, the issue still exists, but it is related to a deprecated Parquet representation where a decimal is stored as BINARY. We should probably mention in the documentation that this kind of conversion is not supported
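A workaround sketch that could accompany such a documentation note, based on the configuration keys in the Spark and Comet docs (verify the exact names for your versions):

```
# Fall back from Comet's native scan to Spark's reader for these files
spark.comet.scan.enabled=false
# Spark's vectorized reader hits the same error, so use the row-based reader
spark.sql.parquet.enableVectorizedReader=false
```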

comphead avatar Sep 19 '24 17:09 comphead

Yes, let's close this. We can revisit this if more people report it.

parthchandra avatar Sep 19 '24 21:09 parthchandra