parquetjs
ENUM types unsupported
I'm trying to read a file which contains ENUM types (e.g. parquet-tools schema shows something like required binary entryMethod (ENUM);), but parquetjs.ParquetReader.openFile() just throws an error:
Invalid ENUM value
... which means I can't use this package to read my files.
What are my options? Are there any plans to support this type?
Thanks,
FYI, I've posted a very similar issue at https://github.com/kbajalc/parquets/issues/3 because both projects seem to suffer from the same problem.
Do you have a sample parquet file that fails?
@ZJONSSON, nothing that I can share right now ... Sensitive data and all that.
For now I can share more output from parquet-tools - if that might help.
creator: parquet-mr version 1.9.0 (build 38262e2c80015d0935dad20f8e18f2d6f9fbd03c)
<snip>
extra: writer.model.name = avro
<snip>
file schema: <redacted>
--------------------------------------------------------------------------------
<redacted>
..entryMethod: REQUIRED BINARY O:ENUM R:1 D:1
<snip>
row group 1: RC:100000 TS:74883019 OFFSET:4
--------------------------------------------------------------------------------
<snip>
..entryMethod: BINARY SNAPPY DO:0 FPO:39740111 SZ:227406/230832/1.02 VC:747359 ENC:RLE,PLAIN_DICTIONARY ST:[no stats for this column]
<snip>
I can try to find out how the file was created. Maybe see how to rustle one up.
Also FYI, when I look at the values for the ENUM columns, I can see that they are base 64 encoded strings.
Quote from https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#enum
ENUM
ENUM annotates the binary primitive type and indicates that the value was converted from an enumerated type in another data model (e.g. Thrift, Avro, Protobuf). Applications using a data model lacking a native enum type should interpret ENUM annotated field as a UTF-8 encoded string.
The sort order used for ENUM values is unsigned byte-wise comparison.
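For anyone who gets ENUM values back as base64 strings from some other tool, here is a minimal Node.js sketch of what the spec text above implies: treat the bytes as UTF-8 and compare them byte-wise. The helper names `decodeEnum` and `compareEnum` and the sample value are made up for illustration; they are not part of parquetjs, and this does not fix the `Invalid ENUM value` error itself.

```javascript
// Hypothetical helpers, not part of parquetjs.

// Decode a base64-encoded ENUM value into the UTF-8 string it represents.
function decodeEnum(base64Value) {
  return Buffer.from(base64Value, 'base64').toString('utf8');
}

// Order two ENUM strings by unsigned byte-wise comparison of their UTF-8 bytes.
// Returns a negative number, zero, or a positive number, so it can be passed to Array.prototype.sort.
function compareEnum(a, b) {
  return Buffer.compare(Buffer.from(a, 'utf8'), Buffer.from(b, 'utf8'));
}

// Made-up sample value: encode a string to base64, then decode it again.
const sample = Buffer.from('MANUAL', 'utf8').toString('base64');
console.log(decodeEnum(sample));

// Sorting a few made-up ENUM strings in byte-wise order.
console.log(['b', 'B', 'a'].sort(compareEnum));
```

Note that byte-wise order differs from locale-aware ordering: uppercase ASCII letters sort before lowercase ones.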
Thank you - please see if you can create a simple example that you can share? That way I can take a look and see if there is an easy fix!
Any updates?
Hi, I have to be honest and say that I don't think I'm ever going to be able to spend the time needed to reproduce this.
I've decided to use AWS S3 Select to extract the data I need from my parquet files.
Thanks, and sorry if this has wasted anyone's time.
@lqueryvg All good. I'm having the same issue right now and was wondering whether @ZJONSSON or others have had a chance to work on ENUMs?
Could anyone please let me know how to generate the logical type "DECIMAL"?