[C++][Parquet] Detect parquet-mr style dictionary_page
parquet-mr incorrectly writes the pair (dictionary_page_offset, first_data_page_offset) as (0, dictionary_page_offset).
As a result, whenever parquet-cpp (pyarrow) reads such a file, it reports has_dictionary_page: False and dictionary_page_offset: None, even though the column is dictionary-encoded:
row group 0
--------------------------------------------------------------------------------
x: DOUBLE SNAPPY DO:0 FPO:4 SZ:1632/31635/19.38 VC:70000 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[min: 1.0, max: 5.0, num_nulls: 10000]
y: BINARY SNAPPY DO:0 FPO:1636 SZ:268/3885/14.50 VC:70000 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[min: a, max: é, num_nulls: 10000]
x TV=70000 RL=0 DL=1 DS: 5 DE:PLAIN_DICTIONARY
----------------------------------------------------------------------------
page 0: DLE:RLE RLE:BIT_PACKED VLE:PLAIN_DICTIONARY ST:[min: 1.0, max: 5.0, num_nulls: 10000] SZ:31514 VC:70000
<pyarrow._parquet.ColumnChunkMetaData object at 0x7fd3effc1120>
file_offset: 4
file_path:
physical_type: DOUBLE
num_values: 70000
path_in_schema: x
is_stats_set: True
statistics:
<pyarrow._parquet.RowGroupStatistics object at 0x7fd3effc1cb0>
has_min_max: True
min: 1.0
max: 5.0
null_count: 10000
distinct_count: 0
num_values: 60000
physical_type: DOUBLE
compression: SNAPPY
encodings: ('PLAIN_DICTIONARY', 'RLE', 'BIT_PACKED')
has_dictionary_page: False
dictionary_page_offset: None
data_page_offset: 4
total_compressed_size: 1632
total_uncompressed_size: 31635
Is parquet-cpp still able to use the dictionary in this case?
It would be nice if parquet-cpp could recognize the parquet-mr issue and set has_dictionary_page to True.
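To illustrate the kind of detection being asked for, here is a minimal sketch in Python. It is not parquet-cpp's actual logic. The helper name guess_dictionary_page_offset is hypothetical, and the heuristic is an assumption: if the metadata reports no dictionary page but the encodings list contains a dictionary encoding, the dictionary page probably starts where data_page_offset points. The attribute names match the pyarrow ColumnChunkMetaData fields printed above, and the sample values are copied from column x.

```python
from types import SimpleNamespace

# Encodings that imply a dictionary page was written for the column chunk.
DICTIONARY_ENCODINGS = {"PLAIN_DICTIONARY", "RLE_DICTIONARY"}


def guess_dictionary_page_offset(col):
    """Hypothetical helper: return the likely dictionary page offset, or None.

    `col` is any object exposing the same attributes as pyarrow's
    ColumnChunkMetaData: has_dictionary_page, dictionary_page_offset,
    data_page_offset and encodings.
    """
    # Well-formed metadata: trust the recorded dictionary page offset.
    if col.has_dictionary_page and col.dictionary_page_offset is not None:
        return col.dictionary_page_offset
    # Assumed parquet-mr style: no dictionary offset is recorded, but a
    # dictionary encoding is listed, so the dictionary page is taken to
    # start at data_page_offset.
    if DICTIONARY_ENCODINGS.intersection(col.encodings):
        return col.data_page_offset
    # No sign of dictionary encoding at all.
    return None


# Stand-in for the metadata of column x shown above.
x = SimpleNamespace(
    encodings=("PLAIN_DICTIONARY", "RLE", "BIT_PACKED"),
    has_dictionary_page=False,
    dictionary_page_offset=None,
    data_page_offset=4,
)
print(guess_dictionary_page_offset(x))  # prints 4
```

With a real file, the same function could be applied to pq.ParquetFile(path).metadata.row_group(i).column(j). A reader would still need to confirm the guess by checking that the page header at that offset really is a dictionary page.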
https://stackoverflow.com/questions/55225108/why-is-dictionary-page-offset-0-for-plain-dictionary-encoding/
Reporter: colin fang
Note: This issue was originally created as PARQUET-1547. Please see the migration documentation for further details.
This issue hasn't had activity in a long time. If it's still being worked on, please leave a comment. Otherwise, it will be closed on 23rd June.
Labelled Status: Stale-Warning for tracking.