[C++][Parquet] Detect parquet-mr style dictionary_page

Open • asfimport opened this issue 6 years ago • 1 comment

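parquet-mr incorrectly writes (dictionary_page_offset, first_data_page_offset) as (0, dictionary_page_offset): the dictionary page offset is recorded as 0, and the first data page offset field holds the dictionary page's offset instead.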

As a result, whenever parquet-cpp (pyarrow) reads such a file, it reports has_dictionary_page: False and dictionary_page_offset: None.


row group 0 
--------------------------------------------------------------------------------
x:  DOUBLE SNAPPY DO:0 FPO:4 SZ:1632/31635/19.38 VC:70000 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[min: 1.0, max: 5.0, num_nulls: 10000]
y:  BINARY SNAPPY DO:0 FPO:1636 SZ:268/3885/14.50 VC:70000 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[min: a, max: é, num_nulls: 10000]

    x TV=70000 RL=0 DL=1 DS: 5 DE:PLAIN_DICTIONARY
    ----------------------------------------------------------------------------
    page 0:                   DLE:RLE RLE:BIT_PACKED VLE:PLAIN_DICTIONARY ST:[min: 1.0, max: 5.0, num_nulls: 10000] SZ:31514 VC:70000


<pyarrow._parquet.ColumnChunkMetaData object at 0x7fd3effc1120>
  file_offset: 4
  file_path: 
  physical_type: DOUBLE
  num_values: 70000
  path_in_schema: x
  is_stats_set: True
  statistics:
    <pyarrow._parquet.RowGroupStatistics object at 0x7fd3effc1cb0>
      has_min_max: True
      min: 1.0
      max: 5.0
      null_count: 10000
      distinct_count: 0
      num_values: 60000
      physical_type: DOUBLE
  compression: SNAPPY
  encodings: ('PLAIN_DICTIONARY', 'RLE', 'BIT_PACKED')
  has_dictionary_page: False
  dictionary_page_offset: None
  data_page_offset: 4
  total_compressed_size: 1632
  total_uncompressed_size: 31635

Is parquet-cpp still able to use the dictionary in this case? It would be nice if parquet-cpp could recognize the parquet-mr issue and set has_dictionary_page to True.
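
One way such chunks could be flagged is to look at the encodings rather than the offset fields: a chunk whose encodings include a dictionary encoding has to carry a dictionary somewhere, even when dictionary_page_offset is missing or 0. Below is a minimal sketch of that heuristic, not what parquet-cpp actually does. `ChunkInfo`, `likely_has_dictionary_page` and `guess_dictionary_page_offset` are hypothetical names introduced for illustration; the field names mirror those printed in the metadata dump above, and `RLE_DICTIONARY` is included on the assumption that some writers use it in place of `PLAIN_DICTIONARY`.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Encodings that imply the chunk stores a dictionary.
DICTIONARY_ENCODINGS = frozenset({"PLAIN_DICTIONARY", "RLE_DICTIONARY"})


@dataclass
class ChunkInfo:
    """Hypothetical stand-in for the column chunk fields shown in the dump."""
    encodings: Tuple[str, ...]
    has_dictionary_page: bool
    dictionary_page_offset: Optional[int]
    data_page_offset: int


def likely_has_dictionary_page(chunk) -> bool:
    """Return True if the chunk probably contains a dictionary page."""
    # Trust the metadata when it reports a usable (non-zero) offset.
    if chunk.has_dictionary_page and chunk.dictionary_page_offset:
        return True
    # Otherwise fall back to the encodings list.
    return not DICTIONARY_ENCODINGS.isdisjoint(chunk.encodings)


def guess_dictionary_page_offset(chunk) -> Optional[int]:
    """Best-guess offset of the dictionary page, or None if there is none."""
    if not likely_has_dictionary_page(chunk):
        return None
    if chunk.dictionary_page_offset:
        return chunk.dictionary_page_offset
    # Assumption: parquet-mr style file, where the dictionary page offset
    # was written into the first data page offset field.
    return chunk.data_page_offset


# Values copied from the column "x" metadata above.
x = ChunkInfo(("PLAIN_DICTIONARY", "RLE", "BIT_PACKED"), False, None, 4)
print(likely_has_dictionary_page(x), guess_dictionary_page_offset(x))
```

For column x this flags the chunk as dictionary-encoded and guesses offset 4, which matches the FPO value in the parquet-mr dump. The functions only read attributes, so an object exposing the same four attribute names could presumably be passed in directly, though that is untested here.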

https://stackoverflow.com/questions/55225108/why-is-dictionary-page-offset-0-for-plain-dictionary-encoding/

Reporter: colin fang

Note: This issue was originally created as PARQUET-1547. Please see the migration documentation for further details.

Mar 18 '19 18:03 asfimport

This issue hasn't had activity in a long time. If it's still being worked on, please leave a comment. Otherwise, it will be closed on 23rd June.

Labelled Status: Stale-Warning for tracking.

Jun 21 '25 08:06 thisisnic