Library tries to uncompress pages even if declared uncompressed

Open papanikge opened this issue 2 months ago • 0 comments

Describe the bug

When a Parquet file is compressed, the compression is applied at the page layer. Parquet lets compression be toggled per page, as shown by the IsCompressed field of DataPageHeaderV2, which comes directly from the Thrift definition. The library detects the compression type (called CompressionCodec) and passes it down to the newBlockReader level. However, it also needs to check whether that specific page is actually compressed, and that check is missing.

Unit test to reproduce

I have a slim and simple unit test here, but I could write a full-fledged one with a test file if required.

parquet-go specific details

  • v0.12.0

Misc Details

  • I have already patched this in a fork, and we have been running it in production at Panther for the last 2 weeks. It seems to be working.
  • I have tested it with a test file too (not sure where to upload it if you guys want it)

papanikge, Apr 09 '24 16:04