Do not try to uncompress pages that are not compressed

Open papanikge opened this issue 2 months ago • 0 comments

[This is a sync of a fix already done and tested in a fork]

Relates to https://github.com/fraugster/parquet-go/issues/102

Context

Panther has been running fraugster/parquet-go in production for some months now, ingesting TBs of data.

Some customers reported that they got the following error:

snappy: corrupt input

After some investigation, we found that the read function of the V2 DataPage did not check the IsCompressed flag (already present in DataPageHeaderV2).

More context: Parquet files, when compressed, are compressed at the page layer. Parquet supports compression per page (as shown by the DataPageHeaderV2 IsCompressed field, which comes directly from the thrift definition). The library detects the compression type (called CompressionCodec) and passes it down to the newBlockReader level. However, it still needs to check whether that specific page is actually compressed, and that check was missing.
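To illustrate the missing check, here is a minimal, self-contained sketch. It is not the library's actual code: `pageHeaderV2`, `decompress`, and `readPageValues` are hypothetical names made up for this example, and gzip stands in for snappy only because it is in the Go standard library.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// pageHeaderV2 is a hypothetical stand-in for the thrift-generated
// DataPageHeaderV2; only the field relevant to this issue is modeled.
type pageHeaderV2 struct {
	IsCompressed bool
}

// decompress is a stand-in codec. gzip is an assumption made for the
// sake of a stdlib-only example; the issue itself concerns snappy.
func decompress(block []byte) ([]byte, error) {
	r, err := gzip.NewReader(bytes.NewReader(block))
	if err != nil {
		return nil, err
	}
	defer r.Close()
	return io.ReadAll(r)
}

// compress is the matching helper, used only to build test input.
func compress(raw []byte) ([]byte, error) {
	var buf bytes.Buffer
	w := gzip.NewWriter(&buf)
	if _, err := w.Write(raw); err != nil {
		return nil, err
	}
	if err := w.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// readPageValues returns the value bytes of a page. The codec runs only
// when the page header says this particular page is compressed, even if
// the column chunk as a whole declares a compression codec.
func readPageValues(h pageHeaderV2, block []byte) ([]byte, error) {
	if !h.IsCompressed {
		return block, nil
	}
	return decompress(block)
}

func main() {
	raw := []byte("plain page bytes")

	// An uncompressed page passes through untouched.
	out, err := readPageValues(pageHeaderV2{IsCompressed: false}, raw)
	fmt.Println(string(out), err)

	// Forcing decompression of the same bytes fails, which is the
	// analogue of the "snappy: corrupt input" error reported above.
	_, err = readPageValues(pageHeaderV2{IsCompressed: true}, raw)
	fmt.Println("error when decompressing an uncompressed page:", err != nil)
}
```

The point of the sketch is only the early return in `readPageValues`: the per-page flag decides whether the codec runs, not the codec declared for the column chunk.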

Checks

FWIW, I double-checked this against parquet-go/parquet-go and confirmed that it does not try to decompress such pages.

  • [x] Unit tests added
  • [x] Full test run (and screenshot)
  • [x] Run all unit tests with the race detector on
  • [x] Run the linters locally via golangci-lint run

Ran all the tests with https://github.com/apache/parquet-testing

image

[Note: I can try adding a file to https://github.com/apache/parquet-testing before trying to merge this upstream]

Ran all unit tests with the race detector on

image

Added a unit test ... that passes with IsCompressed set to false, but breaks if I set IsCompressed => true, because the input is not snappy-compressed.

papanikge, Apr 09 '24 16:04