parquet-go
Do not try to uncompress pages that are not compressed
[This is a sync of a fix already done and tested in a fork]
Relates to https://github.com/fraugster/parquet-go/issues/102
Context
Panther has been running fraugster/parquet-go in production for some months now, ingesting TBs of data.
Some customers reported that they got the following error:
snappy: corrupt input
After some investigation, we found that the read function of the V2 DataPage did not check the IsCompressed flag (already present in DataPageHeaderV2).
More context: when Parquet files are compressed, the compression is applied at the page layer. Parquet supports per-page compression, as shown by the IsCompressed field of DataPageHeaderV2, which comes directly from the Thrift definition. The library detects the compression type (called CompressionCodec) and passes it down to the newBlockReader level. However, it still needs to check whether that specific page is actually compressed, and that check was missing.
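To illustrate the missing check, here is a minimal, self-contained sketch. It is not the library's actual code: pageHeader, decompressFunc, pagePayload, gunzip and gzipBytes are hypothetical names made up for this example, and gzip from the standard library stands in for snappy so the snippet has no external dependencies.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// decompressFunc is a hypothetical stand-in for a codec-specific block
// decompressor (snappy, gzip, ...).
type decompressFunc func(block []byte) ([]byte, error)

// pageHeader is a hypothetical type that mirrors only the one field of
// DataPageHeaderV2 that matters for this sketch.
type pageHeader struct {
	IsCompressed bool
}

// pagePayload returns the usable bytes of a page. The codec is only applied
// when the page header says this particular page is compressed.
func pagePayload(h pageHeader, raw []byte, codec decompressFunc) ([]byte, error) {
	if !h.IsCompressed || codec == nil {
		// The page was written uncompressed: hand the bytes back untouched.
		return raw, nil
	}
	out, err := codec(raw)
	if err != nil {
		return nil, fmt.Errorf("decompressing page: %w", err)
	}
	return out, nil
}

// gunzip is a standard-library decompressor used here in place of snappy.
func gunzip(block []byte) ([]byte, error) {
	r, err := gzip.NewReader(bytes.NewReader(block))
	if err != nil {
		return nil, err
	}
	defer r.Close()
	return io.ReadAll(r)
}

// gzipBytes compresses plain so the sketch can also exercise a compressed page.
func gzipBytes(plain []byte) ([]byte, error) {
	var buf bytes.Buffer
	w := gzip.NewWriter(&buf)
	if _, err := w.Write(plain); err != nil {
		return nil, err
	}
	if err := w.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}
```

With this shape, calling pagePayload with IsCompressed set to false returns the raw bytes unchanged, while passing the same uncompressed bytes with IsCompressed set to true makes the codec fail, which is the same class of failure as the snappy error reported above.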
Checks
FWIW, I double-checked this against parquet-go/parquet-go and confirmed that it does not try to decompress such pages.
- [x] Unit tests added
- [x] Full test run (and screenshot)
- [x] Run all unit tests with the race detector on
- [x] Run the linters locally via golangci-lint run
Ran all the tests with https://github.com/apache/parquet-testing
[Note: I can try adding a file to https://github.com/apache/parquet-testing before trying to merge this into upstream]
Ran all unit tests with the race detector on
Added a unit test
... that passes with IsCompressed => false, but breaks if I set IsCompressed => true, because the input is not snappy-compressed