DataPageHeaderV2.IsCompressed defaults to false on read even though the spec seems to say the default should be true
I am trying to read a Snappy-compressed file written by the JavaScript library hyparquet-writer and am getting the following exceptions:
System.IO.InvalidDataException
failed to read column 'name'
at Parquet.Serialization.ParquetSerializer.DeserializeRowGroupAsync[T](ParquetRowGroupReader rg, ParquetSchema schema, Assembler`1 asm, ICollection`1 result, ParquetSerializerOptions options, CancellationToken cancellationToken, Boolean resultsAlreadyAllocated) in .\parquet-dotnet\src\Parquet\Serialization\ParquetSerializer.cs:line 573
at Parquet.Serialization.ParquetSerializer.DeserializeRowGroupAsync(ParquetReader reader, Int32 rgi, Assembler`1 asm, List`1 result, ParquetSerializerOptions options, CancellationToken cancellationToken) in .\parquet-dotnet\src\Parquet\Serialization\ParquetSerializer.cs:line 498
System.ArgumentOutOfRangeException
Specified argument was out of the range of valid values.
at Parquet.Encodings.ParquetPlainEncoder.Decode(Span`1 source, Span`1 data, SchemaElement tse) in .\parquet-dotnet\src\Parquet\Encodings\ParquetPlainEncoder.cs:line 1133
at Parquet.Encodings.ParquetPlainEncoder.Decode(Array data, Int32 offset, Int32 count, SchemaElement tse, Span`1 source, Int32& elementsRead) in .\parquet-dotnet\src\Parquet\Encodings\ParquetPlainEncoder.cs:line 206
I narrowed the cause down to a conflict over what an omitted IsCompressed value means. In parquet-dotnet, when ph.DataPageHeaderV2.IsCompressed is null, the reader defaults to not attempting decompression:
https://github.com/aloneguid/parquet-dotnet/blob/90fcfa5b5874057ec03a30bc42101ee74f5be8b7/src/Parquet/File/DataColumnReader.cs#L230
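For context, the condition at that line currently looks roughly like this (paraphrased from my clone, not an exact quote); the ?? false fallback is what turns an omitted flag into "not compressed":

// DataColumnReader.cs (paraphrased): a null IsCompressed falls back to false,
// so the page bytes are handed to the decoder without being decompressed first.
if((!(ph.DataPageHeaderV2.IsCompressed ?? false)) || _thriftColumnChunk.MetaData!.Codec == CompressionCodec.UNCOMPRESSED) {
    // ... decompression is skipped ...
}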
However, hyparquet-writer assumes that the value can be omitted (left null) when writing a compressed column:
https://github.com/hyparam/hyparquet-writer/blob/86f4c4314e9326d7989baa6f270079d6d0ce6851/src/datapage.js#L106
Also, in parquet-dotnet, the comment for that property says null means compressed:
https://github.com/aloneguid/parquet-dotnet/blob/90fcfa5b5874057ec03a30bc42101ee74f5be8b7/src/Parquet/Meta/Parquet.cs#L1480-L1483
"If missing it is considered compressed."
This matches the specification, where is_compressed is an optional bool that defaults to true:
https://github.com/apache/parquet-format/blob/87f2c8bf77eefb4c43d0ebaeea1778bd28ac3609/src/main/thrift/parquet.thrift#L741-L746
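To make the disagreement concrete, here is a minimal C# sketch (my own illustration, not library code) of how the two fallback choices treat the omitted optional field:

// The writer omitted the optional is_compressed field, so the reader sees null.
bool? isCompressed = null;

// parquet-dotnet today: an omitted flag is treated as "not compressed", so the page is not decompressed.
Console.WriteLine(isCompressed ?? false); // False

// Spec default: an omitted flag means "compressed", so the page should be decompressed.
Console.WriteLine(isCompressed ?? true);  // True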
Here is a test file that I expect to be readable:
I generated the test file with the following Node.js code:
import { parquetWriteFile } from "hyparquet-writer";

parquetWriteFile({
  filename: "./hyparquet.snappy.parquet",
  columnData: [
    { name: "name", data: ["Alice", "Bob", "Charlie"], type: "STRING" },
    { name: "age", data: [25, 30, 35], type: "INT32" },
  ],
});
In my local clone, I added the following test to ParquetReaderOnTestFilesTest.cs along with the data file attached above in the issue:
[Fact]
public async Task HyparquetCompressed() {
    using Stream s = OpenTestFile("hyparquet.snappy.parquet");
    ParquetSerializer.UntypedResult r = await ParquetSerializer.DeserializeAsync(s);
}
And the test fails. When I change the line in DataColumnReader.cs to default to true, all the tests in the solution pass (except the Java integration tests, which fail only because I don't have java.exe):
if((!(ph.DataPageHeaderV2.IsCompressed ?? true)) || _thriftColumnChunk.MetaData!.Codec == CompressionCodec.UNCOMPRESSED) {
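If it helps, an equivalent way to write the same change, which I find a little easier to read (same behaviour, just reworded), would be:

// A missing IsCompressed flag is treated as "compressed", matching the spec default;
// decompression is skipped only when the flag is explicitly false or the codec is UNCOMPRESSED.
if(ph.DataPageHeaderV2.IsCompressed is false || _thriftColumnChunk.MetaData!.Codec == CompressionCodec.UNCOMPRESSED) {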
So, is there an example data file somewhere for which defaulting to true doesn't work?