
DataPageHeaderV2.IsCompressed is defaulting to false on read even though the spec seems to say it should be true

Open ngbrown opened this issue 5 months ago • 1 comment

I am trying to read a Snappy-compressed file written by the JavaScript library hyparquet-writer, and I am getting an exception:

System.IO.InvalidDataException
failed to read column 'name'
   at Parquet.Serialization.ParquetSerializer.DeserializeRowGroupAsync[T](ParquetRowGroupReader rg, ParquetSchema schema, Assembler`1 asm, ICollection`1 result, ParquetSerializerOptions options, CancellationToken cancellationToken, Boolean resultsAlreadyAllocated) in .\parquet-dotnet\src\Parquet\Serialization\ParquetSerializer.cs:line 573
   at Parquet.Serialization.ParquetSerializer.DeserializeRowGroupAsync(ParquetReader reader, Int32 rgi, Assembler`1 asm, List`1 result, ParquetSerializerOptions options, CancellationToken cancellationToken) in .\parquet-dotnet\src\Parquet\Serialization\ParquetSerializer.cs:line 498

System.ArgumentOutOfRangeException
Specified argument was out of the range of valid values.
   at Parquet.Encodings.ParquetPlainEncoder.Decode(Span`1 source, Span`1 data, SchemaElement tse) in .\parquet-dotnet\src\Parquet\Encodings\ParquetPlainEncoder.cs:line 1133
   at Parquet.Encodings.ParquetPlainEncoder.Decode(Array data, Int32 offset, Int32 count, SchemaElement tse, Span`1 source, Int32& elementsRead) in .\parquet-dotnet\src\Parquet\Encodings\ParquetPlainEncoder.cs:line 206
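For context, the failure comes from a plain untyped deserialize of the file (a minimal sketch; the stream is just the attached test file):

using System.IO;
using Parquet.Serialization;

// Minimal repro sketch: read the hyparquet-writer output with parquet-dotnet.
using Stream s = File.OpenRead("hyparquet.snappy.parquet");
ParquetSerializer.UntypedResult result = await ParquetSerializer.DeserializeAsync(s);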

I narrowed the cause down to a disagreement over what it means to leave the IsCompressed value unset. In parquet-dotnet, when ph.DataPageHeaderV2.IsCompressed is null, the reader defaults to not attempting to decompress the page:

https://github.com/aloneguid/parquet-dotnet/blob/90fcfa5b5874057ec03a30bc42101ee74f5be8b7/src/Parquet/File/DataColumnReader.cs#L230
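In other words, the current check behaves like the following sketch (paraphrased from the linked line, not verbatim):

// Paraphrase of the current check in DataColumnReader.cs: a null IsCompressed
// coalesces to false, so the page bytes are handed to the decoder undecompressed.
bool pageIsCompressed = ph.DataPageHeaderV2.IsCompressed ?? false;
if(!pageIsCompressed || _thriftColumnChunk.MetaData!.Codec == CompressionCodec.UNCOMPRESSED) {
    // no decompression happens; with a Snappy-compressed page the decoder then
    // reads garbage lengths, which surfaces as the ArgumentOutOfRangeException above
}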

However, hyparquet-writer assumes that the value can be omitted (that is, left null) when writing a compressed column:

https://github.com/hyparam/hyparquet-writer/blob/86f4c4314e9326d7989baa6f270079d6d0ce6851/src/datapage.js#L106
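That is a legitimate thing for a thrift writer to do: an optional field with a declared default may be left unwritten, and readers are expected to fall back to the default value.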

Also, in parquet-dotnet itself, the doc comment for that property says that a missing value means compressed:

https://github.com/aloneguid/parquet-dotnet/blob/90fcfa5b5874057ec03a30bc42101ee74f5be8b7/src/Parquet/Meta/Parquet.cs#L1480-L1483

If missing it is considered compressed.
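The linked property looks roughly like this (a sketch of the generated code in src/Parquet/Meta/Parquet.cs, not a verbatim copy):

/// <summary>
/// Whether the values are compressed. If missing it is considered compressed.
/// </summary>
public bool? IsCompressed { get; set; }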

This matches the specification, where the field is declared as optional bool is_compressed = true, i.e. with an explicit default of true:

https://github.com/apache/parquet-format/blob/87f2c8bf77eefb4c43d0ebaeea1778bd28ac3609/src/main/thrift/parquet.thrift#L741-L746

Here is a test file that I expect to be readable:

hyparquet.snappy.parquet.zip

I generated the test file with the following Node.js code:

import { parquetWriteFile } from "hyparquet-writer";

parquetWriteFile({
  filename: "./hyparquet.snappy.parquet",
  columnData: [
    { name: "name", data: ["Alice", "Bob", "Charlie"], type: "STRING" },
    { name: "age", data: [25, 30, 35], type: "INT32" },
  ],
});

ngbrown · Jun 20 '25 05:06

In my local clone, I added the following test to ParquetReaderOnTestFilesTest.cs, along with the data file attached above:

[Fact]
public async Task HyparquetCompressed() {
    using Stream s = OpenTestFile("hyparquet.snappy.parquet");
    ParquetSerializer.UntypedResult r = await ParquetSerializer.DeserializeAsync(s);
}

And the test fails. When I change the line in DataColumnReader.cs to default to true, all the tests in the solution pass (except that I don't have java.exe, so those integration tests fail):

if((!(ph.DataPageHeaderV2.IsCompressed ?? true)) || _thriftColumnChunk.MetaData!.Codec == CompressionCodec.UNCOMPRESSED) {
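Since the thrift IDL declares the field as optional bool is_compressed = true, defaulting to true here simply restores the spec's own default when a writer omits the field.

With that change in place, the test can also be extended to assert the decoded values (a sketch; it assumes UntypedResult exposes the rows as a Data list of string-to-object dictionaries, as in recent parquet-dotnet versions):

[Fact]
public async Task HyparquetCompressed_Values() {
    using Stream s = OpenTestFile("hyparquet.snappy.parquet");
    ParquetSerializer.UntypedResult r = await ParquetSerializer.DeserializeAsync(s);

    // Three rows were written by the Node.js snippet above.
    Assert.Equal(3, r.Data.Count);
    Assert.Equal("Alice", r.Data[0]["name"]);
    Assert.Equal(25, r.Data[0]["age"]);
}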

So is there an example data file somewhere that this change breaks?

ngbrown · Jun 20 '25 05:06