Corrupt snappy compressed data when using row groups >5K rows
Version: Parquet.Net v3.7.1
Runtime Version: .NET Framework v4.7.2
OS: Windows
Is the limit of 5K rows per row group a recommendation or a strict rule? If I create a Parquet file with 2 row groups of 10K rows each, containing a string column and a timestamp column, the resulting file cannot be read by Impala or pyarrow. The latter reports a "pyarrow.lib.ArrowIOError: Arrow error: IOError: Corrupt snappy compressed data." error.
Code snippet reproducing the behavior:
// Namespaces used by this snippet (Parquet.Net v3.x)
using System;
using System.IO;
using Parquet;
using Parquet.Data;

var stringField = new DataField<string>("attribute");
var timeStampField = new DataField<DateTimeOffset>("as_of");
var schema = new Schema(stringField, timeStampField);

// 10K rows per row group, i.e. above the 5K mentioned in the title
const int rowGroupSize = 10_000;
var stringData = new string[rowGroupSize];
var timeStampData = new DateTimeOffset[rowGroupSize];

using Stream fileStream = File.Create("c:\\temp\\test.parquet");
using var parquetWriter = new ParquetWriter(schema, fileStream);

// Fills both buffers and writes them as one row group
void addRowGroup()
{
    for (int i = 0; i < rowGroupSize; i++)
    {
        stringData[i] = "x";
        timeStampData[i] = DateTimeOffset.UtcNow;
    }

    var stringColumn = new DataColumn(stringField, stringData);
    var timeStampColumn = new DataColumn(timeStampField, timeStampData);

    using ParquetRowGroupWriter groupWriter = parquetWriter.CreateRowGroup();
    groupWriter.WriteColumn(stringColumn);
    groupWriter.WriteColumn(timeStampColumn);
}

// Write two row groups of 10K rows each
addRowGroup();
addRowGroup();
Read from python:
import pyarrow.parquet as pq
table = pq.read_table('c:\\temp\\test.parquet')
The created Parquet file is attached: test.parquet.zip
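To narrow down which row group or column triggers the error, something like the following sketch might help. It reads each column of each row group separately instead of the whole table at once. It assumes pyarrow's `ParquetFile` interface (`metadata.num_row_groups`, `schema_arrow.names`, `read_row_group`); the helper names `find_unreadable_chunks` and `check_file` are made up for illustration and are not part of pyarrow.

```python
def find_unreadable_chunks(parquet_file):
    """Read every (row group, column) pair separately and collect the failures.

    `parquet_file` is expected to behave like pyarrow.parquet.ParquetFile.
    Returns a list of (row_group_index, column_name, error_message) tuples.
    """
    failures = []
    for group_index in range(parquet_file.metadata.num_row_groups):
        for column_name in parquet_file.schema_arrow.names:
            try:
                # Decoding one column chunk at a time isolates the corrupt one.
                parquet_file.read_row_group(group_index, columns=[column_name])
            except Exception as error:  # e.g. the ArrowIOError from the report
                failures.append((group_index, column_name, str(error)))
    return failures


def check_file(path):
    """Hypothetical wrapper: open `path` with pyarrow and print each failure."""
    import pyarrow.parquet as pq  # imported lazily; requires pyarrow

    for group_index, column_name, message in find_unreadable_chunks(pq.ParquetFile(path)):
        print(f"row group {group_index}, column {column_name!r}: {message}")
```

Calling `check_file('c:\\temp\\test.parquet')` on the attached file should then list the affected chunks, assuming the footer metadata itself is readable.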
I'm having the exact same issue on Linux when the row count is > 5000.
Version: Parquet.Net v3.7.1
Runtime Version: .NET Core SDK (3.1.201)
OS: Linux/Lubuntu
I restored to the previous version 3.7.0 and it works fine
I still don't know which project is the right one, but I think you should file the issue at https://github.com/aloneguid/parquet-dotnet.
The problem has been solved in IronSnappy v1.2.2 (https://github.com/aloneguid/IronSnappy/releases),
but no new version of Parquet.Net has been released yet: https://github.com/aloneguid/parquet-dotnet
I think the main project is maintained at aloneguid/parquet-dotnet now. You can search NuGet, find Parquet.Net, and look at the version history. Before v3.4.0 the project URL is https://github.com/elastacloud/parquet-dotnet; after 3.4.1 the project URL is https://github.com/aloneguid/parquet-dotnet, so be it. :)
I restored to the previous version 3.7.0 and it works fine
Same here, I'll post the issue to the other repo as suggested by @skyyearxp and wait to upgrade.