Corrupt snappy compressed data when using row groups >5K rows

Open vc-74 opened this issue 5 years ago • 5 comments

Version: Parquet.Net v3.7.1

Runtime Version: .NET Framework v4.7.2

OS: Windows

Is the limit of 5K rows per row group a recommendation or a strict rule? If I create a Parquet file with 2 row groups of 10K rows each, containing a string column and a timestamp column, the resulting file cannot be read by Impala or pyarrow. The latter reports the error "pyarrow.lib.ArrowIOError: Arrow error: IOError: Corrupt snappy compressed data."

Code snippet reproducing the behavior

var stringField = new DataField<string>("attribute");
var timeStampField = new DataField<DateTimeOffset>("as_of");

var schema = new Schema(stringField, timeStampField);

const int rowGroupSize = 10_000;

var stringData = new string[rowGroupSize];
var timeStampData = new DateTimeOffset[rowGroupSize];

using Stream fileStream = File.Create("c:\\temp\\test.parquet");
using var parquetWriter = new ParquetWriter(schema, fileStream);

void addRowGroup()
{
    for (int i = 0; i < rowGroupSize; i++)
    {
        stringData[i] = "x";
        timeStampData[i] = DateTimeOffset.UtcNow;
    }

    var stringColumn = new DataColumn(stringField, stringData);
    var timeStampColumn = new DataColumn(timeStampField, timeStampData);

    using ParquetRowGroupWriter groupWriter = parquetWriter.CreateRowGroup();
    groupWriter.WriteColumn(stringColumn);
    groupWriter.WriteColumn(timeStampColumn);
}

addRowGroup();
addRowGroup();

Read from python:

import pyarrow.parquet as pq
table = pq.read_table('c:\\temp\\test.parquet')

The created Parquet file is attached: test.parquet.zip

vc-74 • May 14 '20 14:05

I'm having the exact same issue on Linux when the row count exceeds 5,000.

Version: Parquet.Net v3.7.1

Runtime Version: .NET Core SDK (3.1.201)

OS: Linux/Lubuntu

bktan81 • May 16 '20 13:05

I reverted to the previous version, 3.7.0, and it works fine.

bktan81 • May 16 '20 14:05

I still don't know which project is the right one, but I think you should file the issue at https://github.com/aloneguid/parquet-dotnet.

The problem has been solved in IronSnappy v1.2.2: https://github.com/aloneguid/IronSnappy/releases

However, no new version has been released yet at https://github.com/aloneguid/parquet-dotnet

skyyearxp • May 18 '20 01:05

I think the main project is now maintained at aloneguid/parquet-dotnet. You can search NuGet for Parquet.Net and look at the version history. Before v3.4.0 the project URL is https://github.com/elastacloud/parquet-dotnet; after v3.4.1 the project URL is https://github.com/aloneguid/parquet-dotnet. So be it. :)

skyyearxp • May 18 '20 01:05

I reverted to the previous version, 3.7.0, and it works fine.

Same here. I'll post the issue to the other repo, as suggested by @skyyearxp, and wait to upgrade.

vc-74 • May 19 '20 07:05