parquet-dotnet
Feature Request: Consider use of IReadOnlyList<object> or ReadOnlySpan<object> instead of System.Array
Version: Parquet.Net v.3.8
Runtime Version: All
OS: All
Expected behavior
In order to create a DataColumn object I am forced to use an Array, which often leads to an unnecessary copy of data. Using an interface such as IReadOnlyList<object> (or a ReadOnlySpan<object>) would avoid that copy.
I am aware that this is breaking, but it should be fairly simple to change the calling code.
Actual behavior
It always uses an array, which is not as flexible as an interface.
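For illustration, here is a minimal sketch of the copy the current constructor forces and of what an IReadOnlyList-based overload (hypothetical, not in v3.8) would avoid; the field name and the ReadValuesFromSomewhere helper are made up:

using System.Collections.Generic;
using System.Linq;
using Parquet.Data;

// Values accumulated while streaming rows in.
List<int> values = ReadValuesFromSomewhere();   // hypothetical source
var field = new DataField<int>("id");

// v3.8: DataColumn takes System.Array, so the list has to be copied first.
var column = new DataColumn(field, values.ToArray());

// Requested (does not exist today): wrap the list directly, no second buffer.
// var column = new DataColumn(field, (IReadOnlyList<int>)values);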
Yes! This could significantly improve the allocation profile for our usages.
Hi @aloneguid,
I would at least opt for extending the interface with something like
DataColumn(Array buffer, int offset, int count, int[] repetitionLevels = null);
(But Span/Memory would be also nice.)
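Just to illustrate what that proposed overload would enable (it does not exist in v3.8; the buffer size and usedCount below are made-up stand-ins for whatever growing buffer the caller keeps):

using System;
using Parquet.Data;

// An oversized buffer; only the first usedCount entries are valid.
int[] buffer = new int[65536];
int usedCount = 10000;
var field = new DataField<int>("value");

// Today: the valid prefix must be copied into a right-sized array.
var copy = new int[usedCount];
Array.Copy(buffer, copy, usedCount);
var column = new DataColumn(field, copy);

// With the proposed overload: pass the oversized buffer directly.
// var column = new DataColumn(field, buffer, 0, usedCount);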
Or am I using it wrong? How else am I supposed to write columns in a row group whose size I may not know in advance without copying the whole array for each column?
Thanks
P.S.: I am happy to implement that and make a PR if there's an agreement about it.
EDIT: I was digging a bit more and it actually seems to me it may not be worth the hassle, at least for us...
@hrabeale I've made a PR adding offset and count parameters and I'm using it locally, passing in arrays from ArrayPool<T>. It seems to work. :)
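For reference, this is roughly the pattern I mean, assuming the PR's offset/count overload; FillBuffer and rowGroupWriter are placeholders for the surrounding code:

using System.Buffers;
using Parquet.Data;

var field = new DataField<long>("timestamp");

// Rent a buffer; the pool may hand back a larger array than requested.
long[] rented = ArrayPool<long>.Shared.Rent(100_000);
try
{
    int count = FillBuffer(rented);   // placeholder: however many values were produced

    // With the offset/count overload from the PR, only the filled prefix is written,
    // so the oversized pooled array never has to be copied or trimmed.
    rowGroupWriter.WriteColumn(new DataColumn(field, rented, 0, count));
}
finally
{
    ArrayPool<long>.Shared.Return(rented);
}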
Sorry if I'm abusing this thread. I still cannot get my head around the usage. Even if they adopt this PR, I still have to pre-allocate one huge array. Wouldn't it be possible to write a column in batches? We wrapped parquet-cpp and used it in a manner similar to this:
ParquetRowGroupWriter writer;
// ... _buffers holds fixed-size chunks; _nextBufferIndex is the fill level of the last chunk

for (var i = 0; i < _buffers.Count; ++i)
{
    if (i == _buffers.Count - 1)
    {
        // Last chunk is only partially filled: copy the used prefix
        // (or use the offset and length instead of a copy, if available).
        var data = new T[_nextBufferIndex];
        Array.Copy(_buffers[i], data, _nextBufferIndex);
        writer.WriteColumn(new DataColumn(_field, data), closeColumn: true);
    }
    else
    {
        writer.WriteColumn(new DataColumn(_field, _buffers[i]), closeColumn: false);
    }
}
So if I'm reading the code correctly, parquet-dotnet only supports writing a column as a single data page per row group?
I think you can do that theoretically, but writing in chunks will effectively discard any logical compression. You could use a smaller row group size and array pooling?
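To illustrate the smaller-row-group idea with the existing v3.8 API (the field/schema names and the NextBatch producer are made up; the batch size is arbitrary):

using System;
using System.Buffers;
using System.IO;
using Parquet;
using Parquet.Data;

const int BatchSize = 4096;
var field = new DataField<int>("value");
var schema = new Schema(field);

using (Stream stream = File.Create("data.parquet"))
using (var writer = new ParquetWriter(schema, stream))
{
    int[] batch = ArrayPool<int>.Shared.Rent(BatchSize);
    try
    {
        int count;
        while ((count = NextBatch(batch, BatchSize)) > 0)   // placeholder producer
        {
            // One small row group per batch; copy only the filled prefix,
            // since the v3.8 DataColumn constructor takes a whole array.
            var data = new int[count];
            Array.Copy(batch, data, count);

            using (ParquetRowGroupWriter rg = writer.CreateRowGroup())
            {
                rg.WriteColumn(new DataColumn(field, data));
            }
        }
    }
    finally
    {
        ArrayPool<int>.Shared.Return(batch);
    }
}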
I am still trying to get a better grasp of the performance implications. I reckon that with a reasonable data page size one may still get a good benefit from compression while maintaining fair access speed, hence the recommended 8 KB data page size?
Until now we have been using another proprietary column-based format with batches of around 4096 items (16 KB for int32), and the overall compression ratio was pretty much the same as if we had converted the data to parquet with some other tool. In some specific cases the compression may even improve, since one can choose different compression/encoding strategies per data page.
From what I have seen so far, the common practice with parquet is to have one row group per file. Tools like Spark are then optimized to scale horizontally, sending these row groups to allocated nodes for separate processing. Or am I getting it wrong?
Actually, as I wrote in https://github.com/aloneguid/parquet-dotnet/issues/137, I already tested this (writing smaller data pages with parquet-dotnet) and it had close to no impact on compression.