parquet-dotnet

Feature Request: Consider use of IReadOnlyList<object> or ReadOnlySpan<object> instead of System.Array

Open aersam opened this issue 3 years ago • 7 comments

Version: Parquet.Net v3.8

Runtime Version: All

OS: All

Expected behavior

In order to create a DataColumn object I am forced to use an Array, which often leads to an unnecessary copy of the data. Using an interface such as IReadOnlyList<object> would be a lot nicer from a user perspective. If this is not possible, consider using ReadOnlySpan<object>.

I am aware that this is a breaking change, but it should be fairly simple to update the calling code.
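For illustration, a minimal sketch of the copy the current constructor forces; the commented-out IReadOnlyList overload is hypothetical, it is what this request asks for:

using System.Collections.Generic;
using System.Linq;
using Parquet.Data;

var field = new DataField<int>("id");
List<int> values = Enumerable.Range(0, 1_000_000).ToList();

// today: DataColumn only accepts System.Array, so the list has to be
// copied into a fresh array first
var column = new DataColumn(field, values.ToArray());

// requested (hypothetical overload): consume the list in place, no copy
// var column = new DataColumn(field, (IReadOnlyList<int>)values);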

Actual behavior

It always uses an Array, which is not as flexible as an interface.

aersam avatar Apr 21 '21 09:04 aersam

Yes! This could significantly improve the allocation profile for our usages.

godefroi avatar Jul 13 '21 15:07 godefroi

Hi @aloneguid, I would at least opt for extending the interface with something like DataColumn(Array buffer, int offset, int count, int[] repetitionLevels = null); (though Span/Memory would also be nice). Or am I using it wrong? How else am I supposed to write columns in a row group whose size I may not know in advance, without copying the whole array for each column? Thanks.

P.S.: I am happy to implement that and make a PR if there is agreement about it.

EDIT: I was digging a bit more and it actually seems to me it may not be worth the hassle, at least for us...

hrabeale avatar Nov 22 '21 09:11 hrabeale

@hrabeale I've made a PR adding offset and count parameters, and I'm using it locally, passing in arrays from ArrayPool<T>. It seems to work. :)
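For reference, a minimal sketch of that pattern, assuming the (data, offset, count) constructor shape proposed in this thread (not a released API); the fill loop is just a stand-in:

using System.Buffers;
using Parquet.Data;

var field = new DataField<int>("id");
int[] buffer = ArrayPool<int>.Shared.Rent(4096);
try
{
    // fill the pooled buffer with however many values are available
    int count = 0;
    for (int v = 0; v < 1000; v++)
        buffer[count++] = v;

    // hypothetical overload: only the first count items become the column
    var column = new DataColumn(field, buffer, 0, count);
    // rowGroupWriter.WriteColumn(column);
}
finally
{
    ArrayPool<int>.Shared.Return(buffer);
}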

aclemmensen avatar Nov 23 '21 08:11 aclemmensen

Sorry if I'm abusing this thread. I still cannot get my head around the usage. Even if they adopt this PR, I still have to pre-allocate one huge array. Wouldn't it be possible to write a column in batches? We used a wrapped parquet-cpp in a manner similar to this:

ParquetRowGroupWriter writer;
// ... (_buffers is a list of fixed-size arrays; the last one is filled
// only up to _nextBufferIndex; closeColumn comes from our wrapper)
for (var i = 0; i < _buffers.Count; ++i)
{
    if (i == _buffers.Count - 1)
    {
        // the last buffer is only partially filled, so trim it
        // (or use the offset and length instead of a copy, if available)
        var data = new T[_nextBufferIndex];
        Array.Copy(_buffers[i], data, _nextBufferIndex);

        writer.WriteColumn(new DataColumn(_field, data), closeColumn: true);
    }
    else
    {
        writer.WriteColumn(new DataColumn(_field, _buffers[i]), closeColumn: false);
    }
}

So if I'm reading the code correctly, parquet-dotnet only supports writing a column as a single data page per row group?

hrabeale avatar Nov 24 '21 11:11 hrabeale

I think you can do that theoretically, but writing in chunks will effectively discard any logical compression. You could use a smaller row group size and array pooling?
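For example, a rough sketch with the existing v3 API, writing one row group per batch (ProduceBatches is a hypothetical stand-in for the real producer):

using System.Collections.Generic;
using System.IO;
using System.Linq;
using Parquet;
using Parquet.Data;

var field = new DataField<int>("id");
var schema = new Schema(field);

using var stream = File.Create("data.parquet");
using var writer = new ParquetWriter(schema, stream);

// one row group per batch: peak memory stays at batch size, and each
// column chunk still compresses as a unit
foreach (int[] batch in ProduceBatches())
{
    using ParquetRowGroupWriter rg = writer.CreateRowGroup();
    rg.WriteColumn(new DataColumn(field, batch));
}

// hypothetical batch source
static IEnumerable<int[]> ProduceBatches() =>
    Enumerable.Range(0, 10).Select(i => Enumerable.Range(i * 4096, 4096).ToArray());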

aloneguid avatar Nov 29 '21 09:11 aloneguid

> I think you can do that theoretically, but writing in chunks will effectively discard any logical compression. You could use a smaller row group size and array pooling?

I am still trying to get a better grasp of the performance implications. I reckon that with a reasonable data page size one may still get a good benefit from compression while maintaining fair access speed, hence the recommendation of an 8 KB data page?

Until now we have been using another proprietary column-based format where we had batches of around 4096 items (16 KB for int32), and the overall compression ratio was pretty much the same as if we had converted the data to parquet with some other tool. In some specific cases compression may even improve, as one can choose different compression/encoding strategies per data page.

From what I have seen so far, the common practice with parquet is to have one row group per file. Tools like Spark are then optimized to scale horizontally by sending these row groups to allocated nodes for separate processing. Or am I getting it wrong?

hrabeale avatar Nov 29 '21 12:11 hrabeale

Actually, as I wrote in https://github.com/aloneguid/parquet-dotnet/issues/137, I already tested this (writing smaller data pages with parquet-dotnet) and it had close to no impact on compression.

hrabeale avatar Nov 29 '21 14:11 hrabeale