parquet-dotnet
Feature Request: Consider use of IReadOnlyList<object> or ReadOnlySpan<object> instead of System.Array
Version: Parquet.Net v.3.8
Runtime Version: All
OS: All
Expected behavior
In order to create a DataColumn object I am forced to use an Array, which often leads to an unnecessary copy of data. Using an interface such as IReadOnlyList<object> (or a ReadOnlySpan<object>) would avoid that copy.
I am aware that this is breaking, but it should be fairly simple to change the calling code.
Actual behavior
It always uses an array, which is not as flexible as an interface.
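For illustration, here is a minimal sketch of the copy the current constructor forces and of what an IReadOnlyList-based overload (hypothetical, not in v3.8) would avoid; the field name and the ReadValuesFromSomewhere helper are made up:

using System.Collections.Generic;
using System.Linq;
using Parquet.Data;

// Values accumulated while streaming rows in.
List<int> values = ReadValuesFromSomewhere();   // hypothetical source
var field = new DataField<int>("id");

// v3.8: DataColumn takes System.Array, so the list has to be copied first.
var column = new DataColumn(field, values.ToArray());

// Requested (does not exist today): wrap the list directly, no second buffer.
// var column = new DataColumn(field, (IReadOnlyList<int>)values);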
Yes! This could significantly improve the allocation profile for our usages.
Hi @aloneguid,
I would at least opt for extending the interface with something like
DataColumn(Array buffer, int offset, int count, int[] repetitionLevels = null);
(But Span/Memory would be also nice.)
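Just to illustrate what that proposed overload would enable (it does not exist in v3.8; the buffer size and usedCount below are made-up stand-ins for whatever growing buffer the caller keeps):

using System;
using Parquet.Data;

// An oversized buffer; only the first usedCount entries are valid.
int[] buffer = new int[65536];
int usedCount = 10000;
var field = new DataField<int>("value");

// Today: the valid prefix must be copied into a right-sized array.
var copy = new int[usedCount];
Array.Copy(buffer, copy, usedCount);
var column = new DataColumn(field, copy);

// With the proposed overload: pass the oversized buffer directly.
// var column = new DataColumn(field, buffer, 0, usedCount);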
Or am I using it wrong? How else am I supposed to write columns in a row group whose size I may not know in advance without copying the whole array for each column?
Thanks
P.S.: I am happy to implement that and make a PR if there's an agreement about it.
EDIT: I was digging a bit more and it actually seems to me it may not be worth the hassle, at least for us...
@hrabeale I've made a PR adding offset and count parameters and I'm using it locally, passing in arrays from ArrayPool<T>. It seems to work. :)
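For reference, this is roughly the pattern I mean, assuming the PR's offset/count overload; FillBuffer and rowGroupWriter are placeholders for the surrounding code:

using System.Buffers;
using Parquet.Data;

var field = new DataField<long>("timestamp");

// Rent a buffer; the pool may hand back a larger array than requested.
long[] rented = ArrayPool<long>.Shared.Rent(100_000);
try
{
    int count = FillBuffer(rented);   // placeholder: however many values were produced

    // With the offset/count overload from the PR, only the filled prefix is written,
    // so the oversized pooled array never has to be copied or trimmed.
    rowGroupWriter.WriteColumn(new DataColumn(field, rented, 0, count));
}
finally
{
    ArrayPool<long>.Shared.Return(rented);
}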
Sorry if I'm abusing this thread. I still cannot get my head around the usage. Even if they adopt this PR, I still have to pre-allocate one huge array. Wouldn't it be possible to write a column in batches? We wrapped parquet-cpp and used it in a manner similar to this:
ParquetRowGroupWriter writer;
// ... _buffers holds fixed-size chunks; _nextBufferIndex is the fill level of the last chunk

for (var i = 0; i < _buffers.Count; ++i)
{
    if (i == _buffers.Count - 1)
    {
        // Last chunk is only partially filled: copy the used prefix
        // (or use the offset and length instead of a copy, if available).
        var data = new T[_nextBufferIndex];
        Array.Copy(_buffers[i], data, _nextBufferIndex);
        writer.WriteColumn(new DataColumn(_field, data), closeColumn: true);
    }
    else
    {
        writer.WriteColumn(new DataColumn(_field, _buffers[i]), closeColumn: false);
    }
}
So if I'm reading the code correctly, parquet-dotnet only supports writing a column as a single data page per row group?
I think you can do that theoretically, but writing in chunks will effectively discard any logical compression. You could use a smaller row group size and array pooling?
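To illustrate the smaller-row-group idea with the existing v3.8 API (the field/schema names and the NextBatch producer are made up; the batch size is arbitrary):

using System;
using System.Buffers;
using System.IO;
using Parquet;
using Parquet.Data;

const int BatchSize = 4096;
var field = new DataField<int>("value");
var schema = new Schema(field);

using (Stream stream = File.Create("data.parquet"))
using (var writer = new ParquetWriter(schema, stream))
{
    int[] batch = ArrayPool<int>.Shared.Rent(BatchSize);
    try
    {
        int count;
        while ((count = NextBatch(batch, BatchSize)) > 0)   // placeholder producer
        {
            // One small row group per batch; copy only the filled prefix,
            // since the v3.8 DataColumn constructor takes a whole array.
            var data = new int[count];
            Array.Copy(batch, data, count);

            using (ParquetRowGroupWriter rg = writer.CreateRowGroup())
            {
                rg.WriteColumn(new DataColumn(field, data));
            }
        }
    }
    finally
    {
        ArrayPool<int>.Shared.Return(batch);
    }
}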
I am still trying to get a better grasp of the performance implications. I reckon that with a reasonable data page size one may still get a good benefit from compression while maintaining fair access speed, hence the recommended 8 KB data page size?
Until now we have been using another proprietary column-based format with batches of around 4096 items (16 KB for int32), and the overall compression ratio was pretty much the same as if we had converted the data to parquet with some other tool. In some specific cases the compression may even improve, since one can choose different compression/encoding strategies per data page.
From what I have seen so far, the common practice with parquet is to have one row group per file. Tools like Spark are then optimized to scale horizontally, sending these row groups to allocated nodes for separate processing. Or am I getting it wrong?
Actually, as I wrote in https://github.com/aloneguid/parquet-dotnet/issues/137, I already tested this (writing smaller data pages with parquet-dotnet) and it had close to no impact on compression.