parquet-dotnet
Question: Multiple data pages in row group, row group size
Version: All
Runtime Version: All
OS: All
Expected behavior
Allow writing a column across multiple data pages in one row group
Actual behavior
The whole column is written as a single data page in a row group
Argument
Hi,
I am curious whether the current behavior is a final design decision or something open for future development.
Looking at the current interface and naming, it seems to me that it's intentional and final, i.e.
rowGroupWriter.WriteColumn(new DataColumn(field, array));
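For reference, this is roughly how a whole column gets written today (a minimal sketch based on my reading of the v3-era API; the entire array has to be materialized up front):

using System.IO;
using Parquet;
using Parquet.Data;

var field = new DataField<int>("id");
using (Stream stream = File.Create("data.parquet"))
using (var writer = new ParquetWriter(new Schema(field), stream))
using (ParquetRowGroupWriter rowGroupWriter = writer.CreateRowGroup())
{
    // one call, one data page: the whole column lives in memory at once
    rowGroupWriter.WriteColumn(new DataColumn(field, new[] { 1, 2, 3 }));
}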
In the official Parquet documentation, they recommend a row group size of 512-1024 MB, which can easily mean hundreds of millions of rows: https://parquet.apache.org/documentation/latest/
- From a reading and processing perspective, having that many rows in a single data page may not be optimal for some use cases. (https://github.com/apache/parquet-format: "We recommend 8KB for page sizes") (maybe someone already hit that issue: https://github.com/aloneguid/parquet-dotnet/issues/117)
- From a writing perspective, it is not very practical to write everything at once from a single huge allocated array. (related to https://github.com/aloneguid/parquet-dotnet/issues/108)
I could imagine something like this:
DataColumn(DataField field, Array[] pages);
or this:
DataPage(DataField field, Array array);
ParquetRowGroupWriter.WriteToCurrentColumn(DataPage page);
ParquetRowGroupWriter.CloseColumn();
(don't take my word for it, that's just off the top of my head)
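To make the second shape concrete, usage could look something like this (every name here is hypothetical, purely to illustrate writing a column page by page):

using (ParquetRowGroupWriter rowGroupWriter = writer.CreateRowGroup())
{
    // chunks are produced incrementally, so no single huge array is needed
    foreach (Array chunk in chunks)
    {
        rowGroupWriter.WriteToCurrentColumn(new DataPage(field, chunk));
    }
    // finalize the current column chunk before moving on to the next column
    rowGroupWriter.CloseColumn();
}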
In the old parquet-dotnet repo, a recommendation of 5000 rows is mentioned; where did that come from? https://github.com/elastacloud/parquet-dotnet/issues/392#issuecomment-456109006
Thank you
I now see that DataColumnWriter is already somewhat prepared to write multiple data pages; it's just not used.
I have already implemented a proof of concept and it works. Would there be any willingness to incorporate these changes if I polish them and make a PR? @aloneguid
Thanks @hrabeale, please go ahead, all PRs are welcome :) I'll review on Monday.
5000 rows is a general Parquet recommendation from the original Java implementation, but that was the Hadoop days and not-so-big data. These days I'd think you would tune it to a reasonable amount, as RAM is cheap and available on demand.
It's also hard to know when the 8 KB limit will be hit, because physical and logical compression depend on the data itself and vary from chunk to chunk, sometimes dramatically, so it's a matter of trial and error.
@hrabeale out of interest, why are you using the .NET library instead of, say, pyarrow or a JVM-based one?
> Thanks @hrabeale, please go ahead, all PRs are welcome :) I'll review on Monday.
Sorry, it won't be that fast; it's still not the top priority. I might look into it over Christmas.
> It's also hard to know when the 8 KB limit will be hit, because physical and logical compression depend on the data itself and vary from chunk to chunk, sometimes dramatically, so it's a matter of trial and error.
Yes, it would have to be estimate-based or something like that, but that can be left up to the user. With row group size it's pretty much the same story.
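For example, the caller could chunk rows into pages by an estimated uncompressed size, accepting that the post-compression size is unknowable up front. A purely illustrative helper (not part of any existing API):

using System;
using System.Collections.Generic;

static IEnumerable<int[]> ChunkByEstimatedSize(int[] rows, int targetPageBytes)
{
    // Rough estimate: assumes fixed-width values, ignores encoding and compression.
    int rowsPerPage = Math.Max(1, targetPageBytes / sizeof(int));
    for (int offset = 0; offset < rows.Length; offset += rowsPerPage)
    {
        int count = Math.Min(rowsPerPage, rows.Length - offset);
        var page = new int[count];
        Array.Copy(rows, offset, page, 0, count);
        yield return page;
    }
}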
> @hrabeale out of interest, why are you using the .NET library instead of, say, pyarrow or a JVM-based one?
Most of our codebase is .NET with no need for other runtimes so far. We already have a writer interface with writers for other formats, and now we just need to plug in yet another implementation for Parquet, so this seemed the most straightforward option.
closing due to inactivity