
Question: Multiple data pages in row group, row group size

Open hrabeale opened this issue 3 years ago • 6 comments

Version: All

Runtime Version: All

OS: All

Expected behavior

Allow writing a column as multiple data pages in one row group

Actual behavior

The whole column is written as a single data page in a row group

Argument

Hi,

I am curious whether the current behavior is a deliberate, final design decision or something that is open for future development. Looking at the current interface and naming, it seems intentional and final, i.e. rowGroupWriter.WriteColumn(new DataColumn(field, array));
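
For context, this is roughly what the current write path looks like (a minimal sketch against the 3.x-era API; the file and schema setup around WriteColumn is just illustrative):

```csharp
using System.IO;
using Parquet;
using Parquet.Data;

var field = new DataField<int>("value");
var schema = new Schema(field);

using (Stream fs = File.OpenWrite("data.parquet"))
using (var writer = new ParquetWriter(schema, fs))
using (ParquetRowGroupWriter rowGroupWriter = writer.CreateRowGroup())
{
    // The whole column of the row group must be materialized as one array,
    // and it ends up in a single data page.
    int[] array = new int[100_000_000]; // potentially hundreds of millions of values
    rowGroupWriter.WriteColumn(new DataColumn(field, array));
}
```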

In the official Parquet documentation the recommended row group size is 512-1024 MB, which can easily mean hundreds of millions of rows: https://parquet.apache.org/documentation/latest/

  • From a reading and processing perspective - having that many rows in a single data page may not be optimal for some use cases. (https://github.com/apache/parquet-format: "We recommend 8KB for page sizes") (maybe someone already hit that issue: https://github.com/aloneguid/parquet-dotnet/issues/117)
  • From a writing perspective - it is not very practical to write everything at once from a single huge allocated array. (related to https://github.com/aloneguid/parquet-dotnet/issues/108)

I could imagine something like DataColumn(DataField field, Array[] pages); or DataPage(DataField field, Array array); with ParquetRowGroupWriter.WriteToCurrentColumn(DataPage page); and ParquetRowGroupWriter.CloseColumn(); (don't hold me to that, it's just off the top of my head). Usage could look roughly like the sketch below.
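
A purely hypothetical sketch of the second variant - DataPage, WriteToCurrentColumn, CloseColumn and the ProduceChunks helper do not exist in the library, they only illustrate the idea:

```csharp
// Hypothetical API only - none of these members exist today.
using (ParquetRowGroupWriter rowGroupWriter = writer.CreateRowGroup())
{
    foreach (int[] chunk in ProduceChunks()) // e.g. batches streamed from a database reader
    {
        // Each chunk becomes one data page within the same column chunk.
        rowGroupWriter.WriteToCurrentColumn(new DataPage(field, chunk));
    }

    // Seal the column chunk before moving on to the next field.
    rowGroupWriter.CloseColumn();
}
```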

In the old parquet-dotnet repo a recommendation of 5000 rows is mentioned - where did that come from? https://github.com/elastacloud/parquet-dotnet/issues/392#issuecomment-456109006

Thank you

hrabeale avatar Nov 25 '21 10:11 hrabeale

I now see that DataColumnWriter is already more or less prepared to write multiple data pages; it's just not used.

hrabeale avatar Nov 26 '21 13:11 hrabeale

I have already implemented a proof of concept and it works. Would there be willingness to incorporate these changes if I polish them and open a PR? @aloneguid

hrabeale avatar Nov 26 '21 16:11 hrabeale

thanks @hrabeale please go ahead, all PRs are welcome :) I'll review on Monday.

aloneguid avatar Nov 26 '21 17:11 aloneguid

5000 rows is a general Parquet recommendation from the original Java implementation, but that was in the Hadoop days and not-so-big data. These days I'd think you would tune it to a reasonable amount, as RAM is cheap and available on demand.

It's also hard to know when the 8 KB limit will be hit because of physical and logical compression, which depends on the data itself and will vary from chunk to chunk, sometimes dramatically, so it's a matter of trial and error.

aloneguid avatar Nov 26 '21 17:11 aloneguid

@hrabeale out of interest, why are you using a .NET library instead of, say, pyarrow or a JVM-based one?

aloneguid avatar Nov 26 '21 17:11 aloneguid

> thanks @hrabeale please go ahead, all PRs are welcome :) I'll review on Monday.

Sorry, it won't be that fast; it's still not the top priority. I might look into it over Christmas.

> It's also hard to know when the 8 KB limit will be hit because of physical and logical compression, which depends on the data itself and will vary from chunk to chunk, sometimes dramatically, so it's a matter of trial and error.

Yes, it would have to be estimate-based or something like that, but that can be left up to the user. It's pretty much the same story with row group size anyway.
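
For what it's worth, a minimal sketch of what such a caller-side estimate could look like (the 8 KB target, the bytes-per-value figure and the LoadColumnValues helper are assumptions; compression will shrink the actual on-disk page size in a data-dependent way):

```csharp
// Rough caller-side estimate of how many rows fit into a ~8 KB (uncompressed) page.
const int targetPageBytes = 8 * 1024;
int estimatedBytesPerValue = sizeof(int);        // replace with a per-column estimate
int rowsPerPage = Math.Max(1, targetPageBytes / estimatedBytesPerValue);

int[] array = LoadColumnValues();                // hypothetical source of the column data

for (int offset = 0; offset < array.Length; offset += rowsPerPage)
{
    int count = Math.Min(rowsPerPage, array.Length - offset);
    int[] page = new int[count];
    Array.Copy(array, offset, page, 0, count);
    // hand `page` to whatever per-page write call ends up in the API
}
```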

> @hrabeale out of interest, why are you using a .NET library instead of, say, pyarrow or a JVM-based one?

Most of our codebase is .NET, with no need for other runtimes so far. We already have a writer interface with writers for other formats, and now we just need to plug in yet another implementation for Parquet, so this seemed the most straightforward option.

hrabeale avatar Nov 29 '21 12:11 hrabeale

closing due to inactivity

aloneguid avatar Dec 01 '22 12:12 aloneguid