parquet-java
PARQUET-869: Configurable record counts for block size checks
This PR builds on #447 and renames the properties to use "row group" instead of "block", because "block" is confusing. It also addresses the outstanding review comments so this can be merged.
Closes #447.
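The mechanism under discussion can be sketched as follows. This is an illustration only, not the actual parquet-java implementation: the class name, the halving heuristic, and the min/max clamp are all assumptions made for the example.

```java
// Sketch of an adaptive size check: instead of measuring buffered memory on
// every record (slow) or only after a fixed record count (risky when rows can
// be megabytes each), check after a configurable number of rows and adapt
// that number from the observed average row size.
class RowGroupSizeChecker {
    private final long rowGroupSizeThreshold; // target row group size in bytes
    private final int minRowCountForCheck;    // lower bound on the check interval
    private final int maxRowCountForCheck;    // upper bound on the check interval

    private long bufferedBytes = 0;
    private long bufferedRows = 0;
    private long nextCheckRowCount;

    RowGroupSizeChecker(long threshold, int minCheck, int maxCheck) {
        this.rowGroupSizeThreshold = threshold;
        this.minRowCountForCheck = minCheck;
        this.maxRowCountForCheck = maxCheck;
        this.nextCheckRowCount = minCheck;
    }

    /** Returns true when the buffered data should be flushed as a row group. */
    boolean recordWritten(long recordSizeBytes) {
        bufferedBytes += recordSizeBytes;
        bufferedRows++;
        if (bufferedRows < nextCheckRowCount) {
            return false; // skip the (relatively expensive) size check
        }
        if (bufferedBytes >= rowGroupSizeThreshold) {
            // Flush and reset for the next row group.
            bufferedBytes = 0;
            bufferedRows = 0;
            nextCheckRowCount = minRowCountForCheck;
            return true;
        }
        // Estimate how many more rows fit, then check again halfway there,
        // with the increment clamped to the configured [min, max] interval.
        long avgRowSize = bufferedBytes / bufferedRows;
        long remaining = (rowGroupSizeThreshold - bufferedBytes) / Math.max(avgRowSize, 1);
        nextCheckRowCount = bufferedRows
            + Math.min(Math.max(remaining / 2, minRowCountForCheck), maxRowCountForCheck);
        return false;
    }
}
```

The min bound is what matters for the OOM reports below: with very large rows, a small minimum forces a check soon after buffering starts instead of after a fixed 100-record window.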
@gszadovszky, ParquetProperties isn't considered part of the public API. We have needed to change it over time. See https://github.com/apache/parquet-mr/commit/6b605a4ea05b66e1a6bf843353abcb4834a4ced8#diff-84c5123768d3411558f1761aca7559a9 for an example.
I can accept that, but how would an API client know it? We already know of modifications to the "internal API" that caused problems for our clients. Until we solve the segregation of the internal and public APIs, I would suggest trying not to break compatibility for any publicly accessible members of the Java API.
None of the org.apache.parquet.column classes are public (see https://github.com/apache/parquet-mr/blob/master/pom.xml#L250). I know it is annoying to not have a public API, but I think it is much worse to slow development to maintain compatibility on internal classes than to break the few people who were for some unknown reason using an internal API with little use outside of the project.
I don't agree with a requirement for full binary compatibility across the entire codebase because we lack a public API. We already have significant drag from maintaining compatibility in the classes that are public and I think it's a bad idea to introduce that problem everywhere.
Let's work on a public API if not having one is going to prevent us from making reasonable changes to internal classes.
@julienledem, any thoughts on this?
@zivanfi Will there be any progress on this fix? Note the disclosure at https://eng.uber.com/petastorm/ ... it sounds like people are forking Parquet in order to work around this bug.
We have a large number of images that need to be stored in Parquet, but we have run into the situation described above. We hope this improvement can land as soon as possible; it would be very helpful for our work. Thanks!
Same here: we have 1 or 2 columns whose values vary widely in size (a few KB up to 10 MB), and we often hit an OutOfMemoryError because the buffered rows were not checked in time.
Being able to adjust the checks frequency would be a huge help 👍
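For reference, parquet-mr already exposes analogous knobs at the page level, which is the same idea this PR proposes for row groups. Property names below are as I recall them from ParquetOutputFormat; verify them against your version:

```java
// Page-level check-frequency settings in parquet-mr (verify against your version).
Configuration conf = new Configuration();
conf.setInt("parquet.page.size.row.check.min", 1);   // check at least after every row
conf.setInt("parquet.page.size.row.check.max", 100); // never wait more than 100 rows
```

Lowering the minimum helps when individual values are huge, at the cost of more frequent size estimation.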
I have a branch rebased against master if anyone is interested.
Well, I'm glad that I found this bug before I started saving images into Parquet files.
@livelace If at some point you consider using Parquet with binary files (or any large columns), be aware that increasing the check frequency may not be enough (it was not for me).
I had to fork Parquet to be able to opt out of statistics computation for some of my columns, see:
https://issues.apache.org/jira/browse/PARQUET-1911
Since then I have never had any OOM issue.
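For anyone hitting the same problem: the change tracked in PARQUET-1911 amounts to a per-column statistics switch. A hypothetical writer setup might look like the following; `withStatisticsEnabled` and the column name are illustrative, not a released parquet-mr API, so check the JIRA for the actual status:

```java
// Hypothetical per-column statistics opt-out in the spirit of PARQUET-1911.
// The builder method name is an assumption, not a released parquet-mr API.
ParquetWriter<Group> writer = ExampleParquetWriter.builder(path)
    .withType(schema)
    .withStatisticsEnabled("image_bytes", false) // skip min/max for the big binary column
    .build();
```

Skipping min/max avoids retaining copies of multi-megabyte values just to track statistics that are rarely useful for opaque binary data.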