[C++][Parquet] Expand ParquetVersion enum values
Describe the enhancement requested
The latest released Parquet format version is 2.10.0, but our ParquetVersion enum only goes up to 2.6.0. We should fill in the missing values. For example, 2.8.0 adds the BYTE_STREAM_SPLIT encoding for floats.
Component(s)
C++, Parquet
cc @jorisvandenbossche @mapleFU @wgtmac
@pitrou I've considered this problem before; we talked about it here: https://github.com/apache/arrow/issues/35776. I had forgotten about it until now.
Hi, is it still necessary to continue with this issue? If so, I can help.
@diego-ciciani01 Feel free to create a PR :)
Personally I think this would be a bit tricky. What would you plan to include in 2.10?
Also, some files that use 2.10 features might already be written as 2.6.0 today? 🤔
Thanks for the feedback above. I’ve been digging a bit deeper into the issue, and I now understand why simply adding new values to the ParquetVersion enum might not be straightforward.
I think we should start by researching the exact features introduced in each version (2.7-2.10) from the Parquet spec changelog. For the version mismatch issue, we could consider adding a validation step that checks that each written file declares at least the minimum version required by the features it actually uses.
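A minimal sketch of what such a validation step could look like, assuming a hypothetical expanded enum and feature list. `FormatVersion`, `Feature`, `MinimumVersionFor`, `RequiredVersion`, and `DeclaredVersionIsSufficient` are illustrative names, not existing Arrow APIs, and the only feature-to-version mapping taken from this thread is BYTE_STREAM_SPLIT in 2.8.0; the rest would have to come from the spec changelog.

```cpp
#include <cstdint>
#include <initializer_list>

// Hypothetical stand-in for an expanded ParquetVersion enum. The enumerators
// are ordered so that relational comparisons follow the version order.
enum class FormatVersion : std::uint8_t { V2_6, V2_7, V2_8, V2_9, V2_10 };

// Hypothetical list of format features a writer might use.
enum class Feature : std::uint8_t {
  kBaseline,         // placeholder for anything already available in 2.6
  kByteStreamSplit,  // BYTE_STREAM_SPLIT, added in 2.8.0 per the issue text
};

// Maps a feature to the lowest format version that defines it.
// Only the BYTE_STREAM_SPLIT entry comes from this thread.
constexpr FormatVersion MinimumVersionFor(Feature feature) {
  return feature == Feature::kByteStreamSplit ? FormatVersion::V2_8
                                              : FormatVersion::V2_6;
}

// Returns the highest minimum version over all features used by a file.
constexpr FormatVersion RequiredVersion(std::initializer_list<Feature> used) {
  FormatVersion required = FormatVersion::V2_6;
  for (Feature feature : used) {
    const FormatVersion minimum = MinimumVersionFor(feature);
    if (minimum > required) required = minimum;
  }
  return required;
}

// True when the declared version covers every feature the file uses.
constexpr bool DeclaredVersionIsSufficient(FormatVersion declared,
                                           std::initializer_list<Feature> used) {
  return declared >= RequiredVersion(used);
}
```

With this shape, a writer configured for 2.6 that uses BYTE_STREAM_SPLIT would fail the check. Whether the writer should then reject the configuration or silently raise the declared version is a design choice this sketch leaves open.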
Let me know if I understood what you meant.
I have an in-progress draft PR up for this already, as I had assumed from the good-first-issue label that we just needed to add the version numbers in. It sounds like it's more complex than that, so if that's the case, feel free to take it over in a new PR, as I don't have the capacity to complete those extra bits.