GH-41579: [C++][Python][Parquet] Support reading/writing key-value metadata from/to ColumnChunkMetaData
Rationale for this change
The Parquet standard allows storing key-value metadata in ColumnChunkMetaData, but there is currently no way to read or write it with Parquet C++.
What changes are included in this PR?
Support reading/writing key-value metadata from/to ColumnChunkMetaData with Parquet C++ reader/writer. Support reading key-value metadata from ColumnChunkMetaData with pyarrow.parquet.
Are these changes tested?
Yes, unit tests are added.
Are there any user-facing changes?
Yes.
- Users can read or write key-value metadata for column chunks with Parquet C++.
- Users can read key-value metadata for column chunks with PyArrow.
- parquet-reader tool prints key-value metadata in column chunks when the --print-key-value-metadata option is used.
- GitHub Issue: #41579
:warning: GitHub issue #41579 has been automatically assigned in GitHub to PR creator.
I saw a related issue: https://issues.apache.org/jira/browse/PARQUET-1648. It seems that parquet-mr does not use it yet.
Can parquet-mr read that?
I don't think so.
https://github.com/apache/parquet-mr/blob/c241170d9bc2cd8415b04e06ecea40ed3d80f64d/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1596-L1611
🤔 At least this patch is OK, and since other implementations already have the field via the shared Thrift definition, it doesn't break the standard.
I agree. It should not block us from implementing this.
@clee704 Could you please submit a test file to https://github.com/apache/parquet-testing instead of adding it in this PR?
Created https://github.com/apache/parquet-testing/pull/49
@pitrou @clee704 Where does this PR stand? Would really like to see this exposed in the interface
Oh, Python lint failed...
I merged https://github.com/apache/parquet-testing/pull/49 , you can adjust the link here
Besides, the Python lint has failed:
Python Format............................................................Failed
- hook id: flake8
- exit code: 1
python/pyarrow/tests/parquet/test_metadata.py:788:89: E501 line too long (97 > 88 characters)
@clee704 Sorry for the delay, I've updated this patch myself. This patch LGTM and I'm willing to merge it before August 6th. Would you mind re-checking it?
I would check the failing Cython build:
Error compiling Cython file:
------------------------------------------------------------
...
return self.metadata.GetColumnIndexLocation().has_value()
@property
def metadata(self):
"""Additional metadata as key value pairs (dict[bytes, bytes])."""
wrapped = pyarrow_wrap_metadata(self.metadata.key_value_metadata())
^
------------------------------------------------------------
pyarrow/_parquet.pyx:514:18: undeclared name not builtin: pyarrow_wrap_metadata
Error compiling Cython file:
------------------------------------------------------------
...
return self.metadata.GetColumnIndexLocation().has_value()
@property
def metadata(self):
"""Additional metadata as key value pairs (dict[bytes, bytes])."""
wrapped = pyarrow_wrap_metadata(self.metadata.key_value_metadata())
^
------------------------------------------------------------
pyarrow/_parquet.pyx:514:72: Cannot convert 'shared_ptr[const CKeyValueMetadata]' to Python object
[17/136] Compiling Cython CXX source for _dataset...
performance hint: pyarrow/_dataset.pyx:1947:35: Exception check after calling 'GetResultValue[shared_ptr[CRandomAccessFile]]' will always require the GIL to be acquired. Declare 'GetResultValue[shared_ptr[CRandomAccessFile]]' as 'noexcept' if you control the definition and you're sure you don't want the function to raise exceptions.
performance hint: pyarrow/_dataset.pyx:3343:41: Exception check after calling 'GetResultValue[shared_ptr[CRecordBatch]]' will always require the GIL to be acquired. Declare 'GetResultValue[shared_ptr[CRecordBatch]]' as 'noexcept' if you control the definition and you're sure you don't want the function to raise exceptions.
performance hint: pyarrow/_dataset.pyx:3387:34: Exception check after calling 'GetResultValue[CTaggedRecordBatch]' will always require the GIL to be acquired. Declare 'GetResultValue[CTaggedRecordBatch]' as 'noexcept' if you control the definition and you're sure you don't want the function to raise exceptions.
performance hint: pyarrow/_dataset.pyx:3784:42: Exception check after calling 'GetResultValue[CTaggedRecordBatchIterator]' will always require the GIL to be acquired. Declare 'GetResultValue[CTaggedRecordBatchIterator]' as 'noexcept' if you control the definition and you're sure you don't want the function to raise exceptions.
[18/136] Building CXX object CMakeFiles/arrow_python.dir/pyarrow/src/arrow/python/datetime.cc.o
@wgtmac @pitrou Do you have any more comments here? Let's move forward now.
I will merge this week if there are no negative comments.
@pitrou Would you mind taking a look at this? I'm planning to merge this PR this week if there are no negative comments.
Sorry, I approved, but the two comments should probably be addressed before merging this.
Thanks all, merged!
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 2767dc55cb41377af6895141f717475d73b2892d.
There were no benchmark performance regressions. 🎉
The full Conbench report has more details. It also includes information about 9 possible false positives for unstable benchmarks that are known to sometimes produce them.