arrow icon indicating copy to clipboard operation
arrow copied to clipboard

GH-41579: [C++][Python][Parquet] Support reading/writing key-value metadata from/to ColumnChunkMetaData

Open clee704 opened this issue 1 year ago • 10 comments

Rationale for this change

Parquet standard allows reading/writing key-value metadata from/to ColumnChunkMetaData, but there is no way to do that with Parquet C++.

What changes are included in this PR?

Support reading/writing key-value metadata from/to ColumnChunkMetaData with Parquet C++ reader/writer. Support reading key-value metadata from ColumnChunkMetaData with pyarrow.parquet.

Are these changes tested?

Yes, unit tests are added

Are there any user-facing changes?

Yes.

  • Users can read or write key-value metadata for column chunks with Parquet C++.
  • Users can read key-value metadata for column chunks with PyArrow.
  • parquet-reader tool prints key-value metadata in column chunks when --print-key-value-metadata option is used.
  • GitHub Issue: #41579

clee704 avatar May 08 '24 01:05 clee704

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

In the case of PARQUET issues on JIRA the title also supports:

PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

See also:

github-actions[bot] avatar May 08 '24 01:05 github-actions[bot]

:warning: GitHub issue #41579 has been automatically assigned in GitHub to PR creator.

github-actions[bot] avatar May 08 '24 01:05 github-actions[bot]

I saw a related issue: https://issues.apache.org/jira/browse/PARQUET-1648. It seems that parquet-mr does not use it yet.

wgtmac avatar May 12 '24 15:05 wgtmac

It seems that parquet-mr does not use it yet.

Can parquet-mr reads that?

mapleFU avatar May 12 '24 15:05 mapleFU

I don't think so.

https://github.com/apache/parquet-mr/blob/c241170d9bc2cd8415b04e06ecea40ed3d80f64d/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1596-L1611

wgtmac avatar May 12 '24 15:05 wgtmac

🤔at least this patch is ok and seems other implementation has thrift, it doesn't break the standard..

mapleFU avatar May 12 '24 15:05 mapleFU

I agree. It should not block us from implementing this.

wgtmac avatar May 12 '24 15:05 wgtmac

@clee704 Could you please submit a test file to https://github.com/apache/parquet-testing instead of adding it in this PR?

pitrou avatar May 13 '24 15:05 pitrou

@clee704 Could you please submit a test file to https://github.com/apache/parquet-testing instead of adding it in this PR?

Created https://github.com/apache/parquet-testing/pull/49

clee704 avatar May 29 '24 02:05 clee704

@pitrou @clee704 Where does this PR stand? Would really like to see this exposed in the interface

arsnyder16 avatar Jun 27 '24 15:06 arsnyder16

Oh, Python lint failed...

mapleFU avatar Jul 15 '24 03:07 mapleFU

I merged https://github.com/apache/parquet-testing/pull/49 , you can adjust the link here

mapleFU avatar Jul 21 '24 07:07 mapleFU

Beside a Py Lint has failed:

Python Format............................................................Failed
- hook id: flake8
- exit code: 1

python/pyarrow/tests/parquet/test_metadata.py:788:89: E501 line too long (97 > 88 characters)

mapleFU avatar Jul 22 '24 08:07 mapleFU

@clee704 Sorry for delaying, I've update this patch myself. This patch LGTM and I'm willing to merge it before August 6th. Would you mind re-checking this?

mapleFU avatar Aug 03 '24 14:08 mapleFU

Would check CPython build

 Error compiling Cython file:
  ------------------------------------------------------------
  ...
          return self.metadata.GetColumnIndexLocation().has_value()

      @property
      def metadata(self):
          """Additional metadata as key value pairs (dict[bytes, bytes])."""
          wrapped = pyarrow_wrap_metadata(self.metadata.key_value_metadata())
                    ^
  ------------------------------------------------------------

  pyarrow/_parquet.pyx:514:18: undeclared name not builtin: pyarrow_wrap_metadata

  Error compiling Cython file:
  ------------------------------------------------------------
  ...
          return self.metadata.GetColumnIndexLocation().has_value()

      @property
      def metadata(self):
          """Additional metadata as key value pairs (dict[bytes, bytes])."""
          wrapped = pyarrow_wrap_metadata(self.metadata.key_value_metadata())
                                                                          ^
  ------------------------------------------------------------

  pyarrow/_parquet.pyx:514:72: Cannot convert 'shared_ptr[const CKeyValueMetadata]' to Python object
  [17/136] Compiling Cython CXX source for _dataset...
  performance hint: pyarrow/_dataset.pyx:1947:35: Exception check after calling 'GetResultValue[shared_ptr[CRandomAccessFile]]' will always require the GIL to be acquired. Declare 'GetResultValue[shared_ptr[CRandomAccessFile]]' as 'noexcept' if you control the definition and you're sure you don't want the function to raise exceptions.
  performance hint: pyarrow/_dataset.pyx:3343:41: Exception check after calling 'GetResultValue[shared_ptr[CRecordBatch]]' will always require the GIL to be acquired. Declare 'GetResultValue[shared_ptr[CRecordBatch]]' as 'noexcept' if you control the definition and you're sure you don't want the function to raise exceptions.
  performance hint: pyarrow/_dataset.pyx:3387:34: Exception check after calling 'GetResultValue[CTaggedRecordBatch]' will always require the GIL to be acquired. Declare 'GetResultValue[CTaggedRecordBatch]' as 'noexcept' if you control the definition and you're sure you don't want the function to raise exceptions.
  performance hint: pyarrow/_dataset.pyx:3784:42: Exception check after calling 'GetResultValue[CTaggedRecordBatchIterator]' will always require the GIL to be acquired. Declare 'GetResultValue[CTaggedRecordBatchIterator]' as 'noexcept' if you control the definition and you're sure you don't want the function to raise exceptions.
  [18/136] Building CXX object CMakeFiles/arrow_python.dir/pyarrow/src/arrow/python/datetime.cc.o

mapleFU avatar Aug 04 '24 04:08 mapleFU

@wgtmac @pitrou Do you have any more comments here? Lets move forward now

mapleFU avatar Aug 07 '24 07:08 mapleFU

Would merge this week if no negative comments

mapleFU avatar Aug 13 '24 16:08 mapleFU

@pitrou Would you mind take a look at this? I'm planning to merge this pr this week if no negative comments

mapleFU avatar Aug 15 '24 13:08 mapleFU

Sorry, I approved while the two comments should probably be addressed before merging this.

pitrou avatar Aug 15 '24 14:08 pitrou

Thanks all, merged!

mapleFU avatar Aug 15 '24 16:08 mapleFU

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 2767dc55cb41377af6895141f717475d73b2892d.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 9 possible false positives for unstable benchmarks that are known to sometimes produce them.