arrow icon indicating copy to clipboard operation
arrow copied to clipboard

ARROW-16430: [Python] Add support for reading record batch custom metadata API

Open niyue opened this issue 2 years ago • 18 comments

In ARROW-16131, C++ APIs were added so that users can read/write record batch custom metadata for IPC file. In this PR, pyarrow APIs are added so that python users can take advantage of these APIs to address ARROW-16430.

niyue avatar May 01 '22 06:05 niyue

https://issues.apache.org/jira/browse/ARROW-16430

github-actions[bot] avatar May 01 '22 06:05 github-actions[bot]

:warning: Ticket has not been started in JIRA, please click 'Start Progress'.

github-actions[bot] avatar May 01 '22 06:05 github-actions[bot]

By the way, this PR addresses RecordBatchFileReader but not RecordBatchReader. Do you plan to do that as well?

pitrou avatar May 10 '22 10:05 pitrou

By the way, this PR addresses RecordBatchFileReader but not RecordBatchReader. Do you plan to do that as well?

I tried adding the pyarrow API for RecordBatchReader locally previously, but I found if I want to test the new pyarrow API for RecordBatchReader, I probably have to add the corresponding API implementation in C++ RecordBatchStreamReader so that I can unit test it in a meaningful way since it seems only RecordBatchStreamReader and RecordBatchFileReader provides the capability to read record batch custom metadata (is this correct?). And since this PR is more about adding new pyarrow API for existing C++ capability, I would like to make non necessary C++ API change less in this PR.

  1. if there is other way to unit test pyarrow RecordBatchReader.read_next_batch_with_metadata API besides implementing it in RecordBatchStreamReader?
  2. do you recommend to put all these changes together in single PR or keep them as separated PRs? I initially submitted this issue as a pyarrow enhancement PR, but I am okay to include more API changes if it is preferred.

niyue avatar May 12 '22 07:05 niyue

@pitrou I made some additional changes to the PR, could you please help to review? Thanks.

niyue avatar May 17 '22 02:05 niyue

@niyue Sorry for the late replay, but the C++ RecordBatchReader already has a method Result<RecordBatchWithMetadata> ReadNext(), so you wouldn't need to add anything more to the C++ side AFAICT?

pitrou avatar May 18 '22 15:05 pitrou

@niyue Are you planning to work on this? Otherwise @milesgranger you might be interested in taking this up?

pitrou avatar Aug 09 '22 09:08 pitrou

I can certainly pick it up if needed! Presently, I'm working locally on ARROW-13763.

milesgranger avatar Aug 09 '22 09:08 milesgranger

@niyue Are you planning to work on this?

Otherwise @milesgranger you might be interested in taking this up?

Hi @pitrou, sorry I got too busy since last commit and forgot this issue previously. I am still interested working on this, and can start working on it from next week.

niyue avatar Aug 09 '22 11:08 niyue

@niyue If/when this is ready for review, please say so :-)

pitrou avatar Aug 17 '22 10:08 pitrou

@pitrou I pushed a new commit a minute ago. I followed your suggestion to add corresponding API for RecordBatchReader API in pyarrow, and a basic unit test to cover this API. I ran all pyarrow tests (using the minimum build only) and all tests passed locally. Could you please help to review? Thanks.

niyue avatar Aug 17 '22 10:08 niyue

@pitrou I am not sure if you see my previous comment. I made some change previously, and this PR is ready for review, could you please help? Thanks.

niyue avatar Sep 08 '22 22:09 niyue

@niyue Sorry, I had overlooked it. I'll take a look when I can. @jorisvandenbossche would you like to review this too?

pitrou avatar Sep 12 '22 14:09 pitrou

No problem. I will check them out and see if I can get them addressed. Thanks for the review.

niyue avatar Sep 22 '22 23:09 niyue

There were some timeouts in some CI jobs. I've restarted them just in case. If that still reproduces, should first rebase from master to catch up with any upstream fixes.

pitrou avatar Oct 05 '22 09:10 pitrou

This looks like a nice addition. Something I am wondering more in general (potentially for future JIRA/PRs), when working with custom metadata, would it be useful to also allow to inspect the custom metadata of a certain batch, without also loading the batch? (so you could for example check the metadata before deciding whether to read the batch or not). I suppose that would only make sense for the IPC file format? (although not sure if that makes sense at the format / C++ level) Like the RecordBatchFileReader now has a get_batch(i)/ get_batch_with_custom_metadata(i), one could also have a get_custom_metadata(i)?

jorisvandenbossche avatar Oct 05 '22 09:10 jorisvandenbossche

Something I am wondering more in general (potentially for future JIRA/PRs), when working with custom metadata, would it be useful to also allow to inspect the custom metadata of a certain batch, without also loading the batch?

Perhaps, but that would have to be done on the C++ side first. Also, I would wait for users to actually request it.

pitrou avatar Oct 05 '22 09:10 pitrou

@pitrou sorry for the late response. There are two failed CI builds, and they don't seem relevant in this issue. I rebased onto the latest master branch anyway.

niyue avatar Oct 10 '22 06:10 niyue

@pitrou could you please help to review the PR to see if there is still anything we desire to change? Thanks.

niyue avatar Nov 02 '22 08:11 niyue

No problem @pitrou . Thanks for the review and the fixes.

niyue avatar Dec 13 '22 04:12 niyue