arrow
arrow copied to clipboard
ARROW-16430: [Python] Add support for reading record batch custom metadata API
In ARROW-16131, C++ APIs were added so that users can read/write record batch custom metadata for IPC file. In this PR, pyarrow APIs are added so that python users can take advantage of these APIs to address ARROW-16430.
https://issues.apache.org/jira/browse/ARROW-16430
:warning: Ticket has not been started in JIRA, please click 'Start Progress'.
By the way, this PR addresses RecordBatchFileReader
but not RecordBatchReader
. Do you plan to do that as well?
By the way, this PR addresses
RecordBatchFileReader
but notRecordBatchReader
. Do you plan to do that as well?
I tried adding the pyarrow API for RecordBatchReader
locally previously, but I found if I want to test the new pyarrow API for RecordBatchReader
, I probably have to add the corresponding API implementation in C++ RecordBatchStreamReader
so that I can unit test it in a meaningful way since it seems only RecordBatchStreamReader
and RecordBatchFileReader
provides the capability to read record batch custom metadata (is this correct?). And since this PR is more about adding new pyarrow API for existing C++ capability, I would like to make non necessary C++ API change less in this PR.
- if there is other way to unit test pyarrow
RecordBatchReader.read_next_batch_with_metadata
API besides implementing it inRecordBatchStreamReader
? - do you recommend to put all these changes together in single PR or keep them as separated PRs? I initially submitted this issue as a pyarrow enhancement PR, but I am okay to include more API changes if it is preferred.
@pitrou I made some additional changes to the PR, could you please help to review? Thanks.
@niyue Sorry for the late replay, but the C++ RecordBatchReader
already has a method Result<RecordBatchWithMetadata> ReadNext()
, so you wouldn't need to add anything more to the C++ side AFAICT?
@niyue Are you planning to work on this? Otherwise @milesgranger you might be interested in taking this up?
I can certainly pick it up if needed! Presently, I'm working locally on ARROW-13763.
@niyue Are you planning to work on this?
Otherwise @milesgranger you might be interested in taking this up?
Hi @pitrou, sorry I got too busy since last commit and forgot this issue previously. I am still interested working on this, and can start working on it from next week.
@niyue If/when this is ready for review, please say so :-)
@pitrou I pushed a new commit a minute ago. I followed your suggestion to add corresponding API for RecordBatchReader
API in pyarrow, and a basic unit test to cover this API. I ran all pyarrow tests (using the minimum build only) and all tests passed locally. Could you please help to review? Thanks.
@pitrou I am not sure if you see my previous comment. I made some change previously, and this PR is ready for review, could you please help? Thanks.
@niyue Sorry, I had overlooked it. I'll take a look when I can. @jorisvandenbossche would you like to review this too?
No problem. I will check them out and see if I can get them addressed. Thanks for the review.
There were some timeouts in some CI jobs. I've restarted them just in case. If that still reproduces, should first rebase from master to catch up with any upstream fixes.
This looks like a nice addition. Something I am wondering more in general (potentially for future JIRA/PRs), when working with custom metadata, would it be useful to also allow to inspect the custom metadata of a certain batch, without also loading the batch? (so you could for example check the metadata before deciding whether to read the batch or not).
I suppose that would only make sense for the IPC file format? (although not sure if that makes sense at the format / C++ level) Like the RecordBatchFileReader now has a get_batch(i)
/ get_batch_with_custom_metadata(i)
, one could also have a get_custom_metadata(i)
?
Something I am wondering more in general (potentially for future JIRA/PRs), when working with custom metadata, would it be useful to also allow to inspect the custom metadata of a certain batch, without also loading the batch?
Perhaps, but that would have to be done on the C++ side first. Also, I would wait for users to actually request it.
@pitrou sorry for the late response. There are two failed CI builds, and they don't seem relevant in this issue. I rebased onto the latest master branch anyway.
@pitrou could you please help to review the PR to see if there is still anything we desire to change? Thanks.
No problem @pitrou . Thanks for the review and the fixes.