
[feature] Investigate integrations leveraging the PyCapsule protocol

Open kevinjqliu opened this issue 10 months ago • 3 comments

Feature Request / Improvement

Context: https://github.com/apache/iceberg-python/pull/1614#issuecomment-2641089912 Copying the comment over: Separately, rather than adding more library-specific conversion code, it might make sense for pyiceberg to start leveraging the PyCapsule protocol to allow any third party library (dataframe or otherwise) that supports Arrow data to seamlessly consume pyiceberg constructs.

Polars already supports the PyCapsule interface. See https://docs.pola.rs/user-guide/misc/arrow/#using-the-arrow-pycapsule-interface for details.

Implementing the interface on e.g. pyiceberg tables would allow them to be passed directly to dataframe init in polars, just like you can do a pyarrow table today. It also doesn't assume anything about polars support/doesn't add a dependency on polars.

cc @corleyma if you would like to provide more context :)

kevinjqliu avatar Feb 12 '25 19:02 kevinjqliu

I think this would be a great addition to the library and open up support for integration with a large variety of dataframe tools.

Given PyIceberg already has a .to_arrow() method, maybe we can start by using that for the implementation of the dunders?

def __arrow_c_stream__(self, requested_schema=None):
    return self.scan().to_arrow().__arrow_c_stream__(requested_schema)
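A minimal runnable sketch of that delegation pattern, with stub classes standing in for pyiceberg's real `Table`, scan, and pyarrow table (all names and the placeholder return value are made up for illustration):

```python
# Hypothetical sketch: how the dunder on a pyiceberg-style Table could
# simply delegate to the pyarrow table produced by a scan. The stubs
# below are stand-ins; real pyarrow returns a PyCapsule here.

class FakeArrowTable:
    def __arrow_c_stream__(self, requested_schema=None):
        return "capsule-placeholder"  # real impl: a PyCapsule wrapping ArrowArrayStream


class FakeScan:
    def to_arrow(self):
        return FakeArrowTable()


class Table:
    def scan(self):
        return FakeScan()

    def __arrow_c_stream__(self, requested_schema=None):
        # Delegate to the Arrow table materialized by the scan.
        return self.scan().to_arrow().__arrow_c_stream__(requested_schema)


print(Table().__arrow_c_stream__())  # → capsule-placeholder
```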

WillAyd avatar Mar 19 '25 18:03 WillAyd

@WillAyd thanks for the suggestion! I haven't investigated this yet, but I see the __arrow_c_stream__ docs. We have both to_arrow() and to_arrow_batch_reader() for table scans.

Do you know how __arrow_c_stream__ is used by other clients? How is it used after we add this function?

kevinjqliu avatar Mar 26 '25 17:03 kevinjqliu

Using the terminology from the Arrow standard, the presence of __arrow_c_stream__ on an object would signal that you are a producer of Arrow data. A consumer may inspect your Python object, and upon detecting that dunder, know that your object implements the Arrow stream protocol.
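As a rough sketch of the consumer side (not pyiceberg code; `FakeProducer` and `supports_arrow_stream` are made-up names for illustration), that detection is just duck typing on the dunder:

```python
# Hypothetical sketch: how a consumer library detects an Arrow
# PyCapsule stream producer. No Arrow dependency is needed for the
# check itself.

class FakeProducer:
    """Stand-in for an object (e.g. a table) that produces Arrow data.
    A real implementation returns a PyCapsule wrapping an
    ArrowArrayStream struct."""

    def __arrow_c_stream__(self, requested_schema=None):
        raise NotImplementedError("real impl returns a PyCapsule")


def supports_arrow_stream(obj) -> bool:
    # Consumers (polars, duckdb, pandas, ...) look for the dunder.
    return hasattr(obj, "__arrow_c_stream__")


print(supports_arrow_stream(FakeProducer()))  # → True
print(supports_arrow_stream(object()))        # → False
```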

As far as the usage is concerned, you would still maintain ownership of the lifecycle of that data, regardless of how many consumers try to view it. So from that perspective you typically don't need to worry about what consumers are doing with it, unless you are doing something highly specialized.

The most current listing I am aware of of libraries that use the PyCapsule protocol is available here:

https://github.com/apache/arrow/issues/39195#issuecomment-2245718008

By implementing __arrow_c_stream__, you would give most, if not all, of those libraries a zero-copy way of accessing your data (assuming your implementation itself is zero-copy; I don't know how .scan() works).

WillAyd avatar Mar 26 '25 17:03 WillAyd

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

github-actions[bot] avatar Dec 01 '25 00:12 github-actions[bot]

This is still a desired piece of functionality, I believe.

kylebarron avatar Dec 01 '25 19:12 kylebarron