Allow Arrow Capsule Interface

Open DisturbedOcean opened this issue 1 month ago • 3 comments

Apache Iceberg version

0.10.0 (latest release)

Please describe the bug 🐞

Because iceberg-python does isinstance checks against pyarrow types in certain places, I can't pass in objects from libraries such as arro3 or polars without first converting them and adding pyarrow as a dependency. Here is one such case in table/__init__.py:

def append(self, df: pa.Table, snapshot_properties: Dict[str, str] = EMPTY_DICT, branch: Optional[str] = MAIN_BRANCH) -> None:
    """
    Shorthand API for appending a PyArrow table to a table transaction.

    Args:
        df: The Arrow dataframe that will be appended to overwrite the table
        snapshot_properties: Custom properties to be added to the snapshot summary
        branch: Branch Reference to run the append operation
    """
    try:
        import pyarrow as pa
    except ModuleNotFoundError as e:
        raise ModuleNotFoundError("For writes PyArrow needs to be installed") from e

    from pyiceberg.io.pyarrow import _check_pyarrow_schema_compatible, _dataframe_to_data_files

    if not isinstance(df, pa.Table):
        raise ValueError(f"Expected PyArrow table, got: {df}")

Can this be updated to use the Arrow PyCapsule Interface (https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html)?

I can create a patch if this is something that will be accepted. Sorry for the new account; due to employer issues I can't use my "regular" one.

Willingness to contribute

  • [x] I can contribute a fix for this bug independently
  • [x] I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • [ ] I cannot contribute a fix for this bug at this time

DisturbedOcean avatar Nov 03 '25 00:11 DisturbedOcean

Thanks for raising this, @DisturbedOcean. I believe this is similar to #1655.

Please feel free to give it a try.

kevinjqliu avatar Nov 03 '25 16:11 kevinjqliu

@kevinjqliu Sorry - what am I supposed to try? The problem is the isinstance checks in pyiceberg from what I can tell.

DisturbedOcean avatar Nov 03 '25 16:11 DisturbedOcean

Feel free to submit a PR

pyiceberg is very much coupled with pyarrow right now. It would be good to decouple it and support the Arrow PyCapsule Interface. I see arro3 has a Table object; what's the best way to abstract the isinstance check so that it's not solely dependent on pyarrow?

kevinjqliu avatar Nov 03 '25 18:11 kevinjqliu

The point of the PyCapsule Interface is to not have your API be stuck/tied to any one library implementation. So instead of taking in a table: pyarrow.Table, you should take in an ArrowStreamExportable, defined as

from typing import Protocol

class ArrowStreamExportable(Protocol):
    def __arrow_c_stream__(
        self,
        requested_schema: object | None = None
    ) -> object:
        ...

So for any input object that advertises an __arrow_c_stream__ method, you can import its data into your internal Arrow implementation of choice using, say, pyarrow.table(input_object).
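As a rough illustration of how the isinstance check could be abstracted, here is a minimal sketch. It is not pyiceberg's actual code: `to_pyarrow_table` is a hypothetical helper name, and the sketch assumes a pyarrow version whose `pa.table()` accepts PyCapsule-exporting objects (14.0 or later, as far as I know). pyarrow is still used internally here, but callers no longer have to pass a pyarrow.Table.

```python
from typing import Any, Optional, Protocol, runtime_checkable


@runtime_checkable
class ArrowStreamExportable(Protocol):
    """Any object that exports its data through the Arrow PyCapsule stream interface."""

    def __arrow_c_stream__(self, requested_schema: Optional[object] = None) -> object:
        ...


def to_pyarrow_table(df: Any):
    """Hypothetical helper: normalize any stream-exportable input to a pyarrow.Table."""
    # Check for the protocol rather than for one concrete library's Table class.
    # With runtime_checkable, isinstance only verifies that the method exists.
    if not isinstance(df, ArrowStreamExportable):
        raise ValueError(
            f"Expected an object exposing __arrow_c_stream__, got: {type(df).__name__}"
        )

    # Import lazily, mirroring the existing append() code path.
    try:
        import pyarrow as pa
    except ModuleNotFoundError as e:
        raise ModuleNotFoundError("For writes PyArrow needs to be installed") from e

    # Assumption: pa.table() consumes the exported stream (pyarrow >= 14.0).
    return pa.table(df)
```

With something like this in place, `append()` could call `to_pyarrow_table(df)` at the top instead of rejecting anything that is not a `pa.Table`, so a table from arro3 or another library exposing `__arrow_c_stream__` would be accepted without the caller converting it first.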

kylebarron avatar Dec 01 '25 19:12 kylebarron