Allow Arrow Capsule Interface
Apache Iceberg version
0.10.0 (latest release)
Please describe the bug 🐞
Due to how iceberg-python does checks in certain places, I can't use libraries such as arro3 or polars without having to do a conversion and include pyarrow as a dependency. Here is such a case in table/__init__.py:
def append(self, df: pa.Table, snapshot_properties: Dict[str, str] = EMPTY_DICT, branch: Optional[str] = MAIN_BRANCH) -> None:
    """
    Shorthand API for appending a PyArrow table to a table transaction.

    Args:
        df: The Arrow dataframe that will be appended to overwrite the table
        snapshot_properties: Custom properties to be added to the snapshot summary
        branch: Branch Reference to run the append operation
    """
    try:
        import pyarrow as pa
    except ModuleNotFoundError as e:
        raise ModuleNotFoundError("For writes PyArrow needs to be installed") from e

    from pyiceberg.io.pyarrow import _check_pyarrow_schema_compatible, _dataframe_to_data_files

    if not isinstance(df, pa.Table):
        raise ValueError(f"Expected PyArrow table, got: {df}")
Can this be updated to use the capsule interface: https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html ?
I can create a patch if this is something that will be accepted. Sorry for the new account, due to employer issues I can't use my "regular" one.
Willingness to contribute
- [x] I can contribute a fix for this bug independently
- [x] I would be willing to contribute a fix for this bug with guidance from the Iceberg community
- [ ] I cannot contribute a fix for this bug at this time
Thanks for raising this, @DisturbedOcean. I believe this is similar to #1655.
Please feel free to give it a try.
@kevinjqliu Sorry - what am I supposed to try? The problem is the isinstance checks in pyiceberg from what I can tell.
Feel free to submit a PR
pyiceberg is very much coupled with pyarrow right now. Would be good to decouple it and support the Arrow Capsule interface.
I see arro3 has the Table object. What's the best way to abstract the isinstance check so that it's not solely dependent on pyarrow?
The point of the PyCapsule Interface is to not have your API be stuck/tied to any one library implementation. So instead of taking in a table: pyarrow.Table, you should take in an ArrowStreamExportable, defined as
from typing import Protocol

class ArrowStreamExportable(Protocol):
    def __arrow_c_stream__(
        self,
        requested_schema: object | None = None
    ) -> object:
        ...
So for any input object that advertises an __arrow_c_stream__ method, you can import its data into your internal Arrow implementation of choice using, say, pyarrow.table(input_object)
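A minimal sketch of what that could look like in place of the `isinstance(df, pa.Table)` check. The helper name `to_pyarrow_table` is hypothetical (it is not an existing pyiceberg function), and this assumes a pyarrow version whose `pa.table()` accepts PyCapsule producers (14.0 or later, as far as I know):

```python
from typing import Any, Optional, Protocol, runtime_checkable


@runtime_checkable
class ArrowStreamExportable(Protocol):
    """Any object that can export itself as an Arrow C stream."""

    def __arrow_c_stream__(self, requested_schema: Optional[object] = None) -> object:
        ...


def to_pyarrow_table(df: Any) -> Any:
    """Hypothetical helper: accept any Arrow stream producer, return a pyarrow.Table.

    Not part of pyiceberg; shown only to illustrate replacing the
    isinstance(df, pa.Table) check with a protocol check.
    """
    # runtime_checkable only verifies that the method exists, which is
    # exactly the "advertises __arrow_c_stream__" condition.
    if not isinstance(df, ArrowStreamExportable):
        raise ValueError(
            f"Expected an object implementing __arrow_c_stream__, got: {type(df).__name__}"
        )

    try:
        import pyarrow as pa
    except ModuleNotFoundError as e:
        raise ModuleNotFoundError("For writes PyArrow needs to be installed") from e

    # pa.table() imports the stream through the PyCapsule Interface,
    # so arro3, polars, or pyarrow inputs all end up as a pyarrow.Table.
    return pa.table(df)
```

With something like this, pyarrow stays the internal implementation, but callers could pass an arro3 table or a polars DataFrame directly, without converting it to a `pa.Table` themselves first.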