dataframe-api
Add methods for the Arrow PyCapsule Protocol to DataFrame/Column interchange protocol objects
Addresses https://github.com/data-apis/dataframe-api/issues/279
Currently, I added schema and array support to Column, and schema and stream support to DataFrame. I think that's the typical way those objects are used (a chunked dataframe, from which you get chunks + columns, and then single-chunk columns), but for completeness (and because the spec doesn't distinguish between single- vs multi-chunk objects) we can probably add both the array and stream methods to both the DataFrame and Column objects.
I also added the schema method to both DataFrame and Column, although those are strictly speaking not "schema" objects. But we don't have a dedicated schema object in the interchange spec (DataFrame has no such attribute, and the Column.dtype attribute is a plain tuple).
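For illustration, here is a minimal sketch of what these methods could look like on the interchange objects, assuming a pyarrow-backed implementation; the class names and backing storage are hypothetical, not part of the spec:

```python
import pyarrow as pa


class Column:
    """Hypothetical interchange Column backed by a (single-chunk) pyarrow Array."""

    def __init__(self, arr: pa.Array) -> None:
        self._arr = arr

    def __arrow_c_schema__(self):
        # ArrowSchema PyCapsule describing the column's type.
        return self._arr.type.__arrow_c_schema__()

    def __arrow_c_array__(self, requested_schema=None):
        # Pair of (ArrowSchema, ArrowArray) PyCapsules with the column's data.
        return self._arr.__arrow_c_array__(requested_schema)


class DataFrame:
    """Hypothetical interchange DataFrame backed by a (chunked) pyarrow Table."""

    def __init__(self, table: pa.Table) -> None:
        self._table = table

    def __arrow_c_schema__(self):
        # ArrowSchema PyCapsule describing the full dataframe schema.
        return self._table.schema.__arrow_c_schema__()

    def __arrow_c_stream__(self, requested_schema=None):
        # ArrowArrayStream PyCapsule yielding the table chunk by chunk.
        return self._table.__arrow_c_stream__(requested_schema)
```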
What should the behavior be if the library has its data on, say, a GPU instead of the CPU?
Throw? Copy to CPU?
If we support this CPU-only protocol, why don't we support other CPU-only protocols as well?
What should the behavior be if the library has its data on, say, a GPU instead of the CPU? Throw? Copy to CPU?
That's one thing I still needed to add: if you do not support this protocol, I would propose to indicate this by not having the methods (so you can do `hasattr(obj, "..")` as a proxy for checking whether the object supports this protocol).
So if we decide that GPU libraries shouldn't support this, then they should simply not add those methods.
Personally I don't have much experience with GPUs, but based on following earlier discussions, I think the general preference is that there is no silent copy?
Certainly given the fact we want to expand this protocol to support GPU data as well (https://github.com/apache/arrow/issues/38325), I think not implementing this right now is the best option for objects backed by GPU data.
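To make that hasattr-based check concrete, a consumer-side feature test could look roughly like this (the helper name is illustrative):

```python
def supports_arrow_pycapsule(obj) -> bool:
    # Producers advertise the protocol simply by having the dunder methods;
    # their absence (e.g. for GPU-backed data) means "not supported".
    return hasattr(obj, "__arrow_c_schema__") and (
        hasattr(obj, "__arrow_c_array__") or hasattr(obj, "__arrow_c_stream__")
    )
```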
If we support this CPU-only protocol, why don't we support other CPU-only protocols as well?
Which other CPU-only protocol are you thinking about? I am not aware of any other dataframe protocol. (There are of course array protocols, which you might be hinting at, but at the buffer/array level we already have DLPack, which supports devices. So adding another array protocol at that level is much less needed, given that most array libraries nowadays support DLPack. Personally, I think adding general Python buffer protocol support to Buffer would be nice, though.)
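As a rough illustration of what buffer protocol support would give for free, the sketch below is more or less what a consumer has to write by hand today to view a CPU Buffer's memory; it is only a sketch (lifetime management and non-CPU devices are ignored), relying on the Buffer's existing `ptr` and `bufsize` attributes:

```python
import ctypes


def buffer_to_memoryview(buf) -> memoryview:
    # 'buf' is an interchange-protocol Buffer exposing .ptr and .bufsize;
    # wrap the raw bytes without copying (CPU-only, lifetime not managed here).
    raw = (ctypes.c_uint8 * buf.bufsize).from_address(buf.ptr)
    return memoryview(raw)
```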
So if we decide that GPU libraries shouldn't support this, then they should simply not add those methods.
I agree - if they'd start copying data to host memory now to support the CPU-only protocol, that would make it much harder to later add the device-agnostic version.
If there was movement and a timeline on https://github.com/apache/arrow/issues/38325, it'd be easier to say "let's not do the CPU-only version". However, it looks like that may take quite a while. Somehow work on these protocols is always slow going (DLPack was/is no exception). So it seems reasonable to me to start supporting the CPU protocol now.
There is indeed fragmentation on the array side, with too many different CPU-only protocols. However, on the dataframe side things are cleaner. If things could stay limited to:

- `__dataframe__` and `from_dataframe` at the Python API level
- the Arrow CPU-only protocol as dunder methods
- the Arrow device-agnostic protocol as dunder methods

then it looks like things are in pretty good shape.
I am still a strong -1 on this proposal.
The array API didn't implement `__array__` or `__array_interface__` despite those being more widely adopted and used than the equivalent Arrow protocols being added here. They weren't implemented because they didn't want to fragment implementations or cause downstream pain in leveraging non-CPU, non-numpy implementations.
We have already defined a new interchange protocol that can handle Arrow-formatted data zero-copy, in addition to handling non-CPU memory and non-Arrow memory layouts.
That's OK
@jorisvandenbossche we can still use this in pandas/polars/pyarrow `interchange.from_dataframe`, e.g.
```python
def from_dataframe(df: SupportsInterchange, *, allow_copy: bool = True) -> DataFrame:
    if hasattr(df, '__arrow_c_schema__'):
        return fastpath_constructor(df, allow_copy=allow_copy)
    return _from_dataframe(
        df.__dataframe__(allow_copy=allow_copy),  # type: ignore[arg-type]
        allow_copy=allow_copy,
    )
```
even if it's not officially part of the interchange protocol
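As an aside, any consumer that already understands the Arrow PyCapsule protocol can take such an object directly. With a recent pyarrow (which accepts objects exposing `__arrow_c_stream__` or `__arrow_c_array__`), the fast path sketched above amounts to roughly:

```python
import pyarrow as pa

# 'df' is any object exposing __arrow_c_stream__ (e.g. the DataFrame sketch above);
# pyarrow consumes the capsule directly, without going through the interchange protocol.
table = pa.table(df)
```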
The array API didn't implement `__array__` or `__array_interface__` despite those being more widely adopted and used than the equivalent Arrow protocols being added here. They weren't implemented because they didn't want to fragment implementations or cause downstream pain in leveraging non-CPU, non-numpy implementations.
I understand why the array API didn't implement those, but I'm not sure how that is relevant for this discussion? I am not proposing to add this protocol based on some popularity metric, but because it adds missing features and convenience. The array protocols you mentioned also wouldn't add much additional value for the Array API, because it already has DLPack covering the same ground. For the DataFrame interchange protocol, though, that is not the case. What I am proposing here adds value on top of what we already have: it allows accessing the data at the DataFrame/Column level instead of only at the Buffer level, and it exposes the data through a stable C ABI interface, similarly to what DLPack does for arrays.
@kkraus14 what is not clear to me: is your objection mainly based on the lack of non-CPU device support, or is it more general? Assume for a moment that the Arrow community had already resolved the discussion around adding device support to the Arrow PyCapsule protocol, and we could directly add it here as well. Would you then be OK with it?
we can still use this in pandas/polars/pyarrow interchange.from_dataframe, ... even if it's not officially part of the interchange protocol
@MarcoGorelli that's certainly possible and would already improve that part of the interchange protocol's use cases, but it still doesn't enable usage of the interchange protocol in places that require Arrow memory (taking duckdb as the typical example: they support Arrow memory, and I doubt they would ever support the interchange protocol directly. They could of course already support it easily if they wanted by depending on pyarrow, but that dependency isn't needed with this PR).
@kkraus14 what is not clear to me: is your objection mainly based on the lack of non-CPU device support, or is it more general? Assume for a moment that the Arrow community had already resolved the discussion around adding device support to the Arrow PyCapsule protocol, and we could directly add it here as well. Would you then be OK with it?
I would be more okay with it, but I still feel funny about giving the Arrow memory layout arguably preferential treatment over non-Arrow memory layouts, though we already do that for strings more or less.
I.e. if someone uses a sentinel value or a byte mask for representing nulls, should they not support this protocol, or should they support it via a copy + conversion? Neither is ideal from the perspective of a library that doesn't use the Arrow format for representing nulls but wants to support a standard.
I still feel funny about giving the Arrow memory layout arguably preferential treatment over non-Arrow memory layouts, though we already do that for strings more or less.
We indeed already use Arrow for several things. But I would also say that Arrow deserves this special treatment because it is the most widely adopted standard for representing tabular data?
I.e. if someone uses a sentinel value or a byte mask for representing nulls, should they not support this protocol, or should they support it via a copy + conversion? Neither is ideal from the perspective of a library that doesn't use the Arrow format for representing nulls but wants to support a standard.
If the consuming library doesn't use Arrow-compatible memory, it doesn't need to consume the dataframe object through this protocol. The existing Column Buffers access is still there, allowing you to get the data exactly as it is.
For the producing side, if the dataframe library doesn't use Arrow-compatible memory, it indeed needs to make a copy if it wants to support this protocol. But that copy would need to be made anyway if the consumer wants Arrow data. (This is what we did for pandas when implementing this protocol directly on pd.DataFrame, i.e. depending on the data (type), it will not always be zero-copy. But note that the current interchange protocol is also not always zero-copy for pandas, in the case of string columns, or in the case of columns backed by a 2D block with row-major layout.)
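As a sketch of that producer-side copy/conversion, assuming numpy and pyarrow (the function name and the sentinel choice are illustrative): a library that stores nulls as a sentinel value has to materialize an Arrow validity bitmap when exporting, so the export cannot be zero-copy.

```python
import numpy as np
import pyarrow as pa


def sentinel_column_to_arrow(values: np.ndarray, sentinel: int = -1) -> pa.Array:
    # The validity bitmap has to be built from the sentinel values, so the
    # export is necessarily a copy/conversion rather than zero-copy.
    mask = values == sentinel          # True marks a null entry
    return pa.array(values, mask=mask)


arr = sentinel_column_to_arrow(np.array([1, -1, 3]))
capsules = arr.__arrow_c_array__()     # (ArrowSchema, ArrowArray) PyCapsules
```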