
Some comments on interchange API from an Arrow developer


Hi all, great to see some continued work on this project after the original discussion last year. I still think it's useful to allow libraries to "throw data over the wall" without forcing eager serialization to a particular format (like pandas or Arrow).


From the Column docstring:

    TBD: Arrow has a separate "null" dtype, and has no separate mask concept.
         Instead, it seems to use "children" for both columns with a bit mask,
         and for nested dtypes. Unclear whether this is elegant or confusing.
         This design requires checking the null representation explicitly.

Could you clarify what is confusing? I do not understand the statement 'Instead, it seems to use "children" for both columns with a bit mask, and for nested dtypes.'

Later in the same docstring:

         The Arrow design requires checking:
         1. the ARROW_FLAG_NULLABLE (for sentinel values)
         2. if a column has two children, combined with one of those children
            having a null dtype.

         Making the mask concept explicit seems useful. One null dtype would
         not be enough to cover both bit and byte masks, so that would mean
         even more checking if we did it the Arrow way.

I assume you mean the Arrow C data interface here. Could you clarify what these points mean?

  • If ARROW_FLAG_NULLABLE is set, then the first buffer (the validity bitmap) governs nullness.
  • The second point ("if a column has two children, combined with one of those children having a null dtype") does not make sense to me, because nulls in Arrow are determined exclusively by the validity bitmap. Arrow does have a "Null" type, but it represents data whose values are all null and has no associated buffers.
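
To make the second point concrete, a minimal pyarrow sketch (illustrative only, not part of either spec):

    import pyarrow as pa

    # Nulls in an ordinary column live in the validity bitmap, which is
    # buffer 0 of the array, not a child array:
    arr = pa.array([1, None, 3])
    validity, data = arr.buffers()
    assert arr.null_count == 1

    # The separate "null" type describes data whose values are all null;
    # it carries no associated buffers:
    nulls = pa.nulls(3)
    assert str(nulls.type) == "null"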

Re: "One null dtype would not be enough to cover both bit and byte masks, so that would mean even more checking if we did it the Arrow way.", I don't know what this means; could you clarify?


Column methods

    @property
    def null_count(self) -> Optional[int]:
        """
        Number of null elements, if known.
        Note: Arrow uses -1 to indicate "unknown", but None seems cleaner.
        """
        pass

Here you should indicate that you mean the Arrow C data interface (where null_count is an int64).
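
For concreteness, a small sketch of the translation being discussed; the helper name is illustrative and not part of either spec:

    from typing import Optional

    def c_null_count_to_optional(null_count: int) -> Optional[int]:
        """Translate the C interface's -1 "unknown" sentinel into None."""
        return None if null_count < 0 else null_count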


General comments / questions

  • It feels a little unfortunate to go "halfway to Arrow" by adding the string offsets buffer, which always requires serialization for variable-size binary data (see the sketch after this list). I haven't thought through alternatives for string data that would not force this serialization (similar to the API proposal from https://github.com/wesm/dataframe-protocol/pull/1); if the goal of this API is to reduce the need to serialize, having some alternative here might be worthwhile.
  • It seems like it would be useful to provide some to_* methods on Column (like Column.to_arrow). I guess pyarrow could provide a built-in implementation of this interface, but some producers might be able to produce Arrow or NumPy arrays directly and skip the lower-level memory export provided here.
  • Is there interest in supporting nested / non-flat data in this interchange API? I notice that they are explicitly excluded from the dtype docstring, but I would encourage you to think about them up front rather than bolting them on later.
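
To make the offsets point above concrete, a short pyarrow sketch of Arrow's variable-size binary layout (illustrative only):

    import pyarrow as pa

    arr = pa.array(["foo", "barbaz"])
    validity, offsets, data = arr.buffers()
    # validity is None here (no nulls); offsets is an int32 buffer
    # holding [0, 3, 9]; data holds the concatenated UTF-8 bytes.
    # A producer that does not already store strings this way must
    # materialize the offsets buffer to export through the protocol.
    assert data.to_pybytes() == b"foobarbaz"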

wesm avatar Aug 30 '21 21:08 wesm

Unfortunately, there are very few alternatives here, and I'd argue they're all much less appealing than the Arrow string format. We could support null-terminated strings, which would not require an offsets buffer. We could also support pandas-style arrays of pointers to Python string objects. In theory we could also support a C/C++-style vector of strings.

Supporting all of these cases would become a nightmare for consumers of the protocol.

  • It seems like it would be useful to provide some to_* methods on Column (like Column.to_arrow). I guess pyarrow could provide a built-in implementation of this interface, but some producers might be able to produce Arrow or NumPy arrays directly and skip the lower-level memory export provided here.

I was one of the main voices against this because it conflates containers and memory. I.e., Arrow arrays can technically be backed by either CPU or NVIDIA GPU memory. Does calling to_arrow force a move to host memory? If we provide a to_numpy function, it will inevitably be abused when people just want a generic array that conforms to some API like the array API standard.
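
A hypothetical sketch of the container-vs-memory concern; none of these names come from the protocol:

    from enum import Enum, auto

    class Device(Enum):
        CPU = auto()
        CUDA = auto()

    class Column:
        """Stand-in for a protocol column, for illustration only."""

        def __init__(self, device: Device):
            self.device = device

        def to_arrow(self):
            # A to_arrow method has to pick a policy here: silently copy
            # device memory to the host, or fail. Exposing raw buffers
            # plus a device tag instead leaves that decision to the consumer.
            if self.device is not Device.CPU:
                raise NotImplementedError("implicit device-to-host copy?")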

  • Is there interest in supporting nested / non-flat data in this interchange API? I notice that they are explicitly excluded from the dtype docstring, but I would encourage you to think about them up front rather than bolting them on later.

Yes, there is interest, but it was decided to get the more basic types out first and then revisit nested types, because of the extra complexity they bring: https://github.com/data-apis/dataframe-api/blob/main/protocol/dataframe_protocol_summary.md#protocol-design-requirements
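
For a sense of that extra complexity, a short pyarrow sketch of how a nested column carries its values in a child array plus its own offsets (illustrative only):

    import pyarrow as pa

    arr = pa.array([[1, 2], None, [3]])
    print(arr.type)     # list<item: int64>
    print(arr.values)   # the flattened child array: [1, 2, 3]
    print(arr.offsets)  # [0, 2, 2, 3]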

kkraus14 avatar Aug 30 '21 21:08 kkraus14