
Add `slice_rows` to interchange protocol

Open MarcoGorelli opened this issue 1 year ago • 8 comments

closes #204

MarcoGorelli avatar Feb 13 '24 16:02 MarcoGorelli

In the case of something like pandas or another dataframe library that doesn't use the Arrow memory layout under the hood, they'd presumably materialize Arrow on the __dataframe__ call and then have to slice the Arrow-format memory, which, if it contains strings or the slice has a step size, isn't free. This is already potentially a problem when selecting columns as well, so I guess this inefficiency is nothing new?

Additionally, it makes it a bit hard to reason about when the producer and when the consumer should do row selection. For example, if Polars is consuming data from, say, PyArrow, I imagine Polars would rather handle row slicing itself (assuming you'll hit a situation where it's not pure pointer arithmetic). In the situation of pandas consuming data from, say, Polars, you'd probably want Polars to handle the row slicing.

Arrow interchange protocols handle the slicing case (ignoring step size) by letting you specify an offset and a size. Maybe we can do something similar here?
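To make the offset-and-size idea concrete, here is a minimal sketch in plain Python. `ToyColumn` and its `slice_rows(offset, length)` method are hypothetical names invented for illustration, not part of Arrow or of the interchange protocol; the sketch only assumes a fixed-width column, where slicing without a step can be pure bookkeeping on a shared buffer.

```python
from array import array
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyColumn:
    """Hypothetical fixed-width column: a shared buffer plus an (offset, length) window."""

    buffer: memoryview
    offset: int
    length: int

    def slice_rows(self, offset: int, length: int) -> "ToyColumn":
        # Only the window changes; the underlying buffer is shared, not copied.
        if offset < 0 or length < 0 or offset + length > self.length:
            raise ValueError("slice out of bounds")
        return ToyColumn(self.buffer, self.offset + offset, length)

    def to_list(self) -> list:
        # Materialize only the rows inside the current window.
        return self.buffer[self.offset:self.offset + self.length].tolist()


data = array("q", range(10))  # ten 64-bit integers: 0..9
col = ToyColumn(memoryview(data), 0, len(data))
window = col.slice_rows(3, 4)  # rows 3, 4, 5, 6 of the original column
print(window.to_list())
```

A step size or a variable-width type such as strings does not fit this (offset, length) window as neatly, which is where the cost mentioned above would come from.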

kkraus14 avatar Feb 13 '24 16:02 kkraus14

sounds good, thanks

MarcoGorelli avatar Feb 13 '24 16:02 MarcoGorelli

Do we expect / want to encourage developers using dataframe libraries to explicitly call __dataframe__ themselves as opposed to using libraryx.from_dataframe(...)? It feels a bit funky to me currently that we go from say:

pl_df = ...  # My polars dataframe
pdf = pandas.from_dataframe(pl_df)

to:

pl_df = ...  # My polars dataframe
pdf = pandas.from_dataframe(pl_df.__dataframe__().select_columns(...).slice_rows(...))

My 2c is that this is just highlighting the lack of standard API here and that the experience should be something along the lines of (ignoring API names for column selection and row slicing):

pl_df = ...  # My polars dataframe
pdf = pandas.from_dataframe(pl_df.cols(...).slice_rows(...))

kkraus14 avatar Feb 13 '24 16:02 kkraus14

Would be good to have others chime in here, since this interchange protocol is already being adopted and we probably don't want to introduce something and later decide to change or remove it.

kkraus14 avatar Feb 13 '24 17:02 kkraus14

It's what plotly already does to not have to convert the entire dataframe

MarcoGorelli avatar Feb 13 '24 17:02 MarcoGorelli

Any updates here please?

This is the only thing I plan to try adding to the interchange protocol, promised

I think of the interchange protocol as being useful for converting between libraries and doing some preselection in a standardised way:

  • select columns (currently possible)
  • select rows (not possible)
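As a rough sketch of what that preselection could look like from the consumer side: `ToyInterchangeFrame` below is a made-up stand-in, not a real library class. `select_columns_by_name` mirrors the name of the existing protocol method, while the `slice_rows(start, stop)` signature is an assumption for illustration only.

```python
from typing import Dict, List, Optional


class ToyInterchangeFrame:
    """Hypothetical stand-in for an interchange object (not a real library class)."""

    def __init__(self, columns: Dict[str, List], start: int = 0, stop: Optional[int] = None):
        self._columns = columns
        n_rows = len(next(iter(columns.values()))) if columns else 0
        self._start = start
        self._stop = n_rows if stop is None else stop

    def select_columns_by_name(self, names: List[str]) -> "ToyInterchangeFrame":
        # Column preselection: keep only the requested columns, same row window.
        kept = {name: self._columns[name] for name in names}
        return ToyInterchangeFrame(kept, self._start, self._stop)

    def slice_rows(self, start: int, stop: int) -> "ToyInterchangeFrame":
        # Assumed signature: positions are relative to the current row window.
        new_start = min(self._start + start, self._stop)
        new_stop = min(self._start + stop, self._stop)
        return ToyInterchangeFrame(self._columns, new_start, new_stop)

    def materialize(self) -> Dict[str, List]:
        # Stand-in for the consumer's conversion step: only the subset is copied.
        return {name: values[self._start:self._stop] for name, values in self._columns.items()}


frame = ToyInterchangeFrame(
    {"a": [1, 2, 3, 4, 5], "b": list("vwxyz"), "c": [0.1, 0.2, 0.3, 0.4, 0.5]}
)
subset = frame.select_columns_by_name(["a", "b"]).slice_rows(1, 3)
preview = subset.materialize()
print(preview)
```

The point of the sketch is that the consumer converts only the selected columns and rows, rather than converting the entire dataframe first and subsetting afterwards.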

cc @rgommers @jorisvandenbossche

MarcoGorelli avatar Feb 22 '24 12:02 MarcoGorelli

gentle ping

(would really like to get this in for pandas 3.0 tbh, and this topic actually has a real world use case https://github.com/microsoft/vscode-jupyter/pull/13951)


this is just highlighting the lack of standard API here

the "standard api" solution would be:

pandas.from_dataframe(pl_df.__dataframe_consortium_standard__().select(...).take(...))

does that really look any less clunky?

MarcoGorelli avatar Feb 27 '24 13:02 MarcoGorelli

I think of the interchange protocol as being useful for converting between libraries and doing some preselection in a standardised way:

  • select columns (currently possible)
  • select rows (not possible)

The ability to select a subset of rows in addition to selecting columns seems harmonious.

Implementation in Modin should not be a problem.

+1

anmyachev avatar Apr 03 '24 13:04 anmyachev

Any updates please?

MarcoGorelli avatar Jun 07 '24 13:06 MarcoGorelli

closing due to lack of interest (this PR has been open for 5 months), thanks all for comments

MarcoGorelli avatar Jul 10 '24 12:07 MarcoGorelli