Allow to reconstruct a library-specific DataFrame object from an interchange object

Open jorisvandenbossche opened this issue 3 years ago • 7 comments

From the discussions at EuroSciPy with scikit-learn developers (cc @ogrisel), the following use case came to mind. Assume you have a method that transforms your data; a workflow could be:

  1. accept any dataframe library object as input
  2. using the interchange protocol, robustly access the buffer for a column (e.g. as a numpy array) and transform the array
  3. reconstruct a dataframe object (same type as the input) as a return value
    1. put the transformed array back into (a copy of) the interchange protocol object, or construct a new protocol object from scratch
    2. given an interchange protocol object, create a library dataframe object (so calling from_dataframe from the input object's library)

This last step is currently not possible, because you don't (want to) know each possible library that implements __dataframe__ and where its from_dataframe lives.

This is closely related to a possible "namespace" like the one the array API uses (cf. https://github.com/data-apis/dataframe-api/issues/79). With that, this could look like:

def transform(df):
    df_obj = df.__dataframe__()
    df_obj_transformed = ...  # transform data in df_obj
    df_ns = df.__dataframe_namespace__()  # hypothetical namespace accessor
    return df_ns.from_dataframe(df_obj_transformed)

But we could also think about (shorter-term) alternatives directly tied to the interchange protocol object. For example, we could have a class method or attribute that points to the from_dataframe method of the library that created the object.
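To make the shorter-term alternative concrete, here is a toy sketch in plain Python. Everything in it is hypothetical: `ToyDataFrame` and `ToyInterchangeFrame` are made-up stand-ins for a library and its interchange object, the `from_dataframe` attribute on the interchange object is the proposed (not existing) hook, and columns are plain lists rather than protocol `Column` objects.

```python
class ToyInterchangeFrame:
    """Stand-in for an interchange object (not a real protocol implementation)."""

    def __init__(self, columns, from_dataframe):
        self._columns = {name: list(values) for name, values in columns.items()}
        # Hypothetical hook: points back to the producing library's constructor.
        self.from_dataframe = from_dataframe

    def column_names(self):
        return list(self._columns)

    def get_column_by_name(self, name):
        # Simplification: returns a plain list instead of a Column object.
        return self._columns[name]


class ToyDataFrame:
    """Stand-in for a library-specific dataframe."""

    def __init__(self, columns):
        self.columns = {name: list(values) for name, values in columns.items()}

    def __dataframe__(self, allow_copy=True):
        return ToyInterchangeFrame(self.columns, from_dataframe=type(self).from_dataframe)

    @classmethod
    def from_dataframe(cls, xchg_obj):
        return cls({name: xchg_obj.get_column_by_name(name)
                    for name in xchg_obj.column_names()})


def scale(df, factor):
    """Library-agnostic transform that returns the same type as its input."""
    xchg = df.__dataframe__()
    scaled = {name: [value * factor for value in xchg.get_column_by_name(name)]
              for name in xchg.column_names()}
    # The consumer never imports the producing library; it only follows the hook.
    new_xchg = ToyInterchangeFrame(scaled, from_dataframe=xchg.from_dataframe)
    return xchg.from_dataframe(new_xchg)
```

With this, `scale(ToyDataFrame({"a": [1, 2]}), 10)` returns a `ToyDataFrame` without `scale` knowing which library it is dealing with. Note that the sketch sidesteps the question of how a consumer builds a new interchange object, by reusing the toy class.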

jorisvandenbossche avatar Sep 13 '22 13:09 jorisvandenbossche

Thanks for the summary @jorisvandenbossche. @thomasjpfan might also be interested in this discussion.

ogrisel avatar Sep 13 '22 13:09 ogrisel

That seems like a very useful thing to support indeed.

But we could also think about (shorter-term) alternatives directly tied to the interchange protocol object. For example, we could have a class method or attribute that points to the from_dataframe method of the library that created the object.

I'm inclined to go this route, for a couple of reasons:

  1. The separate namespace may never come into existence. It would be analogous to how it's done in the array API standard, but we have discussed other options like a context manager to switch to "compliant mode".
  2. While from_dataframe seems perfectly reasonable to me, at least @kkraus14 had hesitations and did not want to expose such a function - at least yet. So best not to require a given name right now.
  3. Timing: this is quite easy to implement as a method, and adding a namespace is a bigger ask.

The signature that this constructor method should have is not 100% obvious though. Maybe the input dataframe has properties that need preserving (e.g. a _meta field), so from_dataframe(new_obj) doesn't quite cut it there.

And a separate question: given that the shape of the returned dataframe may be different from the input shape, and column names etc. may be different, does the user (scikit-learn here) need any functionality to construct new dataframe interchange objects? Or are we expecting them to reinvent the wheel there?

rgommers avatar Sep 13 '22 18:09 rgommers

The scikit-learn "transformer" use case would only need a standard way to call from_dataframe in addition to the existing __dataframe__.

However other parts of scikit-learn would benefit from a standard API:

  • a. indexing a subdataframe/view by column names would be useful to horizontally dispatch data by column subsets in the ColumnTransformer class (and maybe others in the future)
  • b. positional fancy indexing by row (like pandas' .iloc[position_idx]) would be useful to implement device-agnostic row-wise resampling, e.g. for cross-validation on a cuDF dataset (see for instance: https://github.com/scikit-learn/scikit-learn/issues/14036)

Note that if a standard from_dataframe API is specified then it's possible to implement a. (column based indexing) in the consumer library (e.g. scikit-learn) although it would be nice to have a convenience API to do so.

Implementing b. in the consumer library without a standard API and without triggering back and forth data movement between host & device memory (e.g. for cuDF) will be more challenging I think.

ogrisel avatar Sep 13 '22 21:09 ogrisel

indexing a subdataframe/view by column names would be useful to horizontally dispatch data by column subsets in the ColumnTransformer class (and maybe others in the future)

@ogrisel I'm trying to interpret the "subdataframe/view" part here, but I'm not 100% sure what you mean. The current protocol has a select_columns_by_name method, so that addresses the "indexing by column names" need. Can you elaborate?

positional fancy indexing by row (like pandas' .iloc[position_idx])

That is worth considering for the protocol perhaps. A purely positional get_rows in analogy to get_columns is perhaps not unreasonable to add, assuming that there are no foot guns. I cannot remember if this was discussed before.
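For illustration only, a purely positional row selection could behave like the toy sketch below. `ToyFrame` and `get_rows` are hypothetical names; nothing like `get_rows` exists in the protocol today, and a real implementation would work on buffers (possibly on a device) rather than Python lists. Rejecting negative positions is one assumed way to avoid foot guns, not a decided behavior.

```python
class ToyFrame:
    """Toy column store used only to illustrate positional row selection."""

    def __init__(self, columns):
        self._columns = {name: list(values) for name, values in columns.items()}

    def num_rows(self):
        # All columns are assumed to have the same length.
        return len(next(iter(self._columns.values()), []))

    def to_dict(self):
        return {name: list(values) for name, values in self._columns.items()}

    def get_rows(self, positions):
        """Hypothetical: return a new frame with the rows at the given positions."""
        n_rows = self.num_rows()
        positions = list(positions)
        for position in positions:
            # Strictly positional: no labels, no negative wrap-around.
            if not 0 <= position < n_rows:
                raise IndexError(f"row position {position} out of range for {n_rows} rows")
        return ToyFrame({name: [values[p] for p in positions]
                         for name, values in self._columns.items()})


# A consumer such as a cross-validation splitter only needs integer positions.
frame = ToyFrame({"x": [10, 20, 30, 40], "y": [1, 2, 3, 4]})
test_fold = frame.get_rows([3, 0])
train_fold = frame.get_rows([1, 2])
```

The point of the sketch is that the consumer passes only integer positions, so the producing library could carry out the selection wherever the data lives.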

rgommers avatar Sep 15 '22 15:09 rgommers

@ogrisel do note that, as there are now very few libraries actually supporting the protocol, you can rather easily introspect which library the interchange object belongs to, and then import the correct from_dataframe function. That is definitely not a standardized way of doing things, but it could get you moving while the consortium is working on the proper API.

Something like (untested):

def _get_from_df(xchg_obj):
    """Return the from_dataframe function of the library that created xchg_obj."""
    # The top-level package of the object's class identifies the library.
    lib = xchg_obj.__class__.__module__.split('.')[0]
    if lib == 'modin':
        from modin.utils import from_dataframe
    elif lib == 'pandas':
        from pandas.api.interchange import from_dataframe
    # add more branches for vaex and cudf
    else:
        raise RuntimeError(f'Unknown library: {lib}')
    return from_dataframe
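A table-driven variant of the same idea might look like the sketch below, which avoids growing the if/elif chain. The names `KNOWN_CONSTRUCTORS` and `resolve_from_dataframe` are made up for illustration. The pandas entry is the public location as of pandas 1.5; the modin entry is copied from the snippet above and is equally untested.

```python
import importlib

# Top-level package name -> (module path, attribute name).
KNOWN_CONSTRUCTORS = {
    "pandas": ("pandas.api.interchange", "from_dataframe"),
    "modin": ("modin.utils", "from_dataframe"),  # untested, taken from the snippet above
}


def resolve_from_dataframe(xchg_obj, registry=KNOWN_CONSTRUCTORS):
    """Look up the constructor for the library that produced xchg_obj."""
    lib = type(xchg_obj).__module__.split(".")[0]
    try:
        module_path, attr_name = registry[lib]
    except KeyError:
        raise RuntimeError(f"Unknown library: {lib}") from None
    # Import lazily so that only the library actually in use must be installed.
    return getattr(importlib.import_module(module_path), attr_name)
```

Passing the registry as a parameter lets a consumer add entries for other libraries without editing the function. It remains a stopgap with the same limitation: the consumer has to know every library up front.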

vnlitvinov avatar Sep 16 '22 16:09 vnlitvinov

@ogrisel I'm trying to interpret the "subdataframe/view" part here, but I'm not 100% sure what you mean. The current protocol has a select_columns_by_name method, so that addresses the "indexing by column names" need. Can you elaborate?

This indeed would be enough with the addition of a standard way to rebuild the dataframe object with a public from_dataframe factory.

ogrisel avatar Sep 19 '22 08:09 ogrisel

with the addition of a standard way to rebuild the dataframe object with a public from_dataframe factory.

Folks on the call last Thursday seemed to be happy with this idea, as long as this is a constructor that can be retrieved directly from the dataframe object in the interchange protocol. That would make it easier to reconstruct a dataframe from the correct library. In the absence of it, users are probably more likely to grab the pandas from_dataframe function, which is less desirable for non-pandas input.

xref gh-42 for the signature of this constructor.

rgommers avatar Sep 19 '22 10:09 rgommers