dataframe-api icon indicating copy to clipboard operation
dataframe-api copied to clipboard

from_dataframe: only interchange supported columns?

Open MarcoGorelli opened this issue 2 years ago • 9 comments

I just wanted to raise this here https://github.com/mwaskom/seaborn/issues/3533

Try to plot a polars dataframe with a list column raises, even if that column wouldn't have been plotted

Arguably this the issue is that seaborn is trying to convert all columns, rather than only the ones it needs

But still, wondering if there's a solution here - like, that from_dataframe only converts columns if they're of a supported data type, rather than raising. Maybe we could add an option for this in the pandas from_dataframe method, and note that it's technically not part of the protocol

MarcoGorelli avatar Oct 20 '23 18:10 MarcoGorelli

In my opinion, that seems to be working as expected given we don't yet support data types like LIST.

Selecting the relevant columns to interchange seems like something that should either be done using the library specific API or using the API standard we're working on?

kkraus14 avatar Oct 20 '23 18:10 kkraus14

selecting columns can be done with __dataframe__, no need for the dataframe api for this

in theory it is working as expected, it's just a pity to see a library introduce a regression when they move from to_pandas() to using the interchange protocol

MarcoGorelli avatar Oct 20 '23 18:10 MarcoGorelli

selecting columns can be done with __dataframe__, no need for the dataframe api for this

You can't select columns in the actual __dataframe__ call, but on the object returned from calling __dataframe__. I don't think it's well defined behavior to call __dataframe__ against a DataFrame with unsupported column dtypes?

kkraus14 avatar Oct 20 '23 19:10 kkraus14

I would also say that this is the responsibility of the user of the interchange protocol (i.e. in this case seaborn) to first select the columns it needs before converting (which is also useful for other reasons, eg because the conversion can be costly).

I think silently dropping columns in the conversion can also be confusing / lead to corner cases. Assume someone passes a polars dataframe to seaborn and specifies x="my_column" in the seaborn plotting method. If accidentally this column is not one supported by the interchange protocol, and it would be dropped in the conversion, then seaborn would raise a confusing error about "my_column" not being a column name, instead of a type error about the dtype not being supported.

jorisvandenbossche avatar Oct 20 '23 19:10 jorisvandenbossche

You can't select columns in the actual __dataframe__ call, but on the object returned from calling __dataframe__. I don't think it's well defined behavior to call __dataframe__ against a DataFrame with unsupported column dtypes?

Maybe we didn't really define it (and it would be good to do so), but my assumption is that __dataframe__ itself doesn't yet do any conversion, and so it is safe to call this with unsupported column types. In any case that's how all the implementations I am aware of work, I think.

jorisvandenbossche avatar Oct 20 '23 19:10 jorisvandenbossche

Maybe we didn't really define it (and it would be good to do so), but my assumption is that __dataframe__ itself doesn't yet do any conversion, and so it is safe to call this with unsupported column types. In any case that's how all the implementations I am aware of work, I think.

In that case, say you have a library specific dataframe with an unsupported dtype like LIST in a column. When would it throw?

  1. mydf.__dataframe__(...)
  2. mydf.__dataframe__(...).get_column(...)
  3. mydf.__dataframe__(...).get_column(...).dtype

kkraus14 avatar Oct 20 '23 19:10 kkraus14

I'd suggest that both get_column(...) and get_column(...).dtype have to throw (in the same place, namely inside get_column).

Maybe we didn't really define it (and it would be good to do so), but my assumption is that __dataframe__ itself doesn't yet do any conversion, and so it is safe to call this with unsupported column types.

I think we indeed didn't define anything beyond "nested data types are unsupported". And I suspect Joris is right that it'd be better to allow this - failing later will make it easier to write more generic code here.

rgommers avatar Oct 26 '23 16:10 rgommers

__dataframe__ is already being used here under the assumption that it doesn't fail so please let's not make that raise:

https://github.com/plotly/plotly.py/blob/57e4d1d33c67c5cc715bec5c3c240dd6f4c3b10d/packages/python/plotly/plotly/express/_core.py#L1428-L1431

                columns = list(necessary_columns)
                df = pd.api.interchange.from_dataframe(
                    df.__dataframe__().select_columns_by_name(columns)
                )

MarcoGorelli avatar Oct 26 '23 16:10 MarcoGorelli

I think throwing on unsupported dtype columns in the get_column(...) call is reasonable behavior and we should probably document this as a supported case.

kkraus14 avatar Oct 27 '23 00:10 kkraus14