from_dataframe: only interchange supported columns?
I just wanted to raise this here https://github.com/mwaskom/seaborn/issues/3533
Try to plot a polars dataframe with a list column raises, even if that column wouldn't have been plotted
Arguably this the issue is that seaborn is trying to convert all columns, rather than only the ones it needs
But still, wondering if there's a solution here - like, that from_dataframe only converts columns if they're of a supported data type, rather than raising. Maybe we could add an option for this in the pandas from_dataframe method, and note that it's technically not part of the protocol
In my opinion, that seems to be working as expected given we don't yet support data types like LIST.
Selecting the relevant columns to interchange seems like something that should either be done using the library specific API or using the API standard we're working on?
selecting columns can be done with __dataframe__, no need for the dataframe api for this
in theory it is working as expected, it's just a pity to see a library introduce a regression when they move from to_pandas() to using the interchange protocol
selecting columns can be done with
__dataframe__, no need for the dataframe api for this
You can't select columns in the actual __dataframe__ call, but on the object returned from calling __dataframe__. I don't think it's well defined behavior to call __dataframe__ against a DataFrame with unsupported column dtypes?
I would also say that this is the responsibility of the user of the interchange protocol (i.e. in this case seaborn) to first select the columns it needs before converting (which is also useful for other reasons, eg because the conversion can be costly).
I think silently dropping columns in the conversion can also be confusing / lead to corner cases. Assume someone passes a polars dataframe to seaborn and specifies x="my_column" in the seaborn plotting method. If accidentally this column is not one supported by the interchange protocol, and it would be dropped in the conversion, then seaborn would raise a confusing error about "my_column" not being a column name, instead of a type error about the dtype not being supported.
You can't select columns in the actual
__dataframe__call, but on the object returned from calling__dataframe__. I don't think it's well defined behavior to call__dataframe__against a DataFrame with unsupported column dtypes?
Maybe we didn't really define it (and it would be good to do so), but my assumption is that __dataframe__ itself doesn't yet do any conversion, and so it is safe to call this with unsupported column types. In any case that's how all the implementations I am aware of work, I think.
Maybe we didn't really define it (and it would be good to do so), but my assumption is that
__dataframe__itself doesn't yet do any conversion, and so it is safe to call this with unsupported column types. In any case that's how all the implementations I am aware of work, I think.
In that case, say you have a library specific dataframe with an unsupported dtype like LIST in a column. When would it throw?
-
mydf.__dataframe__(...) -
mydf.__dataframe__(...).get_column(...) -
mydf.__dataframe__(...).get_column(...).dtype
I'd suggest that both get_column(...) and get_column(...).dtype have to throw (in the same place, namely inside get_column).
Maybe we didn't really define it (and it would be good to do so), but my assumption is that
__dataframe__itself doesn't yet do any conversion, and so it is safe to call this with unsupported column types.
I think we indeed didn't define anything beyond "nested data types are unsupported". And I suspect Joris is right that it'd be better to allow this - failing later will make it easier to write more generic code here.
__dataframe__ is already being used here under the assumption that it doesn't fail so please let's not make that raise:
https://github.com/plotly/plotly.py/blob/57e4d1d33c67c5cc715bec5c3c240dd6f4c3b10d/packages/python/plotly/plotly/express/_core.py#L1428-L1431
columns = list(necessary_columns)
df = pd.api.interchange.from_dataframe(
df.__dataframe__().select_columns_by_name(columns)
)
I think throwing on unsupported dtype columns in the get_column(...) call is reasonable behavior and we should probably document this as a supported case.