boxplot crashes when data is a Series that doesn't contain an index 0

Open jhncls opened this issue 2 years ago • 7 comments

Seaborn versions: 11.2 and dev

boxplot crashes when data is a pandas Series with a numeric index that doesn't contain 0. To reproduce:

import seaborn as sns
import pandas as pd

df = pd.DataFrame({'val': [1, 2]}, index=[1, 2])
sns.boxplot(data=df['val'])

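This crashes at line 447 of categorical.py, which checks np.isscalar(data[0]) with data = df['val']. Because the Series has an integer index without the label 0, data[0] is a label lookup and raises a KeyError.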

A simple workaround is to pass df['val'].values as data.
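The failure can be reproduced without seaborn. The following is a minimal sketch using only pandas and NumPy; the variable names are illustrative and not taken from seaborn's code. It shows that label-based lookup of 0 fails on this Series, while positional access and the underlying array both work:

```python
import numpy as np
import pandas as pd

# Same Series as in the report: integer index without the label 0.
s = pd.DataFrame({'val': [1, 2]}, index=[1, 2])['val']

# With an integer index, s[0] is a label lookup, so it raises KeyError here.
try:
    s[0]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# Positional access and the plain NumPy array are unaffected by the index.
first_by_position = s.iloc[0]
first_from_array = s.values[0]

print(lookup_failed, first_by_position, np.isscalar(first_from_array))
```

This is why passing df['val'].values sidesteps the crash: the array carries no index labels.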

jhncls avatar Mar 10 '22 11:03 jhncls

boxplot isn't documented as accepting a Series for the data parameter, although it could probably be made to work somehow...

mwaskom avatar Mar 10 '22 12:03 mwaskom

The documentation suggests that it might work. And it seems to work with a Series whose index is categorical or contains 0.

A code change in function establish_variables() (line 437) might be:

            elif isinstance(data, pd.Series):
                plot_data = [np.asarray(data, float)]
                group_names = [data.name]
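
To see what the proposed branch would produce, here is a standalone sketch. The function name series_to_plot_data is hypothetical and exists only for illustration; it is not part of seaborn, and the real establish_variables() handles many more cases:

```python
import numpy as np
import pandas as pd


def series_to_plot_data(data):
    """Hypothetical helper mirroring the proposed branch (not seaborn code)."""
    if isinstance(data, pd.Series):
        # Treat the Series as a single group of float values,
        # labelled with the Series' name.
        plot_data = [np.asarray(data, float)]
        group_names = [data.name]
        return plot_data, group_names
    raise TypeError("this sketch only handles pandas Series")


s = pd.Series([1, 2], index=[1, 2], name='val')
plot_data, group_names = series_to_plot_data(s)
print(plot_data, group_names)
```

Converting with np.asarray ignores the index entirely, so the missing label 0 no longer matters.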

jhncls avatar Mar 10 '22 13:03 jhncls

The documentation suggests that it might work

Can you point me to where? I believe you but the parameter docs for boxplot specifically say "data: DataFrame, array, or list of arrays".

One complication is that some other categorical plots (e.g. barplot, pointplot) do handle Series "wide-form" inputs although they treat them as if they were passed to y and draw a single group, but I don't think that's very helpful ... a non-aggregated bar/point plot of values against index (like what lineplot gives you) feels like the expected behavior there. But that would feel wrong for box/violinplot.

mwaskom avatar Mar 10 '22 13:03 mwaskom

The documentation says "Input data can be passed in a variety of formats, including:", and then gives an enumeration. Although the enumeration explicitly states that a Series is supposed to go as x, y or hue, an occasional user might suppose that it also works as data, especially since "An array or list of vectors" is also mentioned.

I don't really have a suggestion of how to write the documentation differently. Seaborn is open to a huge variety of formats, and usually comes up with something that a casual user would expect. It is amazing what is happening behind the scenes to make everything seem so natural.

My guess is that sns.boxplot(data=df['val']) and sns.boxplot(data=df['val'].values) should lead to the same plot (except for the extraction of the Series' name). This seems to be congruent with treating a Series similarly to a "wide" DataFrame with only one column.

The idea to treat 1D data for barplot and pointplot similarly to lineplot, and thus differently from boxplot, countplot and their other relatives, also seems interesting. But, it would make sns.barplot(data=df) behave differently depending on whether df has one versus more numeric columns. So, at first sight an intuitive idea, at second sight adding to the complexity, both for the internal code and for explaining it.

jhncls avatar Mar 10 '22 16:03 jhncls

But, it would make sns.barplot(data=df) behave differently depending on whether df has one versus more numeric columns

No, a dataframe with one column is still a dataframe / 2D structure, so that would receive the same treatment (as would a 2D array where one dimension has size = 1). It would only apply to data structures that seaborn considers "vectors", i.e. sequences of scalars.
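The vector versus 2D distinction can be checked directly on the objects themselves. This is only a sketch of the dimensionality involved, using the ndim attribute of pandas and NumPy objects; how seaborn actually classifies its inputs internally is not shown here:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'val': [1, 2]}, index=[1, 2])

# Number of dimensions for each kind of input discussed above.
dims = {
    "series": df['val'].ndim,                        # a "vector"
    "single_column_frame": df.ndim,                  # still 2D
    "array_1d": np.random.rand(10).ndim,             # a "vector"
    "array_2d_one_column": np.random.rand(10, 1).ndim,  # still 2D
}
print(dims)
```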

mwaskom avatar Mar 11 '22 13:03 mwaskom

OK. That makes sense. Today, I was surprised that the main difference between sns.kdeplot(data=np.random.rand(10)) and sns.kdeplot(data=np.random.rand(10,1)) is that the 1D version uses the color= parameter, ignoring the palette, while the 2D version does the opposite. It makes sense, of course, but it still surprises.

Adding all this info to the documentation might overwhelm the user. It is hard to find the sweet spot between too much and not enough explanation. Sometimes adding a warning might help, although that might also add to the confusion when figure-level functions get involved. All in all, I think you're tackling this in a superb way.

Maybe apart from the general "Choosing color palettes", also a more in-depth page about the myriad of options for the palette= parameter, and how it interacts with the many options for the data= could be helpful.

jhncls avatar Mar 11 '22 17:03 jhncls

It makes sense, of course, but it still surprises.

IMO the root issue here is that tidy/long-form data is better than "wide-form" data, where semantics need to be inferred from the dimensions of the data object. When using the latter, users need to be more aware of the details of how their data object is structured, and they need to know the (implicit) mappings between dimension and plot variable that each function will use. But, still, lots of people aren't comfortable with the table transformations needed to put their data in long-form, and find wide-form data handy.
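As an illustration of the kind of table transformation meant here, the sketch below reshapes a small wide-form frame into long-form with pandas' DataFrame.melt. The column names "a", "b", "group" and "val" are made up for the example:

```python
import pandas as pd

# Wide-form: each column is implicitly one group.
wide = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# Long-form: one row per observation, with the group made explicit.
long = wide.melt(var_name='group', value_name='val')
print(long)

# With long-form data the mapping is stated rather than inferred, e.g.:
#   sns.boxplot(data=long, x='group', y='val')
```

In long-form the roles of the columns are assigned by name, so nothing has to be inferred from the shape of the object.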

also a more in-depth page about the myriad of options for the palette= parameter, and how it interacts with the many options for the data= could be helpful

I actually don't think there's much interaction here? What matters is that sometimes wide-form data is treated as implicitly having a hue variable (e.g. columns of a 2D array or dataframe), in which case the palette kwarg takes precedence over color. So there are pages on data structures and palettes, but it's not obvious to me what one would need to say about the relationship between them.

mwaskom avatar Mar 12 '22 17:03 mwaskom