Rethink approach to grouping in plot phase of Plot

Open mwaskom opened this issue 2 years ago • 0 comments

Currently we can declare at the property level that a Mark property should form groups. We need this because different marks behave differently: e.g. each line has the same values for all its properties and is added separately, while a scatterplot can mix multiple property values in a single artist. But I don't think we have encountered, or will encounter, cases where only a subset of a mark's properties should group, so it is cumbersome to have to set grouping= for every property.

Instead, I think this can be determined within Mark._plot by adding a parameter to the split_gen generator, where the mark passes in the properties that should be grouping.
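To make the idea concrete, here is a rough sketch of a split generator that takes the grouping properties as an argument. This is not the existing split_gen implementation: the name split_gen_sketch, its signature, and the grouping_props parameter are all hypothetical, and it assumes the data is still a pandas dataframe at this point.

```python
import pandas as pd


def split_gen_sketch(data, grouping_props):
    """Yield (keys, subset) pairs, splitting only on the named properties.

    Hypothetical helper: `grouping_props` is the list a mark would pass in
    to say which of its properties should form groups.
    """
    # Only split on grouping properties that are present in the data
    grouping_vars = [p for p in grouping_props if p in data.columns]

    if not grouping_vars:
        # Nothing to split on: the mark draws everything in one pass
        yield {}, data
        return

    for key, subset in data.groupby(grouping_vars, sort=False):
        # pandas may return a scalar key for a single grouper; normalize it
        if not isinstance(key, tuple):
            key = (key,)
        yield dict(zip(grouping_vars, key)), subset


data = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "y": [1, 3, 2, 4],
    "color": ["a", "a", "b", "b"],
})

# A line-like mark would pass all of its properties, so each line is separate
line_groups = list(split_gen_sketch(data, ["color"]))

# A scatter-like mark would pass nothing and get the full data in one subset
dot_groups = list(split_gen_sketch(data, []))
```

With something like this, the decision lives in one place per mark rather than on every property declaration.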

This also touches on a broader issue, which is that the current grouping is relatively inefficient (e.g. see #2881). Ideally, we would scale all data points at once and then group, which can be faster. The challenge has been that we no longer have a dataframe after scaling. The main reason is that working with colors as rgba tuples / n x 4 arrays is difficult in the context of a dataframe: you can stick the tuples in a series, but then it has an object dtype that propagates through to the numpy array and works poorly with matplotlib. A few options would be:

  • Implement our own groupby logic on the dict of arrays / lists that we have after scaling
  • Store rgba values as separate columns in the dataframe we build while scaling (we could perhaps use a different internal color representation to facilitate things like luminance properties)
  • Implement some kind of RGBA extension array that lets a Series hold a 2d data structure (is this possible? I am not sure)
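As a rough illustration of the first option, here is a minimal sketch of grouping a dict of arrays with numpy alone. The function name group_arrays and the layout of the scaled dict are assumptions for the example, not existing seaborn internals. Colors stay as an n x 4 float array throughout, so nothing ends up with an object dtype.

```python
import numpy as np


def group_arrays(arrays, by):
    """Yield one dict of sub-arrays per unique combination of the `by` keys.

    Hypothetical helper: `arrays` maps names to arrays of equal length n
    (1-d, or n x 4 for rgba colors); `by` lists the names to group on.
    """
    if not by:
        yield dict(arrays)
        return

    n = len(next(iter(arrays.values())))
    codes = np.zeros(n, dtype=np.int64)
    for name in by:
        # axis=0 treats each row as one value, so rgba rows are compared whole
        uniq, inverse = np.unique(
            np.asarray(arrays[name]), axis=0, return_inverse=True
        )
        # Flatten in case the inverse comes back with an extra dimension
        codes = codes * len(uniq) + inverse.reshape(-1)

    # Sort once, then cut the sorted order wherever the group code changes
    order = np.argsort(codes, kind="stable")
    boundaries = np.flatnonzero(np.diff(codes[order])) + 1
    for idx in np.split(order, boundaries):
        yield {name: np.asarray(values)[idx] for name, values in arrays.items()}


scaled = {
    "x": np.array([0.0, 1.0, 2.0, 3.0]),
    "y": np.array([1.0, 3.0, 2.0, 4.0]),
    "color": np.array([
        [1.0, 0.0, 0.0, 1.0],
        [0.0, 0.0, 1.0, 1.0],
        [1.0, 0.0, 0.0, 1.0],
        [0.0, 0.0, 1.0, 1.0],
    ]),
}

groups = list(group_arrays(scaled, by=["color"]))
```

This sorts once and slices, rather than building a boolean mask per group, which is where I would expect most of the speedup to come from. It does not preserve the order in which groups first appear, which may or may not matter for layering.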

mwaskom, Jul 11 '22 00:07