Refactoring and enhancing categorical plots

Open • mwaskom opened this issue 4 years ago • 0 comments

This issue serves to track the work involved in refactoring the categorical module to use the core infrastructure, modernizing the API, features, and implementation of the plotting functions.

Background

The categorical plots have module-specific code for processing variables, semantic mapping, and statistical estimation. That code uses its own format for internal data representation. For 0.12, this module will be refactored to use the objects in seaborn._core and seaborn._statistics.

The plan for the first phase of work is a mostly straight refactor, with minimal API changes and only those new features that emerge from what the core code offers. The main goal is to bring the functions into consistency with the rest of the library. Once that is complete, it will be easier to add new features to these functions.

Additionally, this module predates any support for categorical data in matplotlib, so seaborn has internally managed the mapping from categorical levels to ordinal numeric values. Part of the refactor will be to use matplotlib's unit machinery to handle that mapping.
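As a rough illustration of the matplotlib unit machinery mentioned above (not the planned seaborn implementation), the sketch below passes string categories directly to a matplotlib plotting method and then asks the axis to convert those strings to positions. The category names and bar heights are made up for the example.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

# Passing strings makes matplotlib register a categorical converter on the x axis.
levels = ["low", "mid", "high"]
ax.bar(levels, [3, 7, 5])

# The axis now owns the level -> position mapping (order of first appearance).
positions = [float(p) for p in ax.xaxis.convert_units(levels)]
print(positions)

plt.close(fig)
```

Because the mapping lives on the Axis object rather than inside the plotting function, any later call that draws on the same axes can reuse it.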

Changes to default API/behavior

The API disruption from this work will be substantially less than with the distribution plots refactor in 0.11, because the categorical module was the first real iteration of the seaborn API. Still, a few aspects of the current default behavior are inconsistent with the rest of the library:

  • Currently, a color palette will be applied to levels of the categorical (x or y) variable in the absence of a hue specification. That will no longer be true; it will be necessary to explicitly assign hue to see multiple colors. (This behavior is likely to be changed without a warning, since a warning would be disruptive and annoying. It could, however, be possible to change the default value for hue to some sentinel object and ask the user to set hue=None to accept the new behavior or to set hue to the same variable as the categorical axis to preserve the old behavior.)
  • Treatment of "flat" data (a 1D vector passed to data with no assignment of x or y variables) will change. In 0.12, each position in the vector will become a categorical group (using index information if present), rather than aggregating over the values as if they were a single group (to get that behavior, simply pass the vector to x or y).
  • Functions that dodge by default will be changed to have dodge="auto", which will dodge only when there are multiple levels of the hue variable for each level of the categorical variable. (This behavior will not be deprecated with a warning, because the current default behavior is a source of confusion and probably rarely desired).
  • Currently, assigning hue and specifying color will generate a light or dark (depending on the function) palette. This hack will be removed; to get the same effect, explicitly pass palette="dark:{color}" or palette="light:{color}". (This behavior can be deprecated with a warning).
  • The plots will now follow the default property cycle; i.e. calling an axes-level function multiple times with the same active axes will produce artists with different colors.
  • Numeric hue variables will be handled numerically (currently they are always treated categorically) and get a sequential default color mapping, consistent with the relational and distribution modules.
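The dodge="auto" rule in the list above can be sketched in plain Python. The helper name `needs_dodge` is hypothetical and is not part of seaborn; the sketch also assumes that "multiple levels" means at least one level of the categorical variable is paired with more than one distinct hue level.

```python
from collections import defaultdict


def needs_dodge(categories, hues):
    """Return True if any categorical level has more than one hue level.

    Hypothetical helper illustrating the proposed dodge="auto" rule.
    """
    hue_levels = defaultdict(set)
    for category, hue in zip(categories, hues):
        hue_levels[category].add(hue)
    return any(len(levels) > 1 for levels in hue_levels.values())


# hue varies within each category, so dodging separates the groups
nested = needs_dodge(["a", "a", "b", "b"], ["x", "y", "x", "y"])

# hue duplicates the categorical variable, so dodging would only add gaps
redundant = needs_dodge(["a", "a", "b", "b"], ["a", "a", "b", "b"])

print(nested, redundant)  # True False
```

The second case is the one that makes the current always-dodge default confusing: assigning hue to the same variable as the categorical axis shifts every artist off its tick for no visual benefit.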

New features

  • There will be a parameter (fixed_scale) to disable the forced categorical mapping, meaning that the placement along the "categorical" axis will respect the numeric values of that variable. Fixed scaling will remain the default behavior.
  • It will be possible to specify only one of x or y and add a hue mapping.
  • Additional options for error bars in barplot/pointplot (following #2407)
  • More control over the legend
  • All functions should accept and use label= as part of generally-improved kwarg-passing
  • Better default string-ification of numeric and datetime categories and control over the conversion using formatter.
  • Possibly adding an option that sends the hue variables to either the face or edge color of the artists.
  • Possibly adding additional semantic mapping variables, akin to those in the relational module.
  • Using the matplotlib units system will provide better results in an interactive application (e.g., the cursor coordinates readout will show the categorical value)
  • Mixing functions from the categorical module with other functions in the same Axes will have fewer surprising results, as the functions will use the same ordering rules on the categorical axis
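To make the fixed_scale item in the list above concrete, here is a minimal sketch of the two placement modes for a numeric categorical variable. The function `axis_positions` is hypothetical, and sorting the numeric levels to obtain their ordinal positions is an assumption of this sketch rather than something the issue specifies.

```python
def axis_positions(values, fixed_scale=True):
    """Map values of a numeric "categorical" variable to axis positions.

    Hypothetical helper: with fixed_scale=True each distinct level gets an
    ordinal slot (0, 1, 2, ...); with fixed_scale=False the numeric value
    itself is used as the position.
    """
    if not fixed_scale:
        return [float(v) for v in values]
    # Assumption: numeric levels are ordered by sorting.
    lookup = {level: index for index, level in enumerate(sorted(set(values)))}
    return [lookup[v] for v in values]


doses = [10, 1, 2, 10]
ordinal = axis_positions(doses)                     # evenly spaced slots
native = axis_positions(doses, fixed_scale=False)   # spacing follows the data
print(ordinal, native)
```

With the fixed mapping the levels 1, 2, and 10 sit at equal intervals; with it disabled, the gap between 2 and 10 is preserved on the axis.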

PR tracker

  • [x] stripplot (#2413)
  • [x] swarmplot (#2447, see also #2443)
  • [ ] boxplot
  • [ ] boxenplot
  • [ ] violinplot
  • [ ] pointplot
  • [ ] barplot
  • [ ] catplot

mwaskom • Jan 11 '21 12:01