Palette does not support the use of defaultdict with missing values

Open ehermes opened this issue 1 year ago • 4 comments

Currently, Seaborn does not permit the use of defaultdict with missing values as a palette. A minimal example that reproduces this issue is:

import seaborn as sns
import pandas as pd
from collections import defaultdict

data = pd.DataFrame({
    "values": [1, 2, 3],
    "hues": ["foo", "bar", "baz"],
})

palette = defaultdict(lambda: "#000000", {
    "foo": "#ff0000",
    "bar": "#00ff00",
})

sns.histplot(
    x="values",
    data=data,
    hue="hues",
    palette=palette,
)

My expectation is that this should use the default value of #000000 for baz, which is missing from the palette. Instead, this raises an exception:

Traceback (most recent call last):
  File "/home/ehermes/test/seaborn_defaultdict.py", line 15, in <module>
    sns.histplot(
  File "/home/ehermes/venvs/seaborn/lib/python3.10/site-packages/seaborn/distributions.py", line 1384, in histplot
    p.map_hue(palette=palette, order=hue_order, norm=hue_norm)
  File "/home/ehermes/venvs/seaborn/lib/python3.10/site-packages/seaborn/_base.py", line 838, in map_hue
    mapping = HueMapping(self, palette, order, norm, saturation)
  File "/home/ehermes/venvs/seaborn/lib/python3.10/site-packages/seaborn/_base.py", line 150, in __init__
    levels, lookup_table = self.categorical_mapping(
  File "/home/ehermes/venvs/seaborn/lib/python3.10/site-packages/seaborn/_base.py", line 234, in categorical_mapping
    raise ValueError(err.format(missing))
ValueError: The palette dictionary is missing keys: {'baz'}

For this test, I have used seaborn-0.13.2 and matplotlib-3.8.2.
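The mismatch can be shown without seaborn. This is a minimal sketch, assuming the check compares the hue levels against the keys already stored in the dictionary, as the error message suggests. The `find_missing` helper is hypothetical and is not seaborn's code.

```python
from collections import defaultdict

palette = defaultdict(lambda: "#000000", {"foo": "#ff0000", "bar": "#00ff00"})
levels = ["foo", "bar", "baz"]


def find_missing(palette, levels):
    """Hypothetical stand-in for a key-based check; not seaborn's actual code."""
    return set(levels) - set(palette)


# Iteration and membership tests only see keys that were explicitly stored,
# so the default factory is never consulted here.
missing = find_missing(palette, levels)
print(missing)  # {'baz'}

# Item access, by contrast, falls back to the default factory.
fallback = palette["baz"]
print(fallback)  # #000000
```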

I have a fix for this problem in a personal branch (https://github.com/ehermes/seaborn/tree/palette_defaultdict), but per your contribution guidelines, I have opened a bug report first. With permission, I can also create a PR for my fix.

ehermes, Feb 07 '24 15:02

defaultdict is a nice pythonic solution here, but the type signature for palette is already quite complicated and I'm fairly averse to expanding it further. I'm also not convinced that setting up the defaultdict is that much more convenient than defining a full dict palette based on the data, e.g. something like

palette = {
    # Default every level in the data to black, then override the known ones
    **{x: "k" for x in data["hues"].unique()},
    "foo": "#ff0000",
    "bar": "#00ff00",
}

Is the same LoC and avoids an import.

mwaskom, Feb 10 '24 16:02

This is a good solution if you have the data that you will be plotting when you are first creating the palette. In our application, the palette is "statically" defined in a library, and the data we plot is generated at runtime. Sometimes the data contains entries that we did not expect to be present at the time we wrote the library, so we need to have a backup value present. My current workaround to this issue is to essentially do what you're suggesting, but I have to do it in every single function that creates a seaborn plot, which is a lot of redundant code. We could possibly simplify things through a code re-org, but my preference would be for seaborn to use the defaultdict that we have chosen for this exact reason in the expected manner.
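One way to avoid repeating that workaround in every plotting function is to centralize it. The sketch below assumes a hypothetical `complete_palette` helper and a hypothetical `STATIC_PALETTE` constant. Neither is part of seaborn or of the library described above.

```python
def complete_palette(palette, levels, default="#000000"):
    """Return a plain dict with a color for every level.

    Levels already in `palette` keep their color; all others get `default`.
    Hypothetical helper for illustration only.
    """
    return {**{level: default for level in levels}, **palette}


# Hypothetical statically defined palette, as it might live in a library.
STATIC_PALETTE = {"foo": "#ff0000", "bar": "#00ff00"}

full = complete_palette(STATIC_PALETTE, ["foo", "bar", "baz"])
print(full)  # {'foo': '#ff0000', 'bar': '#00ff00', 'baz': '#000000'}

# At plot time the result could be passed as, for example:
# sns.histplot(x="values", data=data, hue="hues",
#              palette=complete_palette(STATIC_PALETTE, data["hues"].unique()))
```

This still requires the data at the call site, so it reduces the duplication rather than removing the need for it.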

ehermes, Feb 10 '24 16:02

Why say “in this expected manner”? Defaultdict is not a subtype of dict and seaborn’s docs don’t suggest that it will be accepted.

mwaskom, Feb 10 '24 17:02

Strictly speaking, defaultdict is a subtype of dict:

In [1]: from collections import defaultdict

In [2]: palette = defaultdict(lambda: "#000000", {
   ...:     "foo": "#ff0000",
   ...:     "bar": "#00ff00",
   ...: })

In [3]: isinstance(palette, dict)
Out[3]: True

When I say "in the expected manner", I mean from the "duck typing" perspective: a defaultdict behaves like a dict, and thus should be suitable for any application in which a dict is accepted. The only reason we cannot use a defaultdict as the palette for seaborn is an extra check that every level has a corresponding key in it, which may not be true for non-primitive dict-likes. Actually, this brings to mind an alternative possible solution, which doesn't specifically require reference to defaultdict:

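if isinstance(palette, dict):
    missing = set()
    for level in levels:
        try:
            # Item access lets dict subclasses supply a default via __missing__
            palette[level]
        except KeyError:
            missing.add(level)
    # Test the set itself: any(missing) would skip falsy levels such as 0 or ""
    if missing:
        err = "The palette dictionary is missing keys: {}"
        raise ValueError(err.format(missing))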

Edit: Removed non-functional alternate suggestions (apparently defaultdict.get doesn't behave the way I thought it did)
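For reference, the difference between `get` and item access on a defaultdict can be checked directly with the standard library:

```python
from collections import defaultdict

palette = defaultdict(lambda: "#000000", {"foo": "#ff0000"})

# .get() never calls the default factory; it returns None for a missing key.
via_get = palette.get("baz")
stored_after_get = "baz" in palette
print(via_get, stored_after_get)  # None False

# Subscripting calls __missing__, which stores and returns the default.
via_item = palette["baz"]
stored_after_item = "baz" in palette
print(via_item, stored_after_item)  # #000000 True
```

Note the side effect: a check based on item access adds every missing level to the defaultdict as it goes.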

In any case, my point is that the current check is preventing us from using something as the palette which we would otherwise be able to use, and which we currently do use for our other non-matplotlib plots (namely plotly). The changes I have suggested here would add more flexibility to the code without impacting the functionality of the missing key check when users are passing a standard dict.

ehermes, Feb 10 '24 18:02