
boxenplot area scale calculation

Open louridas opened this issue 3 years ago • 24 comments

The area method for calculating the width of boxenplot letter-value boxes is:

'area': lambda h, i, k: (1 - 2**(-k + i - 2)) / h}

in https://github.com/mwaskom/seaborn/blob/master/seaborn/categorical.py#L1890

IIUC, in order for the area to be proportional to the percentage of data covered, as documented (https://github.com/mwaskom/seaborn/blob/master/seaborn/categorical.py#L2672), the formula should rather be:

'area': lambda h, i, k: (1 - 2**(-k + i - 1)) / h}

louridas avatar Aug 29 '20 11:08 louridas

cc @MaozGelbart you've interacted with this code more recently and will be in a better position to say if this is a bug.

mwaskom avatar Aug 29 '20 13:08 mwaskom

Thanks @louridas for reporting, @mwaskom for letting me know.

The tests covering this part of the code do not check against expected results, so it may well be wrong. I didn't touch it in #2086, but I did reduce k (the inferred number of boxes) by 1, so this part may need a change as well.

However, I do notice that scale='area' as described in the docstring differs from its meaning in the R version of lvplot. It is not discussed in the paper describing letter-value plots, so I can only guess that the original implementation (#661) meant to duplicate that. Quoting the R lvplot documentation:

width.method : character, one of ’linear’ (default), ’area’, or ’height’. This parameter determines whether the width of the box for letter value LV(i) should be proportional to i (linear), proportional to $2^{-i}$ (height), or whether the area of the box should be proportional to $2^{-i}$ (area).

While in boxenplot(master):

scale : {“exponential”, “linear”, “area”}, optional Method to use for the width of the letter value boxes. All give similar results visually. “linear” reduces the width by a constant linear factor, “exponential” uses the proportion of data not covered, “area” is proportional to the percentage of data covered.

It seems to me that if we want to keep consistency with the R version, the definition of scale='area' should be similar to scale='exponential' (its description could improve), with the addition that the entire box area is proportional to the data covered. I'd be +1 for such a change.
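The three R width rules quoted above can be written down literally as a small sketch. The function name `lv_width` and its signature are hypothetical, the widths are unnormalized, and this is a transcription of the documentation text only, not lvplot's actual code:

```python
def lv_width(i, height, method="linear"):
    """Unnormalized box width for letter value LV(i), per the quoted R docs.

    `height` is the extent of the box along the data axis (hypothetical
    argument, needed only for the 'area' rule).
    """
    if method == "linear":
        return float(i)            # width proportional to i
    if method == "height":
        return 2.0 ** -i           # width proportional to 2^-i
    if method == "area":
        return 2.0 ** -i / height  # width * height proportional to 2^-i
    raise ValueError(f"unknown width method: {method!r}")


# With the 'area' rule, the box area (width * height) is 2^-i for any height.
area_of_lv3 = lv_width(3, height=2.0, method="area") * 2.0
```

Under this reading, 'area' differs from 'height' only by the division by the box extent, which is what makes the area rather than the width follow $2^{-i}$.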

MaozGelbart avatar Aug 29 '20 18:08 MaozGelbart

@louridas it might be helpful to know how you came across this. Was there a plot that looked obviously wrong?

mwaskom avatar Aug 30 '20 00:08 mwaskom

I also don't really understand this comment in the docstring:

All give similar results visually

Which seems demonstrably not true.

mwaskom avatar Aug 30 '20 00:08 mwaskom

I checked a plot created by seaborn and the same one with R, and the results were visually very different. Then I investigated and I came upon seaborn's definition, which mathematically seemed strange to me.

For instance, if we have only two boxes, I would expect the ratio of the areas of the outer to the inner box to be (7/8) / (3/4) instead of (15/16) / (7/8), hence I raised the issue.
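The fractions above can be reproduced with a few lines. This is only a sketch: it assumes the box index i runs over 0..k-1 with i=0 being the outer box (inferred from the ratios quoted here, not checked against the seaborn source), and it uses the fact that the width is the factor divided by h, so the area (width times h) is the factor itself. The helper names are made up for illustration.

```python
from fractions import Fraction


def current_area(i, k):
    # Factor used in master at the time of this issue: 1 - 2**(-k + i - 2)
    return 1 - Fraction(2) ** (-k + i - 2)


def proposed_area(i, k):
    # Factor proposed in this issue: 1 - 2**(-k + i - 1)
    return 1 - Fraction(2) ** (-k + i - 1)


k = 2  # two boxes
current = [current_area(i, k) for i in range(k)]
proposed = [proposed_area(i, k) for i in range(k)]

print(current)   # [Fraction(15, 16), Fraction(7, 8)]
print(proposed)  # [Fraction(7, 8), Fraction(3, 4)]
```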

About the visual difference between the R output and the seaborn output, here is an example R code:

library(tidyverse)
library(lvplot)
p <- ggplot(mpg, aes(drv, hwy))
p <- p + geom_lv(k=8, aes(fill=stat(LV)), width.method='area') + coord_flip()
p
ggsave('mpg_R.pdf')

and this is what I came up with as an equivalent in seaborn. Note that I could not find how to use a palette per boxenplot, so I had to resort to low-level axes fiddling; perhaps there is a better way.

import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from matplotlib.colors import ListedColormap
sns.set(style="whitegrid")
mpg = sm.datasets.get_rdataset(package='ggplot2', dataname='mpg').data
ax = sns.boxenplot(x='hwy', y='drv', k_depth=8, scale='area', data=mpg)
boxes = ax.collections[1]
cmap = ListedColormap(sns.color_palette('Set2').as_hex())
ax.collections[1].set_cmap(cmap)
ax.collections[3].set_cmap(cmap)
ax.collections[5].set_cmap(cmap)
plt.savefig('mpg.pdf')
plt.show()

I must note that I do not believe the R output is very usable---the widths are too thin. There could be a way to make them wider, but I am not an R expert.

What is interesting, though, is that the areas of the boxes are not symmetrical left and right of the median; it appears that each "half-box" gets its own width.

louridas avatar Aug 30 '20 09:08 louridas

Can you show a screenshot of the R output? I don’t have it installed.

mwaskom avatar Aug 30 '20 13:08 mwaskom

Sure here it is:

mpg_R.pdf

louridas avatar Aug 30 '20 13:08 louridas

For reference:

image

mwaskom avatar Aug 30 '20 15:08 mwaskom

I was able to work out code that calculates the area as defined in R's lvplot: image

The docstring says one thing, but the code's width calculation for scale='area' involves the height. I've used the following area definition for the plot above:

'area': lambda h, i, k: 2**(i - k) / h

One apparent difference is that seaborn defines a box as a single level that starts to the left of the previous box and ends to the right of it, whereas R draws two separate boxes, so their sizes differ (as @louridas noted above). Because of this, the 'area' is larger in seaborn than in R and is expected to result in very thin boxes after a few levels. For boxes that overlap larger boxes at one end (such as the orange box for 'r'), this may look disproportionate (you need to imagine the box continuing behind the large green(?) box to be convinced that its area is half the area of the previous level).

As a side note, mpg is probably not the best dataset to try boxenplot with k_depth=8, as there aren't enough points to support such a depth. See with the boxenplot defaults:

sns.boxenplot(x='hwy', y='drv', data=mpg, showfliers=False, order=['r', 'f', '4'])
sns.swarmplot(x='hwy', y='drv', data=mpg, color='.2', s=4, order=['r', 'f', '4'])

image

MaozGelbart avatar Aug 30 '20 19:08 MaozGelbart

Is there a compelling reason to use area? Could we just deprecate it?

mwaskom avatar Aug 30 '20 19:08 mwaskom

The one reason I can find is that the three methods are mentioned in the original publication:

http://dx.doi.org/10.1080/10618600.2017.1305277

where the area definition is "make the area of each box proportional to the number of points in it" and not what apparently is implemented in R.

The justification for the usefulness of area is given as:

Area-adjusted widths ensure that the overall area is close to one, or more precisely, if k letter values are shown, the overall area will be proportional to 1 − 2**(-(k+1)). This makes an area-adjusted letter-value plot a representation of the variable’s density. Side-by-side versions of these plots show conditional densities, i.e while they do not explicitly contain the number of values (sample size is shown indirectly by the number of letter values chosen), the boxes of corresponding letter values have the same size.
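The figure 1 − 2**(-(k+1)) in the quote has the form of a finite geometric sum. As a hedged reading only (the indexing is an assumption, not something the quote states): if successive boxes carried areas 1/2, 1/4, 1/8, and so on, the total after k+1 terms would be

```latex
\sum_{i=1}^{k+1} 2^{-i} \;=\; 1 - 2^{-(k+1)}
```

which approaches one as k grows, consistent with "the overall area is close to one".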

louridas avatar Aug 30 '20 20:08 louridas

http://dx.doi.org/10.1080/10618600.2017.1305277

We would probably like to update the link referred from boxenplot docstring to this one.

MaozGelbart avatar Aug 30 '20 21:08 MaozGelbart

We would probably like to update the link referred from boxenplot docstring to this one.

I don't have access to this version of the paper.

mwaskom avatar Aug 30 '20 21:08 mwaskom

I don't have access to this version of the paper.

I see that the paper is not publicly available, so I am not sure it is OK to post it right here. I can send you a copy of it if you want.

louridas avatar Aug 31 '20 04:08 louridas

I can likely get it if I activate my institutional VPN; my point was more that we should maintain the link to the freely available preprint rather than a version in a closed-access journal, since many seaborn users aren't academics and wouldn't have a way to obtain access.

mwaskom avatar Aug 31 '20 13:08 mwaskom

One version that I could find freely available is a "post"-print and says nothing about the three methods:

https://vita.had.co.nz/papers/letter-value-plot.pdf

What it does say, however, is:

Boxes with matching heights [widths] correspond to the same depths.

which runs contrary to R's implementation where the same depth can get two different widths, one on the left, one on the right.

louridas avatar Aug 31 '20 14:08 louridas

The rationale for scale="area" is

This makes an area-adjusted letter-value plot a representation of the variable’s density.

but in the seaborn implementation, that seems to be true of scale="exponential":

import numpy as np
import seaborn as sns

x = np.random.standard_t(15, size=50000)
sns.violinplot(x=x)
ax = sns.boxenplot(x=x, scale="exponential", color="r")
ax.collections[-1].set_alpha(.75)

image

mwaskom avatar Aug 31 '20 17:08 mwaskom

To see the difference we must use the definition:

'area': lambda h, i, k: (1 - 2**(-k + i - 1)) / h}

which gives an area proportional to the percentage of observations. The definition proposed by @MaozGelbart:

'area': lambda h, i, k: 2**(i - k) / h

gives boxes with smaller and smaller area as we move outwards. Boxes should become bigger and bigger as we move outwards, if we want their area to correspond to covering more and more of the data.
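The direction of the two definitions can be checked numerically. This is a sketch with made-up helper names; it assumes i=0 is the outermost box and i=k-1 the innermost (as the formulas in this thread suggest), and it looks at the area factor only, i.e. the width multiplied by h.

```python
def covered_area(i, k):
    # Definition proposed in this issue: 1 - 2**(-k + i - 1)
    return 1 - 2.0 ** (-k + i - 1)


def halving_area(i, k):
    # Definition proposed by @MaozGelbart: 2**(i - k)
    return 2.0 ** (i - k)


k = 4
inner_to_outer = range(k - 1, -1, -1)  # i = 3 (innermost) ... 0 (outermost)

covered = [covered_area(i, k) for i in inner_to_outer]
halving = [halving_area(i, k) for i in inner_to_outer]

print(covered)  # [0.75, 0.875, 0.9375, 0.96875] -> grows outwards
print(halving)  # [0.5, 0.25, 0.125, 0.0625]     -> shrinks outwards
```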

Then the code:

import numpy as np
import seaborn as sns

x = np.random.standard_t(15, size=50000)
sns.violinplot(x=x)
ax = sns.boxenplot(x=x, scale="area", color="r")
ax.collections[-1].set_alpha(.75)

will produce:

standard_t

louridas avatar Aug 31 '20 19:08 louridas

I guess I don't see how those boxes represent the density in any interpretable way?

mwaskom avatar Aug 31 '20 19:08 mwaskom

Correct.

They do not represent the density in an interpretable way. The ratio of the areas of two boxes corresponds to the ratio of the data covered by each of the corresponding letter values.

I am not a statistician, so I am not the one to judge how useful this is.

louridas avatar Aug 31 '20 20:08 louridas

But that's what the quote you shared from the paper claims:

Area-adjusted widths ensure that the overall area is close to one, or more precisely, if k letter values are shown, the overall area will be proportional to 1 − 2**(-(k+1)). This makes an area-adjusted letter-value plot a representation of the variable’s density.

So I am confused.

mwaskom avatar Aug 31 '20 20:08 mwaskom

You are right. I got confused, sorry.

I think the best way to see what it is supposed to mean is to check with an exponential distribution. There it becomes more obvious what R tries to achieve by working separately with left and right boxes:

exponential

While this is what we get with "tapering" boxes in seaborn:

import numpy as np
import seaborn as sns

x = np.random.exponential(1, size=10000)
sns.violinplot(x=x)
ax = sns.boxenplot(x=x, scale="area", color="r")
ax.collections[-1].set_alpha(.75)

exponential_seaborn

louridas avatar Sep 01 '20 14:09 louridas

I think this is how the equivalent could be in seaborn:

area_seaborn

I applied some hacks in _lvplot() to get there, in effect creating the two sets of boxes like in R. If you would like, and if it helps, I could put up the code somewhere or make a pull request.

louridas avatar Sep 01 '20 20:09 louridas

Hello again, just to get this off my mind: do you have any thoughts on the issue? Do you need my hack/fix, or are you considering doing something else?

louridas avatar Oct 14 '20 14:10 louridas

Hello, I'm digging up this issue: when comparing "area" scaling to the original paper's representation (Figure 3C, https://doi.org/10.1080/10618600.2017.1305277), it seems that the Seaborn implementation is still incorrect.

Seaborn (random uniform): Screenshot from 2022-11-09 17-13-50

Expected result from the paper: Screenshot from 2022-11-09 17-15-09

For me, the "area" scaling is the "correct" way of doing boxenplots, as it is directly representative of the underlying PDF of the studied variable while still allowing easy reading of quantiles and of differences between multiple categories. It is a bit of a hybrid between a histogram and a boxplot.
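One way to make the "area follows the PDF" reading concrete is the sketch below. It is not seaborn's or lvplot's code: the function `half_box_widths` is hypothetical, it treats the lower and upper half-boxes separately (as the R plots earlier in the thread appear to do), and it sets each width so that the area of a half-box equals the share of points it contains. For uniform data, every width should then come out close to 1, i.e. a flat density.

```python
import numpy as np


def half_box_widths(x, depth):
    """Widths of lower/upper half-boxes such that area == share of points.

    Hypothetical helper; assumes no ties between successive letter values
    (a zero-length box would divide by zero).
    """
    x = np.asarray(x, dtype=float)
    tails = 0.5 ** np.arange(1, depth + 2)   # 1/2, 1/4, 1/8, ...
    lower = np.quantile(x, tails)            # median, lower fourth, eighth, ...
    upper = np.quantile(x, 1 - tails)        # median, upper fourth, eighth, ...
    mass = tails[:-1] - tails[1:]            # share of points in each half-box
    lower_w = mass / (lower[:-1] - lower[1:])
    upper_w = mass / (upper[1:] - upper[:-1])
    return lower_w, upper_w


rng = np.random.default_rng(0)
x = rng.uniform(size=200_000)
lower_w, upper_w = half_box_widths(x, depth=5)
print(np.round(lower_w, 2), np.round(upper_w, 2))  # all roughly 1 for uniform data
```

Each width here is just mass divided by extent, which is the same quantity a histogram bar height estimates; the bins are simply placed at the letter values instead of on a regular grid.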

pierreDELANGEN avatar Nov 09 '22 16:11 pierreDELANGEN