seaborn
boxenplot area scale calculation
The 'area' method for calculating the width of boxenplot letter-value boxes is:
'area': lambda h, i, k: (1 - 2**(-k + i - 2)) / h}
in https://github.com/mwaskom/seaborn/blob/master/seaborn/categorical.py#L1890
IIUC, in order for the area to be proportional to the percentage of data covered, as documented (https://github.com/mwaskom/seaborn/blob/master/seaborn/categorical.py#L2672), the formula should rather be:
'area': lambda h, i, k: (1 - 2**(-k + i - 1)) / h}
cc @MaozGelbart you've interacted with this code more recently and will be in a better position to say if this is a bug.
Thanks @louridas for reporting, @mwaskom for letting me know.
The tests covering this part of the code do not test against expected results, so it's possible that it is wrong. I didn't touch it in #2086, but I did reduce k (the inferred number of boxes) by 1, so it may be that this part requires a change as well.
However, I do notice that scale='area' as described in the docstring differs from its meaning in the R version of lvplot. It is not discussed in the paper describing letter-value plots, so I can only guess that the original implementation (#661) meant to duplicate that. Quoting the R lvplot documentation:
width.method : character, one of ’linear’ (default), ’area’, or ’height’. This parameter determines whether the width of the box for letter value LV(i) should be proportional to i (linear), proportional to $2^{-i}$ (height), or whether the area of the box should be proportional to $2^{-i}$ (area).
While in boxenplot (master):
scale : {“exponential”, “linear”, “area”}, optional Method to use for the width of the letter value boxes. All give similar results visually. “linear” reduces the width by a constant linear factor, “exponential” uses the proportion of data not covered, “area” is proportional to the percentage of data covered.
It seems to me that if we want to keep consistency with the R version, the definition of scale='area' should be similar to scale='exponential' (its description could improve), with the addition that the entire box area is proportional to the data covered. I'd be +1 for such a change.
@louridas it might be helpful to know how you came across this. Was there a plot that looked obviously wrong?
I also don't really understand this comment in the docstring:
All give similar results visually
Which seems demonstrably not true.
I checked a plot created by seaborn and the same one with R, and the results were visually very different. Then I investigated and I came upon seaborn's definition, which mathematically seemed strange to me.
For instance, if we have only two boxes, I would expect the ratio of the areas of the outer to the inner box to be (7/8) / (3/4) instead of (15/16) / (7/8), hence I raised the issue.
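The two-box arithmetic above can be checked with a short sketch. It assumes that i = 0 indexes the outermost box (which is what the quoted ratios imply) and it compares only the numerators of the two formulas, since area = width * h cancels the division by h. The function names are made up for illustration and are not part of seaborn.

```python
from fractions import Fraction


def area_current(i, k):
    # Numerator of the formula currently in categorical.py (area = width * h).
    return 1 - Fraction(2) ** (-k + i - 2)


def area_proposed(i, k):
    # Numerator of the formula proposed in this issue.
    return 1 - Fraction(2) ** (-k + i - 1)


k = 2  # two boxes; i = 0 is assumed to be the outer one
current = [area_current(i, k) for i in range(k)]
proposed = [area_proposed(i, k) for i in range(k)]

print(current)   # [Fraction(15, 16), Fraction(7, 8)]
print(proposed)  # [Fraction(7, 8), Fraction(3, 4)]

# Ratio of the outer box area to the inner box area under each formula.
print(current[0] / current[1])    # 15/14
print(proposed[0] / proposed[1])  # 7/6
```

Using exact fractions avoids any floating-point noise when comparing the two ratios.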
About the visual difference between the R output and the seaborn output, here is an example R code:
library(tidyverse)
library(lvplot)
p <- ggplot(mpg, aes(drv, hwy))
p <- p + geom_lv(k=8, aes(fill=stat(LV)), width.method='area') + coord_flip()
p
ggsave('mpg_R.pdf')
and this is what I came up with as the equivalent in seaborn. Note that I could not find how to use a palette per boxenplot, so I had to resort to low-level axes fiddling; perhaps there is a better way.
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from matplotlib.colors import ListedColormap
sns.set(style="whitegrid")
mpg = sm.datasets.get_rdataset(package='ggplot2', dataname='mpg').data
ax = sns.boxenplot(x='hwy', y='drv', k_depth=8, scale='area', data=mpg)
cmap = ListedColormap(sns.color_palette('Set2').as_hex())
# Apply the colormap to the box collections at indices 1, 3 and 5
for idx in (1, 3, 5):
    ax.collections[idx].set_cmap(cmap)
plt.savefig('mpg.pdf')
plt.show()
I must note that I do not believe the R output is very usable: the widths are too thin. There could be a way to make them wider, but I am not an R expert.
What is interesting, though, is that the areas of the boxes are not symmetrical left and right of the median; it appears that each "half-box" gets its own width.
Can you show a screenshot of the R output? I don’t have it installed.
For reference:
I can work out code that calculates the area as defined in R's lvplot:
The docstring says one thing, but the code itself has a width calculation method (for scale='area') that involves height. I've used the following area definition for the plot above:
'area': lambda h, i, k: 2**(i - k) / h
One apparent difference is that seaborn defines a box as one level starting to the left of the previous box and ending to the right of it, whereas R draws two different boxes, so their sizes differ (as @louridas noted above). Due to this, the 'area' is larger in seaborn than in R and is expected to result in very thin boxes after a few levels. For example, for boxes that overlap at one end with larger boxes (such as the orange 'r' box), this may look disproportionate (you need to imagine the box continuing behind the large green(?) box to be convinced that the area is half the area of the previous level).
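As a rough check of the "half the area of the previous level" behaviour, here is a minimal sketch of the definition above. The heights are made-up numbers, i = 0 is assumed to be the outermost box, and width_r_style is a hypothetical name, not a seaborn function.

```python
def width_r_style(h, i, k):
    # Same expression as the 'area' lambda above: area = width * h = 2**(i - k).
    return 2 ** (i - k) / h


k = 4
heights = [8.0, 4.0, 2.0, 1.0]  # hypothetical box heights, outermost first

widths = [width_r_style(h, i, k) for i, h in enumerate(heights)]
areas = [w * h for w, h in zip(widths, heights)]

# Moving one level outward (from i to i - 1) halves the box area.
ratios = [areas[i - 1] / areas[i] for i in range(1, k)]
print(areas)
print(ratios)
```

With this definition the innermost box always gets area 1/2 and the outermost gets 2**(-k), regardless of the heights.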
As a side note, mpg is probably not the best dataset to try boxenplot with k_depth=8, as there aren't enough points to describe using such depth. See with boxenplot defaults:
sns.boxenplot(x='hwy', y='drv', data=mpg, showfliers=False, order=['r', 'f', '4'])
sns.swarmplot(x='hwy', y='drv', data=mpg, color='.2', s=4, order=['r', 'f', '4'])
Is there a compelling reason to use area? Could we just deprecate it?
The one reason I can find is that the three methods are mentioned in the original publication:
http://dx.doi.org/10.1080/10618600.2017.1305277
where the area definition is "make the area of each box proportion to the number of points in it" and not what apparently is implemented in R.
The justification for the usefulness of area is given as:
Area-adjusted widths ensure that the overall area is close to one, or more precisely, if k letter values are shown, the overall area will be proportional to 1 − 2**(-(k+1)). This makes an area-adjusted letter-value plot a representation of the variable’s density. Side-by-side versions of these plots show conditional densities, i.e while they do not explicitly contain the number of values (sample size is shown indirectly by the number of letter values chosen), the boxes of corresponding letter values have the same size.
http://dx.doi.org/10.1080/10618600.2017.1305277
We would probably like to update the link referred from boxenplot docstring to this one.
I don't have access to this version of the paper.
I see that the paper is not publicly available, so I am not sure it is OK to post it right here. I can send you a copy of it if you want.
I can likely get it if I activate my institutional VPN. My point was more that we should maintain the link to the freely-available preprint rather than a version in a closed access journal, since many seaborn users aren't academics and wouldn't have a way to obtain access.
One version that I could find freely available is a "post"-print and says nothing about the three methods:
https://vita.had.co.nz/papers/letter-value-plot.pdf
What it does say, however, is:
Boxes with matching heights [widths] correspond to the same depths.
which runs contrary to R's implementation where the same depth can get two different widths, one on the left, one on the right.
The rationale for scale="area" is
This makes an area-adjusted letter-value plot a representation of the variable’s density.
but in the seaborn implementation, that seems to be true of scale="exponential":
import numpy as np
import seaborn as sns
x = np.random.standard_t(15, size=50000)
sns.violinplot(x=x)
ax = sns.boxenplot(x=x, scale="exponential", color="r")
ax.collections[-1].set_alpha(.75)
To see the difference we must use the definition:
'area': lambda h, i, k: (1 - 2**(-k + i - 1)) / h}
which gives an area proportional to the percentage of observations. The definition proposed by @MaozGelbart:
'area': lambda h, i, k: 2**(i - k) / h
gives boxes with smaller and smaller area as we move outwards. Boxes should become bigger and bigger as we move outwards, if we want their area to correspond to covering more and more of the data.
Then the code:
x = np.random.standard_t(15, size=50000)
sns.violinplot(x=x)
ax = sns.boxenplot(x=x, scale="area", color="r")
ax.collections[-1].set_alpha(.75)
will produce:
I guess I don't see how those boxes represent the density in any interpretable way?
Correct.
They do not represent the density in an interpretable way. The ratio of the areas of two boxes corresponds to the ratio of the data covered by each of the corresponding letter values.
I am not a statistician, so I am not the one to judge how useful this is.
But that's what the quote you shared from the paper claims:
Area-adjusted widths ensure that the overall area is close to one, or more precisely, if k letter values are shown, the overall area will be proportional to 1 − 2**(-(k+1)). This makes an area-adjusted letter-value plot a representation of the variable’s density.
So I am confused.
You are right. I got confused, sorry.
I think the best way to see the difference of what it is supposed to mean is to check with an exponential distribution. There it becomes more obvious what R tries to achieve by working separately with left and right boxes:
While this is what we get with "tapering" boxes in seaborn:
import numpy as np
import seaborn as sns
x = np.random.exponential(1, size=10000)
sns.violinplot(x=x)
ax = sns.boxenplot(x=x, scale="area", color="r")
ax.collections[-1].set_alpha(.75)
I think this is how the equivalent could be in seaborn:
I applied some hacks in _lvplot() to get there, in effect creating two sets of boxes like in R. If you would like, and if it helps, I could put up the code somewhere or make a pull request.
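To make the left/right asymmetry concrete, here is a small sketch that uses the theoretical quantiles of the exponential distribution instead of a sample. It is not the hack mentioned above and not R's lvplot code. It assumes that each half-box spans two consecutive letter values on one side of the median and that both half-boxes at the same depth get the same area, so a shorter half-box has to be wider. All names are hypothetical.

```python
import math


def exp_quantile(p, scale=1.0):
    # Quantile function of the exponential distribution: Q(p) = -scale * ln(1 - p).
    return -scale * math.log(1.0 - p)


def half_boxes(k):
    """Return (depth, lower_extent, upper_extent, lower_width, upper_width), innermost first."""
    rows = []
    for depth in range(1, k + 1):
        p_in = 2.0 ** -depth         # tail probability of the inner letter value (1/2 = median)
        p_out = 2.0 ** -(depth + 1)  # tail probability of the outer letter value
        lower = exp_quantile(p_in) - exp_quantile(p_out)
        upper = exp_quantile(1 - p_out) - exp_quantile(1 - p_in)
        area = p_out                 # assumed: half-box area equals the fraction of data it holds
        rows.append((depth, lower, upper, area / lower, area / upper))
    return rows


for depth, lower, upper, w_lower, w_upper in half_boxes(4):
    print(f"depth {depth}: lower extent {lower:.3f}, upper extent {upper:.3f}, "
          f"lower width {w_lower:.3f}, upper width {w_upper:.3f}")
```

For the exponential distribution every upper half-box has the same extent, ln 2, while the lower half-boxes shrink toward zero, so under the equal-area assumption the lower half-boxes come out wider than the upper ones at the same depth.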
Hello again, just to get this off my mind: do you have any thoughts on the issue? Do you need my hack/fix, or are you considering doing something else?
Hello,
I'm digging up this issue, but when comparing "area" scaling to the original paper's representation (Figure 3C, https://doi.org/10.1080/10618600.2017.1305277), it seems that the Seaborn implementation is still incorrect.
Seaborn (random uniform):
Expected result from the paper:
For me the "area" scaling is the "correct" way of doing boxenplots, as it is directly representative of the underlying PDF of the studied variable while still allowing easy reading of quantiles and differences between multiple categories. It is a bit of a hybrid between a histogram and a boxplot.