ggplot2
ggplot2 copied to clipboard
Allow functions in `breaks` argument to `stat_bin()`
Allowing breaks in stat_bin() to accept functions would be useful in the context of grouping structures.
The changes would also enable facet histograms for estimators with variable bin width, e.g, equal area of all blocks, or the "essential histogram" (doi:10.1093/biomet/asz081). The same effect (with facet_*) is of course already possible using geom_rect().
The required code additions would be minimal and follow the same idea as #1890
A simple illustration how this could work is given below:
df <- data.frame(
x = c(rexp(1000), rexp(1000, 5)),
study = c(rep("A", 1000), rep("B", 1000))
)
ggplot(df, aes(x)) +
geom_histogram(aes(y = ..density..), breaks = function(x) qexp(c(0, .25, .5, .75, .95, .99), 1/mean(x))) +
facet_wrap(vars(study))

A proposed implementation was submitted here: #4560
@aijordan Since you already made a PR there's no need to close it unless you feel it has a problem or discussion/review arrives at the conclusion that it's not workable. But feature requests should always be discussed in an issue first, as the request may be reasonable even if the particular implementation doesn't pass review.
This feature request might also be a nice opportunity to allow rlang lambda notation for not only the breaks argument, but allow it for the binwidth argument too.
@teunbrand Could you provide an example of what this would do and how it would work?
Sure; it won't work for breaks because that is the proposed feature I don't have on my machine, but here are some examples of how I imagine what it would do for the binwidth argument. It's really just a syntactic nicety.
This already allowed (the function itself isn't very exciting):
library(ggplot2)
df <- data.frame(
x = c(rexp(1000), rexp(1000, 5)),
study = c(rep("A", 1000), rep("B", 1000))
)
ggplot(df, aes(x)) +
geom_histogram(binwidth = function(x) {diff(range(x)/20)})

It is currently not allowed to write the function to the binwidth argument as a rlang lambda function. In other places in ggplot2, you can use rlang lambda functions, such as scales or the data argument in layers.
ggplot(df, aes(x)) +
geom_histogram(binwidth = ~ diff(range(.x))/20)
#> Warning: Computation failed in `stat_bin()`:
#> `width` must be a numeric scalar
ggplot2 already has an internal function to convert formulas to functions when appropriate, introduced in v3.3.4.
ggplot(df, aes(x)) +
geom_histogram(binwidth = ggplot2:::allow_lambda(~ diff(range(.x))/20))

Created on 2021-07-22 by the reprex package (v1.0.0)
I'd propose to apply the allow_lambda() function somewhere in the stat instead of it being user supplied, which needn't be very complicated. For example, in the constructor of stat_bin() or in a ggproto method, you might add lines like the following :
binwidth <- allow_lambda(binwidth)
breaks <- allow_lambda(breaks)
Thanks, if there is interest in this, I could reopen PR #4560 with included rlang lambda function notation support for both breaks and binwidth.
I feel I should point out that function input for breaks (and bindwidth already) does not play nice with the default position = "stack", but since a warning is issued that is probably fine!?
library(ggplot2)
df <- data.frame(
x = c(rexp(1000), rexp(1000, 5)),
study = c(rep("A", 1000), rep("B", 1000))
)
ggplot(df, aes(x)) +
geom_histogram(
mapping = aes(fill = study),
binwidth = function(x) 2 * IQR(x) / (length(x)^(1/3)),
boundary = 0)
#> Warning: position_stack requires non-overlapping x intervals

Another issue with unequal bin widths arises when using geom_freqpoly, where a wide bin at the tail can add excessive padding. This already happens currently when specifying breaks, and therefore probably belongs in its own issue if at all.
ggplot(df, aes(x)) + geom_freqpoly(aes(y = ..density..), bins = 20)

ggplot(df, aes(x)) +
geom_freqpoly(
mapping = aes(y = ..density..),
breaks = c(0, 0.1, 0.3, 0.6, 1, 1.5, 3, max(df$x)))

It seems that padding is not optional for geom_freqpoly and users can work around this case by using stat_bin(geom = GeomPath) with custom padding when specifying breaks.
ggplot(df, aes(x)) +
stat_bin(
mapping = aes(y = ..density..),
geom = GeomPath,
breaks = c(-0.01, 0, 0.1, 0.3, 0.6, 1, 1.5, 3, max(df$x), max(df$x) + 0.01))

Created on 2021-07-23 by the reprex package (v2.0.0)
@aijordan a PR with those changes would be great thanks! Do you also want to look into the freqpoly padding issue?
Seems to me that a quick "solution" would be to have geom_freqpoly respect a pad argument in ... (the documentation states that these get passed on to the respective layer)
That is, change the following if-condition in geom_freqpoly, where padding is enforced, to identical(stat, "bin") && !exists("pad", params)
params <- list2(na.rm = na.rm, ...)
if (identical(stat, "bin")) {
params$pad <- TRUE
}
Then, custom padding via the breaks argument would be possible, as above.
I guess anything more than this quick fix probably requires changes to bin_vector (in /R/bin.R).
A small change would reduce the perceived padding in general: The function bin_out() in /R/bin.R returns a data frame with variables xmin and xmax. As far as I can tell, these are never used explicitly (GeomBar calculates these from the width variable).
However, they seem to be used implicitly somewhere to determine the range of x values that are (invisibly) displayed. GeomPath and GeomPoint only use the x variable (the bin midpoints) that is present in the data frame to display data, so we get unexpectedly large padding to the left and right. For a small width of the left- and rightmost bins, this issue is almost imperceivable. Is it the intended behavior that the entire value range of every bin should always be displayed, even when the geom does not use the full range for display?
Here are two of the previous examples, when xmin and xmax are removed from the output of bin_out() (Unfortunately, I did not set a seed before, so the actual data is different):
library(ggplot2)
set.seed(20220617)
df <- data.frame(
x = c(rexp(1000), rexp(1000, 5)),
study = c(rep("A", 1000), rep("B", 1000))
)
ggplot(df, aes(x)) + geom_freqpoly(aes(y = ..density..), bins = 20)

ggplot(df, aes(x)) +
geom_freqpoly(
mapping = aes(y = ..density..),
breaks = c(0, 0.1, 0.3, 0.6, 1, 1.5, 3, max(df$x)))

Created on 2022-06-17 by the reprex package (v2.0.1)
@clauswilke Do you prefer a single or two separate pull requests, or should the padding be discussed in a new issue anyway? There is the additional question, whether geom_freqpoly() and geom_histogram() should pass on the pad argument to stat_bin() instead of overwriting/ignoring it. Also, it seems that I wasn't quick enough and cannot revive the old PR since its target was the "master" branch, which has since been deprecated.