ggplot2
ggplot2 copied to clipboard
Mismatched fill colour with geom_dotplot
When using geom_dotplot
and specifying a fill
aesthetic, we get a different output depending on whether fill
is specified inside the function aes
or as a separate vector. In the example below, I would expect all points below x~12 to be the same colour in both graphs.
library(tidyverse)
library(patchwork)
# Use 'dose' inside aes
gg1 <- ggplot(ToothGrowth, aes(x = len)) +
geom_dotplot(aes(fill = factor(dose)),
binwidth = .5) +
guides(fill = "none") # Remove legend
# Create separate vector for fill
dot_cols <- as.integer(factor(ToothGrowth$dose))
gg2 <- ggplot(ToothGrowth, aes(x = len)) +
geom_dotplot(fill = dot_cols,
binwidth = .5)
gg1 + gg2
Created on 2022-04-28 by the reprex package (v2.0.1)
As we can see, on the right, there are about 5 points between x=10 and x=12 that should be black but are actually red.
As far as I can tell, this could be a consequence of the data being sorted with respect to x, which means that the rows of the dataset no longer match the elements of the vector dot_cols
. As a proof of concept, I recreated the second graph but I sorted the data first (I also used the group
aesthetic to try to get as close as possible to the original graph):
library(tidyverse)
# Repeat last graph but reorder data first
ToothGrowth2 <- arrange(ToothGrowth, len)
dot_cols2 <- as.integer(factor(ToothGrowth2$dose))
ggplot(ToothGrowth, aes(x = len, group = factor(dose))) +
geom_dotplot(fill = dot_cols2,
binwidth = .5)
Created on 2022-04-28 by the reprex package (v2.0.1)
Now all points below x~12 are black, as expected.
Just realized I forgot to add my session info:
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.2.0 (2022-04-22)
#> os macOS Monterey 12.3.1
#> system aarch64, darwin20
#> ui X11
#> language (EN)
#> collate en_CA.UTF-8
#> ctype en_CA.UTF-8
#> tz America/Edmonton
#> date 2022-04-28
#> pandoc 2.17.1.1 @ /Applications/RStudio.app/Contents/MacOS/quarto/bin/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.2.0)
#> backports 1.4.1 2021-12-13 [1] CRAN (R 4.2.0)
#> broom 0.8.0 2022-04-13 [1] CRAN (R 4.2.0)
#> cellranger 1.1.0 2016-07-27 [1] CRAN (R 4.2.0)
#> cli 3.2.0 2022-02-14 [1] CRAN (R 4.2.0)
#> colorspace 2.0-3 2022-02-21 [1] CRAN (R 4.2.0)
#> crayon 1.5.1 2022-03-26 [1] CRAN (R 4.2.0)
#> DBI 1.1.2 2021-12-20 [1] CRAN (R 4.2.0)
#> dbplyr 2.1.1 2021-04-06 [1] CRAN (R 4.2.0)
#> digest 0.6.29 2021-12-01 [1] CRAN (R 4.2.0)
#> dplyr * 1.0.8 2022-02-08 [1] CRAN (R 4.2.0)
#> ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.2.0)
#> evaluate 0.15 2022-02-18 [1] CRAN (R 4.2.0)
#> fansi 1.0.3 2022-03-24 [1] CRAN (R 4.2.0)
#> fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.2.0)
#> forcats * 0.5.1 2021-01-27 [1] CRAN (R 4.2.0)
#> fs 1.5.2 2021-12-08 [1] CRAN (R 4.2.0)
#> generics 0.1.2 2022-01-31 [1] CRAN (R 4.2.0)
#> ggplot2 * 3.3.5 2021-06-25 [1] CRAN (R 4.2.0)
#> glue 1.6.2 2022-02-24 [1] CRAN (R 4.2.0)
#> gtable 0.3.0 2019-03-25 [1] CRAN (R 4.2.0)
#> haven 2.5.0 2022-04-15 [1] CRAN (R 4.2.0)
#> highr 0.9 2021-04-16 [1] CRAN (R 4.2.0)
#> hms 1.1.1 2021-09-26 [1] CRAN (R 4.2.0)
#> htmltools 0.5.2 2021-08-25 [1] CRAN (R 4.2.0)
#> httr 1.4.2 2020-07-20 [1] CRAN (R 4.2.0)
#> jsonlite 1.8.0 2022-02-22 [1] CRAN (R 4.2.0)
#> knitr 1.38 2022-03-25 [1] CRAN (R 4.2.0)
#> lifecycle 1.0.1 2021-09-24 [1] CRAN (R 4.2.0)
#> lubridate 1.8.0 2021-10-07 [1] CRAN (R 4.2.0)
#> magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.2.0)
#> modelr 0.1.8 2020-05-19 [1] CRAN (R 4.2.0)
#> munsell 0.5.0 2018-06-12 [1] CRAN (R 4.2.0)
#> patchwork * 1.1.1 2020-12-17 [1] CRAN (R 4.2.0)
#> pillar 1.7.0 2022-02-01 [1] CRAN (R 4.2.0)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.2.0)
#> purrr * 0.3.4 2020-04-17 [1] CRAN (R 4.2.0)
#> R.cache 0.15.0 2021-04-30 [1] CRAN (R 4.2.0)
#> R.methodsS3 1.8.1 2020-08-26 [1] CRAN (R 4.2.0)
#> R.oo 1.24.0 2020-08-26 [1] CRAN (R 4.2.0)
#> R.utils 2.11.0 2021-09-26 [1] CRAN (R 4.2.0)
#> R6 2.5.1 2021-08-19 [1] CRAN (R 4.2.0)
#> readr * 2.1.2 2022-01-30 [1] CRAN (R 4.2.0)
#> readxl 1.4.0 2022-03-28 [1] CRAN (R 4.2.0)
#> reprex 2.0.1 2021-08-05 [1] CRAN (R 4.2.0)
#> rlang 1.0.2 2022-03-04 [1] CRAN (R 4.2.0)
#> rmarkdown 2.13 2022-03-10 [1] CRAN (R 4.2.0)
#> rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.2.0)
#> rvest 1.0.2 2021-10-16 [1] CRAN (R 4.2.0)
#> scales 1.2.0 2022-04-13 [1] CRAN (R 4.2.0)
#> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.2.0)
#> stringi 1.7.6 2021-11-29 [1] CRAN (R 4.2.0)
#> stringr * 1.4.0 2019-02-10 [1] CRAN (R 4.2.0)
#> styler 1.7.0 2022-03-13 [1] CRAN (R 4.2.0)
#> tibble * 3.1.6 2021-11-07 [1] CRAN (R 4.2.0)
#> tidyr * 1.2.0 2022-02-01 [1] CRAN (R 4.2.0)
#> tidyselect 1.1.2 2022-02-21 [1] CRAN (R 4.2.0)
#> tidyverse * 1.3.1 2021-04-15 [1] CRAN (R 4.2.0)
#> tzdb 0.3.0 2022-03-28 [1] CRAN (R 4.2.0)
#> utf8 1.2.2 2021-07-24 [1] CRAN (R 4.2.0)
#> vctrs 0.4.1 2022-04-13 [1] CRAN (R 4.2.0)
#> withr 2.5.0 2022-03-03 [1] CRAN (R 4.2.0)
#> xfun 0.30 2022-03-02 [1] CRAN (R 4.2.0)
#> xml2 1.3.3 2021-11-30 [1] CRAN (R 4.2.0)
#> yaml 2.3.5 2022-02-21 [1] CRAN (R 4.2.0)
#>
#> [1] /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
Created on 2022-04-28 by the reprex package (v2.0.1)
When you specify a column inside aes()
, the values are scaled and mapped to a color map, and when outside they aren't.
This sort of question is a better fit for RStudio Community. Please ask there with a minimal reproducible example. We use this issue tracker for bug reports and feature requests.
@yutannihilation I'm sorry, I think you misunderstood the point of my post. So let me be more explicit: if I pass a vector to fill
where the n-th entry is "black"
(or 1
), I expect the dot corresponding to the n-th observation to be black. But as my reprex shows, this is not the case.
In other words, I'm reporting what I believe to be a bug in ggplot2
.
Ah, sorry I missed your point. I think you are right, but I believe passing multiple values outside aes()
is not a recommended usage anyway.
As far as I can tell, this could be a consequence of the data being sorted with respect to x, which means that the rows of the dataset no longer match the elements of the vector
dot_cols
.
I'm reopening this because there might be some room for better warnings or documents.
you can define a new column in your dataset not as an outside vector and use it. you can use color values and use scale_fill_identity()
What is exactly your use case and why you need/want an outside vector ? of course using an outside vector means that you will take ownership of the ordering as opposed to having it part of the data and then sorting is applied to the whole data.
library(tidyverse)
library(patchwork)
ToothGrowth$dot_cols <- ifelse(ToothGrowth$dose=="0.5", "red","blue")
ggplot(ToothGrowth, aes(x = len)) +
geom_dotplot(aes(fill = dot_cols))+
scale_fill_identity()+
facet_grid(~dose)
#> Bin width defaults to 1/30 of the range of the data. Pick better value with `binwidth`.
Created on 2022-04-29 by the reprex package (v2.0.1)
@smouksassi It's not that I have a personal use case. I usually either use a column for the fill
aesthetic, or a vector of length 1, i.e. fill = 'black'
.
The "use case" would be more pedagogical. A student actually found the reported bug when writing code they thought was correct. Indeed, using geom_point
and the colour
aesthetic, the code would behave as expected. I try to teach best practices, but sometimes students don't follow these guidelines.
From my perspective, and the perspective of my students, I think one of the following three solutions would work:
- When the dataset is sorted, so are the aesthetics.
- Error when passing a vector of length greater than 1 as a fill or colour aesthetic.
- Warning when passing a vector of length greater than 1 as a fill or colour aesthetic aesthetic.
I can certainly provide a pull request for one of the last two options if you want.
Thanks, I think we all are agreeing on this option (I'd like to implement this as a general one, not limited to geom_dotplot()
, not limited to colour
or fill
).
Warning when passing a vector of length greater than 1
Note that, we cannot and should not expect how the data gets aggregated and sorted. It's usually obvious when stat
is not identity
, but StatBindot
, which is used in dot plot. is a rare statistical computation in that the length of the computed data is the same as the original (if I understand correctly), so I understand one may fall into a wrong sense...
Would it make sense to add the warning in the function check_aesthetics
? Or would that be too general?
https://github.com/tidyverse/ggplot2/blob/a6c22e7185c3a895111ed97f54278958c9389255/R/geom-.r#L210-L222
In the if
statement and before the return
statement, there could be a check if (any(ns == n))
, and if the condition is TRUE
, the warning would be triggered, telling the user that it is not recommended and can lead to undesired behaviour.
that would be too general as the function is also used for cases where that is allowed - better to put the warning in use_defaults()
...
But before we go down that road we need to make sure that there are no legitimate case for passing aesthetics as parameters with length > 1... Team, can you weigh in?
I feel we've been at this point before. For some reason people love to use this feature (passing in data as parameters rather than via aes()
), and most of the time it works, and sometimes it doesn't, and sometimes it actually solves a real problem. My solution would be to leave the code as is but state very prominently, maybe at multiple places in the documentation, that this practice is not encouraged and may not lead to expected results.
most of the time it works, and sometimes it doesn't, and sometimes it actually solves a real problem
I don't remember well, but I get this point. However, I think almost all of those cases are stat_identity()
. Can we warn only when the stat
is not identity
? Geom$use_default()
cannot refer to the layer's information, so it looks difficult, though.
I think there are legitimate use-cases for providing a vector of values outside aes()
. Mostly, you'd want to bypass mapping these values to a scale. I'd also expect that these values are parallel to the input data and follow recycling rules (1 or nrow(data)
, otherwise error). I you want to prevent passing aesthetics as vectors outside aes()
, I can get behind that, if there is an alternate way of bypassing scales.
For example, if this is my preferred outcome for a plot:
library(ggplot2)
df <- data.frame(
x = 1:2, y = 1:2,
colour = c("blue", "red"),
group = c("A", "B")
)
ggplot(df, aes(x, y)) +
geom_tile(aes(colour = group), fill = NA, linewidth = 2) +
geom_point(colour = df$colour)
Then I want a way to pass values verbatim as (unmapped) colour aesthetic, without bothering with a scale. Normally you'd use scale_*_identity()
to do that, but that doesn't work when another layer does use the scale for that aesthetic.
ggplot(df, aes(x, y)) +
geom_tile(aes(colour = group), fill = NA, linewidth = 2) +
geom_point(aes(colour = I(colour)))
Created on 2023-01-04 by the reprex package (v2.0.1)
Maybe, all scales should just ignore AsIs
class objects in their training, mapping, and censoring, but that discussion might be worth separating from this one.
@teunbrand can you open a new issue to discuss special treatment of I()
input?
As for this, providing a vector of values outside of aes()
is always undefined behaviour. The only reason why we don't throw an error is because it would break too many old plots. Allowing it for some stats would only lead to more confusion IMO
About the following point:
StatBindot, which is used in dot plot. is a rare statistical computation in that the length of the computed data is the same as the original
This is sort of true and false at the same time. The stats computes a summary per bin, which the geom expands again into invidual dots based on the count per bin. So while yes, the number of points in the output is the same as the input, there is no reasonable way to preserve non-constant information about individual observations.