ggplot2 icon indicating copy to clipboard operation
ggplot2 copied to clipboard

Mismatched fill colour with geom_dotplot

Open turgeonmaxime opened this issue 2 years ago • 11 comments

When using geom_dotplot and specifying a fill aesthetic, we get a different output depending on whether fill is specified inside the function aes or as a separate vector. In the example below, I would expect all points below x~12 to be the same colour in both graphs.

library(tidyverse)
library(patchwork)

# Use 'dose' inside aes
gg1 <- ggplot(ToothGrowth, aes(x = len)) +
    geom_dotplot(aes(fill = factor(dose)),
                 binwidth = .5) +
    guides(fill = "none") # Remove legend

# Create separate vector for fill
dot_cols <- as.integer(factor(ToothGrowth$dose))
gg2 <- ggplot(ToothGrowth, aes(x = len)) +
    geom_dotplot(fill = dot_cols,
                 binwidth = .5)

gg1 + gg2

Created on 2022-04-28 by the reprex package (v2.0.1)

As we can see, on the right, there are about 5 points between x=10 and x=12 that should be black but are actually red.

As far as I can tell, this could be a consequence of the data being sorted with respect to x, which means that the rows of the dataset no longer match the elements of the vector dot_cols. As a proof of concept, I recreated the second graph but I sorted the data first (I also used the group aesthetic to try to get as close as possible to the original graph):

library(tidyverse)

# Repeat last graph but reorder data first
ToothGrowth2 <- arrange(ToothGrowth, len)
dot_cols2 <- as.integer(factor(ToothGrowth2$dose))
ggplot(ToothGrowth, aes(x = len, group = factor(dose))) +
    geom_dotplot(fill = dot_cols2, 
                 binwidth = .5)

Created on 2022-04-28 by the reprex package (v2.0.1)

Now all points below x~12 are black, as expected.

turgeonmaxime avatar Apr 28 '22 21:04 turgeonmaxime

Just realized I forgot to add my session info:

sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.0 (2022-04-22)
#>  os       macOS Monterey 12.3.1
#>  system   aarch64, darwin20
#>  ui       X11
#>  language (EN)
#>  collate  en_CA.UTF-8
#>  ctype    en_CA.UTF-8
#>  tz       America/Edmonton
#>  date     2022-04-28
#>  pandoc   2.17.1.1 @ /Applications/RStudio.app/Contents/MacOS/quarto/bin/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.2.0)
#>  backports     1.4.1   2021-12-13 [1] CRAN (R 4.2.0)
#>  broom         0.8.0   2022-04-13 [1] CRAN (R 4.2.0)
#>  cellranger    1.1.0   2016-07-27 [1] CRAN (R 4.2.0)
#>  cli           3.2.0   2022-02-14 [1] CRAN (R 4.2.0)
#>  colorspace    2.0-3   2022-02-21 [1] CRAN (R 4.2.0)
#>  crayon        1.5.1   2022-03-26 [1] CRAN (R 4.2.0)
#>  DBI           1.1.2   2021-12-20 [1] CRAN (R 4.2.0)
#>  dbplyr        2.1.1   2021-04-06 [1] CRAN (R 4.2.0)
#>  digest        0.6.29  2021-12-01 [1] CRAN (R 4.2.0)
#>  dplyr       * 1.0.8   2022-02-08 [1] CRAN (R 4.2.0)
#>  ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.2.0)
#>  evaluate      0.15    2022-02-18 [1] CRAN (R 4.2.0)
#>  fansi         1.0.3   2022-03-24 [1] CRAN (R 4.2.0)
#>  fastmap       1.1.0   2021-01-25 [1] CRAN (R 4.2.0)
#>  forcats     * 0.5.1   2021-01-27 [1] CRAN (R 4.2.0)
#>  fs            1.5.2   2021-12-08 [1] CRAN (R 4.2.0)
#>  generics      0.1.2   2022-01-31 [1] CRAN (R 4.2.0)
#>  ggplot2     * 3.3.5   2021-06-25 [1] CRAN (R 4.2.0)
#>  glue          1.6.2   2022-02-24 [1] CRAN (R 4.2.0)
#>  gtable        0.3.0   2019-03-25 [1] CRAN (R 4.2.0)
#>  haven         2.5.0   2022-04-15 [1] CRAN (R 4.2.0)
#>  highr         0.9     2021-04-16 [1] CRAN (R 4.2.0)
#>  hms           1.1.1   2021-09-26 [1] CRAN (R 4.2.0)
#>  htmltools     0.5.2   2021-08-25 [1] CRAN (R 4.2.0)
#>  httr          1.4.2   2020-07-20 [1] CRAN (R 4.2.0)
#>  jsonlite      1.8.0   2022-02-22 [1] CRAN (R 4.2.0)
#>  knitr         1.38    2022-03-25 [1] CRAN (R 4.2.0)
#>  lifecycle     1.0.1   2021-09-24 [1] CRAN (R 4.2.0)
#>  lubridate     1.8.0   2021-10-07 [1] CRAN (R 4.2.0)
#>  magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.2.0)
#>  modelr        0.1.8   2020-05-19 [1] CRAN (R 4.2.0)
#>  munsell       0.5.0   2018-06-12 [1] CRAN (R 4.2.0)
#>  patchwork   * 1.1.1   2020-12-17 [1] CRAN (R 4.2.0)
#>  pillar        1.7.0   2022-02-01 [1] CRAN (R 4.2.0)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.2.0)
#>  purrr       * 0.3.4   2020-04-17 [1] CRAN (R 4.2.0)
#>  R.cache       0.15.0  2021-04-30 [1] CRAN (R 4.2.0)
#>  R.methodsS3   1.8.1   2020-08-26 [1] CRAN (R 4.2.0)
#>  R.oo          1.24.0  2020-08-26 [1] CRAN (R 4.2.0)
#>  R.utils       2.11.0  2021-09-26 [1] CRAN (R 4.2.0)
#>  R6            2.5.1   2021-08-19 [1] CRAN (R 4.2.0)
#>  readr       * 2.1.2   2022-01-30 [1] CRAN (R 4.2.0)
#>  readxl        1.4.0   2022-03-28 [1] CRAN (R 4.2.0)
#>  reprex        2.0.1   2021-08-05 [1] CRAN (R 4.2.0)
#>  rlang         1.0.2   2022-03-04 [1] CRAN (R 4.2.0)
#>  rmarkdown     2.13    2022-03-10 [1] CRAN (R 4.2.0)
#>  rstudioapi    0.13    2020-11-12 [1] CRAN (R 4.2.0)
#>  rvest         1.0.2   2021-10-16 [1] CRAN (R 4.2.0)
#>  scales        1.2.0   2022-04-13 [1] CRAN (R 4.2.0)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.2.0)
#>  stringi       1.7.6   2021-11-29 [1] CRAN (R 4.2.0)
#>  stringr     * 1.4.0   2019-02-10 [1] CRAN (R 4.2.0)
#>  styler        1.7.0   2022-03-13 [1] CRAN (R 4.2.0)
#>  tibble      * 3.1.6   2021-11-07 [1] CRAN (R 4.2.0)
#>  tidyr       * 1.2.0   2022-02-01 [1] CRAN (R 4.2.0)
#>  tidyselect    1.1.2   2022-02-21 [1] CRAN (R 4.2.0)
#>  tidyverse   * 1.3.1   2021-04-15 [1] CRAN (R 4.2.0)
#>  tzdb          0.3.0   2022-03-28 [1] CRAN (R 4.2.0)
#>  utf8          1.2.2   2021-07-24 [1] CRAN (R 4.2.0)
#>  vctrs         0.4.1   2022-04-13 [1] CRAN (R 4.2.0)
#>  withr         2.5.0   2022-03-03 [1] CRAN (R 4.2.0)
#>  xfun          0.30    2022-03-02 [1] CRAN (R 4.2.0)
#>  xml2          1.3.3   2021-11-30 [1] CRAN (R 4.2.0)
#>  yaml          2.3.5   2022-02-21 [1] CRAN (R 4.2.0)
#> 
#>  [1] /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

Created on 2022-04-28 by the reprex package (v2.0.1)

turgeonmaxime avatar Apr 29 '22 00:04 turgeonmaxime

When you specify a column inside aes(), the values are scaled and mapped to a color map, and when outside they aren't.

This sort of question is a better fit for RStudio Community. Please ask there with a minimal reproducible example. We use this issue tracker for bug reports and feature requests.

yutannihilation avatar Apr 29 '22 02:04 yutannihilation

@yutannihilation I'm sorry, I think you misunderstood the point of my post. So let me be more explicit: if I pass a vector to fill where the n-th entry is "black" (or 1), I expect the dot corresponding to the n-th observation to be black. But as my reprex shows, this is not the case.

In other words, I'm reporting what I believe to be a bug in ggplot2.

turgeonmaxime avatar Apr 29 '22 02:04 turgeonmaxime

Ah, sorry I missed your point. I think you are right, but I believe passing multiple values outside aes() is not a recommended usage anyway.

As far as I can tell, this could be a consequence of the data being sorted with respect to x, which means that the rows of the dataset no longer match the elements of the vector dot_cols.

I'm reopening this because there might be some room for better warnings or documents.

yutannihilation avatar Apr 29 '22 02:04 yutannihilation

you can define a new column in your dataset not as an outside vector and use it. you can use color values and use scale_fill_identity()

What is exactly your use case and why you need/want an outside vector ? of course using an outside vector means that you will take ownership of the ordering as opposed to having it part of the data and then sorting is applied to the whole data.

library(tidyverse)
library(patchwork)

ToothGrowth$dot_cols <- ifelse(ToothGrowth$dose=="0.5", "red","blue")


ggplot(ToothGrowth, aes(x = len)) +
  geom_dotplot(aes(fill = dot_cols))+
  scale_fill_identity()+
  facet_grid(~dose)
#> Bin width defaults to 1/30 of the range of the data. Pick better value with `binwidth`.

Created on 2022-04-29 by the reprex package (v2.0.1)

smouksassi avatar Apr 29 '22 16:04 smouksassi

@smouksassi It's not that I have a personal use case. I usually either use a column for the fill aesthetic, or a vector of length 1, i.e. fill = 'black'.

The "use case" would be more pedagogical. A student actually found the reported bug when writing code they thought was correct. Indeed, using geom_point and the colour aesthetic, the code would behave as expected. I try to teach best practices, but sometimes students don't follow these guidelines.

From my perspective, and the perspective of my students, I think one of the following three solutions would work:

  • When the dataset is sorted, so are the aesthetics.
  • Error when passing a vector of length greater than 1 as a fill or colour aesthetic.
  • Warning when passing a vector of length greater than 1 as a fill or colour aesthetic aesthetic.

I can certainly provide a pull request for one of the last two options if you want.

turgeonmaxime avatar Apr 29 '22 16:04 turgeonmaxime

Thanks, I think we all are agreeing on this option (I'd like to implement this as a general one, not limited to geom_dotplot(), not limited to colour or fill).

Warning when passing a vector of length greater than 1

Note that, we cannot and should not expect how the data gets aggregated and sorted. It's usually obvious when stat is not identity, but StatBindot, which is used in dot plot. is a rare statistical computation in that the length of the computed data is the same as the original (if I understand correctly), so I understand one may fall into a wrong sense...

yutannihilation avatar Apr 30 '22 00:04 yutannihilation

Would it make sense to add the warning in the function check_aesthetics? Or would that be too general? https://github.com/tidyverse/ggplot2/blob/a6c22e7185c3a895111ed97f54278958c9389255/R/geom-.r#L210-L222

In the if statement and before the return statement, there could be a check if (any(ns == n)), and if the condition is TRUE, the warning would be triggered, telling the user that it is not recommended and can lead to undesired behaviour.

turgeonmaxime avatar May 05 '22 19:05 turgeonmaxime

that would be too general as the function is also used for cases where that is allowed - better to put the warning in use_defaults()...

But before we go down that road we need to make sure that there are no legitimate case for passing aesthetics as parameters with length > 1... Team, can you weigh in?

thomasp85 avatar May 12 '22 08:05 thomasp85

I feel we've been at this point before. For some reason people love to use this feature (passing in data as parameters rather than via aes()), and most of the time it works, and sometimes it doesn't, and sometimes it actually solves a real problem. My solution would be to leave the code as is but state very prominently, maybe at multiple places in the documentation, that this practice is not encouraged and may not lead to expected results.

clauswilke avatar May 12 '22 13:05 clauswilke

most of the time it works, and sometimes it doesn't, and sometimes it actually solves a real problem

I don't remember well, but I get this point. However, I think almost all of those cases are stat_identity(). Can we warn only when the stat is not identity? Geom$use_default() cannot refer to the layer's information, so it looks difficult, though.

yutannihilation avatar May 12 '22 14:05 yutannihilation

I think there are legitimate use-cases for providing a vector of values outside aes(). Mostly, you'd want to bypass mapping these values to a scale. I'd also expect that these values are parallel to the input data and follow recycling rules (1 or nrow(data), otherwise error). I you want to prevent passing aesthetics as vectors outside aes(), I can get behind that, if there is an alternate way of bypassing scales.

For example, if this is my preferred outcome for a plot:

library(ggplot2)
df <- data.frame(
  x = 1:2, y = 1:2, 
  colour = c("blue", "red"),
  group = c("A", "B")
)

ggplot(df, aes(x, y)) +
  geom_tile(aes(colour = group), fill = NA, linewidth = 2) +
  geom_point(colour = df$colour)

Then I want a way to pass values verbatim as (unmapped) colour aesthetic, without bothering with a scale. Normally you'd use scale_*_identity() to do that, but that doesn't work when another layer does use the scale for that aesthetic.

ggplot(df, aes(x, y)) +
  geom_tile(aes(colour = group), fill = NA, linewidth = 2) +
  geom_point(aes(colour = I(colour)))

Created on 2023-01-04 by the reprex package (v2.0.1)

Maybe, all scales should just ignore AsIs class objects in their training, mapping, and censoring, but that discussion might be worth separating from this one.

teunbrand avatar Jan 04 '23 16:01 teunbrand

@teunbrand can you open a new issue to discuss special treatment of I() input?

As for this, providing a vector of values outside of aes() is always undefined behaviour. The only reason why we don't throw an error is because it would break too many old plots. Allowing it for some stats would only lead to more confusion IMO

thomasp85 avatar Jan 09 '23 08:01 thomasp85

About the following point:

StatBindot, which is used in dot plot. is a rare statistical computation in that the length of the computed data is the same as the original

This is sort of true and false at the same time. The stats computes a summary per bin, which the geom expands again into invidual dots based on the count per bin. So while yes, the number of points in the output is the same as the input, there is no reasonable way to preserve non-constant information about individual observations.

teunbrand avatar Mar 15 '24 11:03 teunbrand