ggplot2 icon indicating copy to clipboard operation
ggplot2 copied to clipboard

Drop plot_env from ggplot2 objects?

Open clauswilke opened this issue 4 years ago • 22 comments

This comment https://github.com/tidyverse/ggplot2/issues/3619#issuecomment-628021555 prompted me to look into plot_env, and as far as I can see it's stored and handed around just so it can be eventually given to the function combine_vars(), which then doesn't use it: https://github.com/tidyverse/ggplot2/blob/e9b9946786dae861dea1c352f2ac7b2a837b5f82/R/facet-.r#L544-L587 Is this a hold-over from the times before tidy eval? Can we remove this/or assign NULL here? https://github.com/tidyverse/ggplot2/blob/5a686c34b304ddefeee16d44362f326cfa29634e/R/plot.r#L96

It is already marked as deprecated in the documentation: https://github.com/tidyverse/ggplot2/blob/5a686c34b304ddefeee16d44362f326cfa29634e/R/plot.r#L35

clauswilke avatar May 13 '20 15:05 clauswilke

Setting plot_env to NULL doesn't work, which means it is still used somewhere as an environment. However, setting it to an empty environment seems to work without issues.

library(ggplot2)

p <- ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point() +
  facet_wrap(~Species)

p$plot_env <- NULL
p
#> Error in exists(name, envir = env, mode = mode): use of NULL environment is defunct

p$plot_env <- rlang::new_environment()
p

Created on 2020-05-13 by the reprex package (v0.3.0)

It looks like the other place where the environment is used is for scale lookup. Maybe it would be better to use the global environment that is active when the plot is printed, rather than when the plot is created? That would be more predictable anyways, I assume. https://github.com/tidyverse/ggplot2/blob/5a686c34b304ddefeee16d44362f326cfa29634e/R/scale-type.R#L28-L39

clauswilke avatar May 13 '20 15:05 clauswilke

On the last point, just setting plot_env = rlang::new_environment() means we cannot globally change color scales anymore. So we'd have to fix this appropriately.

library(ggplot2)

p <- ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point() +
  facet_wrap(~Species)

scale_colour_discrete <- scale_colour_viridis_d
p


p$plot_env <- rlang::new_environment()
p

Created on 2020-05-13 by the reprex package (v0.3.0)

clauswilke avatar May 13 '20 15:05 clauswilke

Thanks for investigating this. I didn't know the mechanism around plot_env.

At first, ggplot2 only looks up in global environment, but started to check the package environment by this commit:

https://github.com/tidyverse/ggplot2/commit/a9f5e0a4016a58d612cf286291efe427f1bf6c9f

Then, find_global() was changed to use the parent environment instead of global environment by this commit:

https://github.com/tidyverse/ggplot2/commit/64666abc4626ab93033601c264a1747778c39dab

I agree it's nice if we can get rid of plot_env, but the fix might be tricky if we want to preserve the current behavior. For example, I'm afraid looking up only in global environment would break this:

library(ggplot2)

do_plot <- function() {
  p <- ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
    geom_point() +
    facet_wrap(~Species)
  
  scale_colour_discrete <- scale_colour_viridis_d
  
  p
}

p <- do_plot()
p


p$plot_env <- rlang::new_environment()
p

Created on 2020-05-14 by the reprex package (v0.3.0)

yutannihilation avatar May 14 '20 12:05 yutannihilation

I get that, but the question is do we want to encourage this usage pattern? Does anybody use it? Is it even logical? See the next example. Would you have expected the plot to use the brewer color scale?

library(ggplot2)

do_plot <- function() {
  p <- ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
    geom_point() +
    facet_wrap(~Species)
  
  scale_colour_discrete <- scale_colour_viridis_d
  
  p
}

p <- do_plot()

scale_colour_discrete <- scale_colour_brewer
p

Created on 2020-05-14 by the reprex package (v0.3.0)

My sense is that using the environment that is active when the plot is being printed is totally reasonable, and that doesn't require saving the environment in the plot itself. I also feel that if people write functions that return plots, then they should explicitly add the scales they want.

clauswilke avatar May 14 '20 14:05 clauswilke

do we want to encourage this usage pattern? Does anybody use it?

I really don't know, sorry. I didn't even know (or completely forget) that we can override the default scale in such a way... I think your claim is 99% valid, but I just don't see the context why ggplot2 needed to be changed to use parent environment instead of global environment.

yutannihilation avatar May 15 '20 03:05 yutannihilation

By the way, is this related to #2691? If we can set the default scale by setting the default theme, probably we don't need to tweak globals (or some environments) anymore.

yutannihilation avatar May 15 '20 03:05 yutannihilation

@hadley Do you remember why scale lookup in the parent environment was added? https://github.com/tidyverse/ggplot2/commit/64666abc4626ab93033601c264a1747778c39dab Would it be terrible to get rid of this feature?

clauswilke avatar May 15 '20 18:05 clauswilke

I have a vague recollection it was something to do with scale customisation. Maybe so that you could write your scale_colour_continuous() to override the defaults?

hadley avatar May 15 '20 18:05 hadley

I’m pretty sure that was the reason. I’m all for a better way of allowing users to set a default scale

thomasp85 avatar May 15 '20 19:05 thomasp85

Sure, but the question is: Is it sufficient to look either in the global environment or the parent environment of the print() call, or do we need to capture the local environment that is active when the plot is defined?

In other words, in the reprex in https://github.com/tidyverse/ggplot2/issues/3994#issuecomment-628683281, would it be fine if the color scale being used upon printing was the brewer scale rather than the viridis scale?

clauswilke avatar May 15 '20 19:05 clauswilke

I think if we went ahead and nuked the plot_env we might as well come up with a better solution for setting scale defaults than looking for specifically named functions

thomasp85 avatar May 15 '20 21:05 thomasp85

we might as well come up with a better solution for setting scale defaults

That creates a whole new set of problems, though. (https://github.com/tidyverse/ggplot2/issues/3962#issuecomment-621873544) I should probably create an issue just for that. We will need some mechanism to override breaks, labels, expansion, etc. without adding a new scale function.

clauswilke avatar May 15 '20 22:05 clauswilke

After implementing the p$plot_env <- rlang::new_environment() quick fix, the issue seemed resolved. However, after a recent update (or, possibly, a result of adopting the best-practices suggested in https://ggplot2.tidyverse.org/dev/articles/ggplot2-in-packages.html) I'm finding that other components of a ggplot seem to hold copies of environment objects. Consequently, saving a small list of ggplots with saveRDS or qs::qsave writes >100Gb of data to disk in my case. I've used butcher::weigh to better understand object sizes in the ggplot object: -

> butcher::weigh(p)
# A tibble: 137 x 2
   object                          size
   <chr>                          <dbl>
 1 mapping.x                 3639.     
 2 mapping.y                 3639.     
 3 layers2                   3639.     
 4 scales                       0.727  
 5 layers1                      0.574  
 6 coordinates                  0.255  
 7 facet                        0.178  
 8 data.clusters                0.00404
 9 theme.text.margin            0.00076
10 theme.axis.title.x.margin    0.00076
# … with 127 more rows
> 
> p$layers
[[1]]
geom_bar: width = NULL, na.rm = FALSE, orientation = NA
stat_identity: na.rm = FALSE
position_stack 

[[2]]
mapping: ymin = ~mean - se, ymax = ~mean + se 
geom_errorbar: na.rm = FALSE, orientation = NA, width = 0.2
stat_identity: na.rm = FALSE
position_dodge 

> p$mapping$x
<quosure>
expr: ^.data[["clusters"]]
env:  0x561f534a4500
> p$mapping$y
<quosure>
expr: ^mean
env:  0x561f534a4500
> 

It appears that quosures within the mapping and layers elements are responsible. Has there been a recent change?
Each quosure environment contains large objects unrelated to the plot or quosure expression. Any suggestions? A simple ggplot of data from a 10 row / 3 column tibble is now occupying ~12Gb on disk.

combiz avatar Jun 09 '20 18:06 combiz

It appears that quosures within the mapping and layers elements are responsible. Has there been a recent change?

I don't think so. Quosures capture environments in nature, and layers are environment itself.

@clauswilke Let me clarify one thing, is the motivation of this issue to clean up the unused internals, not to enable saving plots on disk, right?

yutannihilation avatar Jun 10 '20 09:06 yutannihilation

Let me clarify one thing, is the motivation of this issue to clean up the unused internals, not to enable saving plots on disk, right?

That's correct. I think saving plots on disk is a fundamentally flawed workflow. I don't think it will ever work properly, and I don't see any good reason to do so in the first place.

clauswilke avatar Jun 10 '20 14:06 clauswilke

Thanks, I totally agree with you.

yutannihilation avatar Jun 10 '20 15:06 yutannihilation

Btw, as https://github.com/tidyverse/ggplot2/pull/3828 gets merged, I guess overriding scale_colour_discrete on the global environment is now superseded. Maybe we can take these two steps:

  1. Warn when scale_colour_discrete is found on different environment than ggplot2's namespace, and encourage users to use options(ggplot2.discrete.fill).
  2. Just drop plot_env.

yutannihilation avatar Jul 26 '20 08:07 yutannihilation

Just to be clear: I think there are two separate issues here. One is how to globally set color scales. The other is whether any scales (not just color) should be searched in the environment that was active when the plot was defined or in the current global environment. I think the latter choice (search in the current global environment, thus keeping a copy of the environment is not needed) is fine.

clauswilke avatar Jul 26 '20 17:07 clauswilke

not just color

I see. Sorry for my confusion.

yutannihilation avatar Jul 27 '20 03:07 yutannihilation

Let me clarify one thing, is the motivation of this issue to clean up the unused internals, not to enable saving plots on disk, right?

That's correct. I think saving plots on disk is a fundamentally flawed workflow. I don't think it will ever work properly, and I don't see any good reason to do so in the first place.

One use case is to avoid having to redraw large and slow plots. If I can saveRDS the ggplot list object say then I can see whether it has changed and whether I need to redraw the entire plot. Also I can track this with git and avoid having to add large plots to my git repo. However, if I understand the above discussion, it's not that simple, since the entire R environment is potentially accessed during drawing; changes to the R env may or may not affect the drawn plot; not all the information is contained in the ggplot list.

woodwards avatar Jan 05 '22 23:01 woodwards

I think saving plots on disk is a fundamentally flawed workflow [...] and I don't see any good reason to do so

The use case we handle with this is a Shiny app that allows users to download a plot in a format such that it can easily be "reprinted" later. Different scientific journals may have different requirements regarding dimensions, fonts, etc., and we believe storing an rds file that contains both the plot and the data is good way to have both in one file while being able to modify some aspects later (do correct me if that is not correct, or if there are better alternatives). Other options (bitmaps, PDF files, ...) are all pretty complex to postprocess.

We certainly improved things by replacing plot_env by a new environment, but replacing mapping and layers by something meaningful turns out to be more difficult. Any ideas on this one?

bersbersbers avatar May 30 '22 22:05 bersbersbers

@bersbersbers this only works by coincidence and not by design. We make no guarantees about the ggplot2 internals and making it possible replaying saved plots from previous versions of ggplot2 would be a huge amount of work that we do not plan to do.

hadley avatar May 31 '22 13:05 hadley