ggplot2
Requirement of dplyr for non-tibble tbls
Due to the fortify method for <tbl> objects, {dplyr} is required for plotting with non-tibble <tbl> classes: https://github.com/tidyverse/ggplot2/blob/0e64d9c56ccc8db31971723810c3c10f0a67d9e4/R/fortify.r#L19-L22
I view {pillar} and the <tbl> subclass as mainly for formatting, so I wondered if it would be possible to simply dispatch to the next data.frame method and have explicit methods for the other <tbl> subclasses that need supporting separately.
I've also raised this in the pillar repo to get their opinion of how best to view the
Hope this all makes sense. Example below (on an install where dplyr has been removed):
library(tibble)
library(ggplot2)
x <- y <- 1:2
tbl <- dat <- data.frame(x, y)
class(tbl) <- c("tbl", "data.frame")
tbl
#> # A data frame: 2 × 2
#> x y
#> <int> <int>
#> 1 1 1
#> 2 2 2
# this errors
ggplot(tbl, aes(x,y)) + geom_col()
#> Error: dplyr must be installed to work with tbl objects
# this will print ok
ggplot(as_tibble(dat), aes(x,y)) + geom_col()

Created on 2022-03-31 by the reprex package (v2.0.1)
The standard class for data.frames in tibble is tbl_df, which does not require dplyr to handle:
> class(tibble::as_tibble(mtcars))
[1] "tbl_df" "tbl" "data.frame"
https://github.com/tidyverse/ggplot2/blob/ae7fb41d33fd629cddbea484fc37b5f02dea3c41/R/fortify.r#L17 The code you are referring to is in place to allow support for the different backends such as databases etc which will require dplyr to handle correctly.
@thomasp85 I think my feeling was that these different back ends should have explicit support, rather than catch everything via the <tbl> class. Alternatively, and perhaps more reasonably, the emphasis could/should be on those backends to provide an as.data.frame method which {ggplot2} would then call for the fortify.tbl() method.
I'm not sure I understand the issue. The tbl class will require dplyr, unless we are using the tbl_df subclass. What do you want to achieve by having the dependency on dplyr be hidden in pillar?
Apologies it's always tricky to communicate in issues. I'm not asking for the dependency of dplyr to be hidden in pillar. I'll try and break down what I mean:
- As far as I understand the intent of {pillar}, there is nothing about an object of class c("tbl", "data.frame") that should require it to be treated any differently from a data frame. This seems to be supported by how it is utilised in the compat-vctrs.R file of rlang, for instance (https://github.com/r-lib/rlang/blob/36b3c2146e10f9314e7bb77987015f2f9fe73ab3/R/compat-vctrs.R#L25-L30).
- If we take the above point as correct, then it would make sense for fortify.tbl to just call the fortify.data.frame method.
- Because ggplot2 is trying to support known use cases of the non-tibble <tbl> subclasses (I guess predominantly the database ones), it is calling dplyr::collect() (knowing that these packages have defined a collect method).
- I'd argue that this is not ideal for ggplot2, as it forces you to have knowledge of these external uses and how these objects behave. If you did want to target them, I'd have thought being explicit would be better (perhaps a fortify.tbl_sql() method is sufficient).
- If you did want to protect against non-data.frame type uses for <tbl>s, I think it would be better to call as.data.frame() on them and put the burden on the other packages to provide that method (in fact {dbplyr} already provides a method like this, which wraps dplyr::collect()).
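The proposal in the list above can be sketched in base R. This is only an illustration under stated assumptions: fortify_sketch() is a hypothetical stand-in for ggplot2's fortify() generic (so that the example runs without ggplot2 or dplyr), and it is not the package's actual implementation.

```r
# Hypothetical stand-in for ggplot2's fortify() generic, so the sketch
# runs without ggplot2 or dplyr installed.
fortify_sketch <- function(model, ...) UseMethod("fortify_sketch")

# Plain data frames pass through unchanged.
fortify_sketch.data.frame <- function(model, ...) model

# Proposed behaviour for <tbl>: defer to as.data.frame(), leaving it to
# each backend package to supply a suitable method.
fortify_sketch.tbl <- function(model, ...) {
  out <- as.data.frame(model)
  if (!is.data.frame(out)) {
    stop("as.data.frame() did not return a data frame for this <tbl>")
  }
  out
}

# Same object as in the first reprex: a data frame with only "tbl" added.
dat <- data.frame(x = 1:2, y = 1:2)
tbl <- dat
class(tbl) <- c("tbl", "data.frame")

# Dispatches to the tbl method, which falls back to base R's
# as.data.frame() method for data frames.
plain <- fortify_sketch(tbl)
class(plain)
```

With this shape, a bare c("tbl", "data.frame") object needs nothing beyond base R, while a database-backed class would only work if its package registers an as.data.frame() method.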
Note this only came up when I realised that in order to use {pillar} formatting in a vignette which also utilised {ggplot2} I needed to add {dplyr} as an additional suggested dependency. Nothing major but thought it felt a little odd.
Hope all of this makes sense and no worries if you want to keep the issue closed.
Best
Hmmm... it is my clear understanding that tbl is a virtual class and you shouldn't really have objects that aren't subclasses of it. If you want to make a special data.frame like object you should subclass tbl_df.
Can you point me to the vignette where you are experiencing this behaviour?
@hadley have I misunderstood the class structure?
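To make the class-structure point concrete, here is a minimal base R sketch of how S3 dispatch treats the two class vectors being discussed. dispatch_sketch() is a hypothetical generic invented for this illustration; the class vectors are set by hand purely to show dispatch order (a real tibble subclass would more likely be built with tibble::new_tibble()).

```r
# Hypothetical generic that mimics the method layout described above:
# one method for tbl_df, one for tbl, one for data.frame.
dispatch_sketch <- function(x) UseMethod("dispatch_sketch")
dispatch_sketch.tbl_df <- function(x) "tbl_df"
dispatch_sketch.tbl <- function(x) "tbl"
dispatch_sketch.data.frame <- function(x) "data.frame"

df <- data.frame(Category = c("Non-fiction", "Fiction"))

# Subclass of <tbl> only: falls through to the tbl method.
plain_tbl <- structure(df, class = c("books_tbl", "tbl", "data.frame"))

# Subclass of tbl_df: the tbl_df method is found first.
tibble_like <- structure(df, class = c("books_tbl", "tbl_df", "tbl", "data.frame"))

dispatch_sketch(plain_tbl)
dispatch_sketch(tibble_like)
```

The first object reaches the tbl method (the one that needs dplyr in ggplot2), whereas the second is caught earlier by the tbl_df method.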
If you want to make a special data.frame like object you should subclass tbl_df.
pillar advertises making data.frame like objects as a subclass of <tbl> for the formatting benefits.
Can you point me to the vignette where you are experiencing this behaviour?
I'm afraid it's an internal (work) facing package but I'm not sure it adds to the discussion either way.
I'll put another example below that hopefully illustrates the role I thought {pillar} plays and why I'd like it to work in {ggplot2} without the need for {dplyr}. Again, I've just temporarily uninstalled {dplyr}.
library(pillar)
library(ggplot2)
# create a new book_tbl class to record books
my_library <- vctrs::new_data_frame(
list(Book = c("Advanced R", "R packages", "Mort"),
Author = c("H.Wickham", "H.Wickham", "T.Pratchett"),
Category = c("Non-fiction", "Non-fiction", "Fiction")),
class = c("books_tbl", "tbl")
)
# use pillar to customise the printing
tbl_sum.books_tbl <- function(x, ...) {
c("My library" = sprintf("%d books", nrow(x)))
}
# print the nice looking library
my_library
#> # My library: 3 books
#> Book Author Category
#> <chr> <chr> <chr>
#> 1 Advanced R H.Wickham Non-fiction
#> 2 R packages H.Wickham Non-fiction
#> 3 Mort T.Pratchett Fiction
# this will error
ggplot(my_library, aes(Category)) + geom_bar()
#> Error: dplyr must be installed to work with tbl objects
# this is fine
class(my_library) <- "data.frame"
ggplot(my_library, aes(Category)) + geom_bar()

Created on 2022-04-01 by the reprex package (v2.0.1)
I'll await Hadley's comments on this, but IMO this is not expected usage. If you simply want a data.frame with nice formatting you create one with tibble, but I may be wrong in that regard. pillar is not a user facing package and I wouldn't read its documentation with the eyes of an end user.
To clarify: I am using pillar as a package developer, not as an end user. The examples above are just illustrations as to why.
If you did want to protect against non-data.frame type uses for <tbl>s I think it would be better to call as.data.frame() on them and put the burden on to the other packages to provide that method (In fact {dbplyr} already provides a method like this which wraps dplyr::collect()).
This sounds good to me.
Let me reopen the issue as it seems the discussion will continue at least for a while.
I think we can probably change this, but someone will need to carefully explore which tbl subclasses currently rely on fortify.tbl and figure out what methods will be needed to replace them.
Hi all, is there a risk of a user accidentally importing a large amount of data from databases or Spark? I fear that someone may pipe a remote tbl into ggplot(), or otherwise pass one to it, without realising that all of the data will be collected into R at that point.
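A rough base R sketch of the concern, assuming a made-up lazy table class: new_lazy_sketch() and the lazy_sketch class are hypothetical stand-ins for a database or Spark backend, not anything provided by dbplyr or sparklyr. The message() call is only there to make the implicit collection visible.

```r
# Hypothetical lazy table: stores a row count and a function that
# materialises the rows, standing in for a remote backend.
new_lazy_sketch <- function(n_rows) {
  structure(
    list(
      n_rows = n_rows,
      fetch = function() data.frame(id = seq_len(n_rows))
    ),
    class = c("lazy_sketch", "tbl")
  )
}

# The backend's as.data.frame() method pulls every row into memory.
as.data.frame.lazy_sketch <- function(x, ...) {
  message("Collecting ", x$n_rows, " rows into R")
  x$fetch()
}

remote <- new_lazy_sketch(5000L)

# A plotting call that converts its input would trigger this step
# implicitly; here it is written out explicitly.
local_copy <- as.data.frame(remote)
nrow(local_copy)
```

Nothing in the conversion step limits how many rows come back, so the size of the transfer depends entirely on what the remote table refers to.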