Requirement of dplyr for non-tibble tbls

Open TimTaylor opened this issue 2 years ago • 11 comments

Due to the fortify method for <tbl> objects, {dplyr} is required for plotting with non-tibble <tbl> classes: https://github.com/tidyverse/ggplot2/blob/0e64d9c56ccc8db31971723810c3c10f0a67d9e4/R/fortify.r#L19-L22

I view {pillar} and the <tbl> subclass as mainly for formatting, so I wondered if it would be possible to simply dispatch to the next data.frame method and have explicit methods for other <tbl> subclasses that need supporting separately.
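
A minimal sketch of the dispatch proposed here. It uses a hypothetical generic, `fortify_sketch()`, in place of ggplot2's own `fortify()` so that it runs in base R with neither {ggplot2} nor {dplyr} installed. The `tbl` method defers to the next class in the chain via `NextMethod()`, which for an object of class c("tbl", "data.frame") is the data.frame method. This illustrates the idea only and is not ggplot2's implementation.

```r
# Hypothetical generic standing in for ggplot2::fortify(); not the real API.
fortify_sketch <- function(model, ...) UseMethod("fortify_sketch")

# Data frames are already in a plottable shape, so return them unchanged.
fortify_sketch.data.frame <- function(model, ...) model

# Proposed behaviour: a bare <tbl> defers to the next class in its class
# vector instead of requiring dplyr.
fortify_sketch.tbl <- function(model, ...) NextMethod()

x <- y <- 1:2
tbl <- data.frame(x, y)
class(tbl) <- c("tbl", "data.frame")

out <- fortify_sketch(tbl)
identical(out, tbl)
#> [1] TRUE
```

Subclasses that are not data frames underneath would still need their own explicit methods under this approach.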

I've also raised this in the pillar repo to get their opinion of how best to view the class.

Hope this all makes sense. Example below (on an installation where dplyr has been removed):

library(tibble)
library(ggplot2)

x <- y <- 1:2
tbl <- dat <- data.frame(x, y)
class(tbl) <- c("tbl", "data.frame")
tbl
#> # A data frame: 2 × 2
#>       x     y
#>   <int> <int>
#> 1     1     1
#> 2     2     2

# this errors
ggplot(tbl, aes(x,y)) + geom_col()
#> Error: dplyr must be installed to work with tbl objects

# this will print ok
ggplot(as_tibble(dat), aes(x,y)) + geom_col()

Created on 2022-03-31 by the reprex package (v2.0.1)

TimTaylor avatar Mar 31 '22 10:03 TimTaylor

The standard class for data frames in tibble is tbl_df, which does not require dplyr to handle:

> class(tibble::as_tibble(mtcars))
[1] "tbl_df"     "tbl"        "data.frame"

https://github.com/tidyverse/ggplot2/blob/ae7fb41d33fd629cddbea484fc37b5f02dea3c41/R/fortify.r#L17

The code you are referring to is there to support the different backends, such as databases, which require dplyr to be handled correctly.

thomasp85 avatar Apr 01 '22 06:04 thomasp85

@thomasp85 I think my feeling was that these different back ends should have explicit support, rather than catch everything via the <tbl> class. Alternatively, and perhaps more reasonably, the emphasis could/should be on those backends to provide an as.data.frame method which {ggplot2} would then call for the fortify.tbl() method.

TimTaylor avatar Apr 01 '22 07:04 TimTaylor

I'm not sure I understand the issue. The tbl class will require dplyr unless we are using the tbl_df subclass. What do you want to achieve by having the dependency on dplyr be hidden in pillar?

thomasp85 avatar Apr 01 '22 15:04 thomasp85

Apologies, it's always tricky to communicate in issues. I'm not asking for the dependency on dplyr to be hidden in pillar. I'll try to break down what I mean:

  • As far as I understand the intent of {pillar}, there is nothing about an object of class c("tbl", "data.frame") that should require it to be treated any differently from a data frame. This seems to be supported by how it is used in the compat-vctrs.R file of rlang, for instance (https://github.com/r-lib/rlang/blob/36b3c2146e10f9314e7bb77987015f2f9fe73ab3/R/compat-vctrs.R#L25-L30).
  • If we take the above point as correct then it would make sense for fortify.tbl to just call the fortify.data.frame method.
  • Because ggplot2 is trying to support known use cases of the non-tibble <tbl> subclasses (I guess predominantly the database ones), it is calling dplyr::collect() (knowing that these packages have defined a collect method).
  • I'd argue that this is not ideal for ggplot2 as it forces you to have knowledge of these external uses and how these objects are behaving. If you did want to target them I'd have thought being explicit would be better (perhaps a fortify.tbl_sql() method is sufficient).
  • If you did want to protect against non-data.frame type uses for <tbl>s I think it would be better to call as.data.frame() on them and put the burden on to the other packages to provide that method (In fact {dbplyr} already provides a method like this which wraps dplyr::collect()).
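
The last bullet can be sketched as follows, again in base R. Everything here is hypothetical: `fortify_sketch()` stands in for ggplot2's `fortify()`, and `lazy_tbl` is an invented backend class that is a <tbl> but not a data frame. The point is only that the `tbl` method calls `as.data.frame()` and the package owning the subclass decides how the rows are materialised.

```r
# Hypothetical generic standing in for ggplot2::fortify(); not the real API.
fortify_sketch <- function(model, ...) UseMethod("fortify_sketch")
fortify_sketch.data.frame <- function(model, ...) model

# Proposed behaviour: any <tbl> is converted with as.data.frame().
fortify_sketch.tbl <- function(model, ...) as.data.frame(model)

# Invented backend class: holds only a function that fetches rows on demand.
new_lazy_tbl <- function(fetch) {
  structure(list(fetch = fetch), class = c("lazy_tbl", "tbl"))
}

# The backend package supplies the conversion method.
as.data.frame.lazy_tbl <- function(x, ...) x$fetch()

remote <- new_lazy_tbl(function() data.frame(x = 1:2, y = 1:2))
fortify_sketch(remote)  # rows are fetched via as.data.frame.lazy_tbl()

# A plain c("tbl", "data.frame") object falls through to base R's
# as.data.frame.data.frame(), which drops the classes ahead of "data.frame".
plain <- structure(data.frame(x = 1:2, y = 1:2), class = c("tbl", "data.frame"))
class(fortify_sketch(plain))
#> [1] "data.frame"
```

With this shape, the plotting package needs no knowledge of any particular backend; a subclass without an `as.data.frame()` method would fail at conversion rather than at a dplyr check.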

Note this only came up when I realised that in order to use {pillar} formatting in a vignette which also utilised {ggplot2} I needed to add {dplyr} as an additional suggested dependency. Nothing major but thought it felt a little odd.

Hope all of this makes sense and no worries if you want to keep the issue closed.

Best

TimTaylor avatar Apr 01 '22 15:04 TimTaylor

Hmmm... it is my clear understanding that tbl is a virtual class and you shouldn't really have objects that aren't subclasses of it. If you want to make a special data.frame like object you should subclass tbl_df.
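
For reference, a small sketch of what subclassing tbl_df looks like, assuming {tibble} is installed. The class name `books_tbl` and the data are illustrative only. Because "tbl_df" precedes "tbl" in the resulting class vector, S3 dispatch would reach a tbl_df method before a tbl one.

```r
library(tibble)

# Build the object as a subclass of tbl_df rather than of bare tbl.
my_library <- new_tibble(
  list(
    Book = c("Advanced R", "R packages", "Mort"),
    Category = c("Non-fiction", "Non-fiction", "Fiction")
  ),
  nrow = 3L,
  class = "books_tbl"  # illustrative class name
)

class(my_library)
#> [1] "books_tbl"  "tbl_df"     "tbl"        "data.frame"
```

The trade-off is that the object is then a tibble, so the package defining it takes on {tibble} as a dependency.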

Can you point me to the vignette where you are experiencing this behaviour?

@hadley have I misunderstood the class structure?

thomasp85 avatar Apr 01 '22 17:04 thomasp85

> If you want to make a special data.frame like object you should subclass tbl_df.

pillar advertises making data.frame like objects as a subclass of <tbl> for the formatting benefits.

> Can you point me to the vignette where you are experiencing this behaviour?

I'm afraid it's an internal (work) facing package but I'm not sure it adds to the discussion either way.

I'll put another example below that hopefully illustrates the role I thought {pillar} plays and why I'd like it to work in {ggplot2} without the need for {dplyr}. Again, I've just temporarily uninstalled {dplyr}.

library(pillar)
library(ggplot2)

# create a new books_tbl class to record books
my_library <- vctrs::new_data_frame(
    list(Book = c("Advanced R", "R packages", "Mort"),
         Author = c("H.Wickham", "H.Wickham", "T.Pratchett"),
         Category = c("Non-fiction", "Non-fiction", "Fiction")),
    class = c("books_tbl", "tbl")
)

# use pillar to customise the printing
tbl_sum.books_tbl <- function(x, ...) {
    c("My library" = sprintf("%d books", nrow(x)))
}

# print the nice looking library
my_library
#> # My library: 3 books
#>   Book       Author      Category   
#>   <chr>      <chr>       <chr>      
#> 1 Advanced R H.Wickham   Non-fiction
#> 2 R packages H.Wickham   Non-fiction
#> 3 Mort       T.Pratchett Fiction

# this will error
ggplot(my_library, aes(Category)) + geom_bar()
#> Error: dplyr must be installed to work with tbl objects

# this is fine
class(my_library) <- "data.frame"
ggplot(my_library, aes(Category)) + geom_bar()

Created on 2022-04-01 by the reprex package (v2.0.1)

TimTaylor avatar Apr 01 '22 20:04 TimTaylor

I'll await Hadley's comments on this, but IMO this is not expected usage. If you simply want a data.frame with nice formatting, you create one with tibble, but I may be wrong in that regard. pillar is not a user-facing package, and I wouldn't read its documentation with the eyes of an end user.

thomasp85 avatar Apr 01 '22 21:04 thomasp85

To clarify: I am using pillar as a package developer, not as an end user. The examples above are just illustrations as to why.

TimTaylor avatar Apr 01 '22 21:04 TimTaylor

> If you did want to protect against non-data.frame type uses for <tbl>s I think it would be better to call as.data.frame() on them and put the burden on to the other packages to provide that method (In fact {dbplyr} already provides a method like this which wraps dplyr::collect()).

This sounds good to me.

Let me reopen the issue as it seems the discussion will continue at least for a while.

yutannihilation avatar Apr 02 '22 01:04 yutannihilation

I think we can probably change this, but someone will need to carefully explore which tbl subclasses currently rely on fortify.tbl and figure out what methods will be needed to replace them.

hadley avatar Apr 04 '22 15:04 hadley

Hi all, is there a risk of a user accidentally importing a large amount of data from databases or Spark? I fear that someone may pipe their remote tbl into ggplot(), or otherwise use it there, and not realize that all of the data will be collected into R at that time.

edgararuiz avatar Sep 20 '23 17:09 edgararuiz