dplyr
Feature request/Question: do not drop extra classes (and attributes) with functions such as group_by and summarise
We are building some packages on top of the dplyr + dbplyr infrastructure (very grateful for that), and we define some classes such as 'generated_cohort_set', 'cdm_reference', 'cdm_table' and 'codelist'.
One problem that we are facing is that some functions (group_by, summarise, ...) drop the classes (see reprex). I guess that this is on purpose, but I am wondering why, and whether preserving them is something that could be considered in the future?
Here are the packages, if you are curious: https://cran.r-project.org/web/packages/CDMConnector/index.html, https://cran.r-project.org/web/packages/DrugUtilisation/index.html, https://cran.r-project.org/web/packages/PatientProfiles/index.html, https://cran.r-project.org/web/packages/IncidencePrevalence/index.html ...
x <- dplyr::tibble(a = 1)
class(x) <- c("my_class", class(x))
class(x)
#> [1] "my_class" "tbl_df" "tbl" "data.frame"
x |> dplyr::mutate(b = 1) |> class()
#> [1] "my_class" "tbl_df" "tbl" "data.frame"
x |> dplyr::group_by(a) |> class()
#> [1] "grouped_df" "tbl_df" "tbl" "data.frame"
Created on 2023-12-05 with reprex v2.0.2
FYI @edward-burn @ablack3
@catalamarti I was facing the same issue before and there is some documentation on how to extend tibbles here: https://dplyr.tidyverse.org/reference/dplyr_extending.html
There they also state that for example dplyr::group_by and dplyr::ungroup do drop attributes and classes.
Unfortunately, if you have custom attributes, they are dropped even if they don't depend on the rows or columns, contrary to what is documented in the vignette.
I am currently writing a short post on how I ended up solving it and will link it here once I am done.
Just wondering if there is any potential resolution to this @hadley. The page linked above, https://dplyr.tidyverse.org/reference/dplyr_extending.html, also says "These functions are a stop-gap measure", so I'm not sure whether to incorporate them in packages that depend on dplyr, or whether the better approach (at least in the short term) is to create a method for every dplyr verb to handle the above situations?
group_by() creates a fundamentally different type of data structure, and we have no way of knowing if it is compatible with your class, so we have to drop it. If you want to support a grouped data frame structure then you can write an S3 method for group_by(), but it is typically easier to use something like mutate(.by =), as that will preserve your class and let you do the grouped operation. You don't have to worry about the grouped_df class at all, because it never exists in that workflow.

summarise() similarly builds off the data from group_data(), which is always a bare tibble or bare data frame. In the same vein as group_by(), we don't know if the summarized table (which has a very different structure than the original one) is still compatible with your class, so we drop it. You'd also need an S3 method for this.
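To make the difference concrete, here is a minimal sketch in the style of the reprex above. It assumes dplyr >= 1.1.0 (where the .by argument is available); the class name "my_class" and the columns g and v are purely illustrative.

```r
library(dplyr)

# Illustrative subclass, built the same way as in the reprex above
x <- tibble(g = c("a", "a", "b"), v = c(1, 2, 3))
class(x) <- c("my_class", class(x))

# Per-operation grouping: no grouped_df is created, so the class survives
y <- x |> mutate(total = sum(v), .by = g)
class(y)
#> [1] "my_class"   "tbl_df"     "tbl"        "data.frame"

# summarise() builds a new table from the group keys, so the class is dropped
z <- x |> summarise(total = sum(v), .by = g)
class(z)
#> [1] "tbl_df"     "tbl"        "data.frame"
```

So mutate(.by =) covers the case where the result has the same rows as the input, while anything that changes the shape of the table still needs an explicit method.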
This is documented here https://dplyr.tidyverse.org/reference/dplyr_extending.html and here https://dplyr.tidyverse.org/reference/summarise.html#value
tsibble is an example of a tibble subclass that has support for custom grouped data frames and a custom summarise method, if you want to look at that. It is also a good example of how dplyr can't know if the result of summarise() is valid for your class or not. In some cases the result is still a tsibble; in other cases a bare tibble is returned. https://github.com/tidyverts/tsibble
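A much simplified sketch of that idea (not tsibble's actual implementation) could look like the following. The class "my_class" and its validity rule, that the table must still contain an id column, are hypothetical and only stand in for whatever invariant your own class has. It again assumes dplyr >= 1.1.0 for .by.

```r
library(dplyr)

# Hypothetical rule: a "my_class" table is only valid if it has an `id` column.
summarise.my_class <- function(.data, ...) {
  # Let dplyr's data frame method do the work; it returns a bare tibble
  out <- NextMethod()

  # Only the class author can decide whether the result is still valid
  if ("id" %in% names(out)) {
    class(out) <- c("my_class", class(out))
  }
  out
}

x <- tibble(id = c(1, 1, 2), v = c(10, 20, 30))
class(x) <- c("my_class", class(x))

# `id` is kept as the grouping key, so the class is restored
kept <- x |> summarise(total = sum(v), .by = id)
inherits(kept, "my_class")

# No `id` column in the result, so a bare tibble is returned
dropped <- x |> summarise(total = sum(v))
inherits(dropped, "my_class")
```

Defining the method at the top level is enough for a script; in a package it would need to be registered as an S3 method for dplyr's generic.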
@catalamarti Took some time to write my article due to a lot of things going on, but if it still helps you, here it is: https://www.bio-ai.org/blog/extending-tibbles/