dplyr
data.frame attributes are preserved on `mutate()` but dropped on `group_by |> mutate`
Reprex:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(purrr)
attr(mtcars, "test") <- "foo"
mtcars_grouped <- group_by(mtcars, cyl)
mtcars |>
mutate(new_col = 1) |>
attributes() |>
pluck("test")
#> [1] "foo"
mtcars_grouped |>
mutate(new_col = 1) |>
attributes() |>
pluck("test")
#> NULL
mtcars_grouped |>
mutate(across(where(is.complex), as.character)) |>
attributes() |>
pluck("test")
#> NULL
# did nothing but still lost attrs on the base data.frame
It would be better if no attributes were lost with mutate. Feels weird in the last case where no mutate actually gets done.
However, if group_by() |> mutate() must drop attributes, it's probably better that plain mutate() drops them too. As it stands, you can easily get tricked into thinking some code is going to work, but then it bombs when it accidentally gets passed some sticky groups. This happened to me today, 4 levels of package context up from where the mutate was.
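For package code that may receive grouped input, one possible guard is to save the attributes before calling mutate() and put back whichever ones went missing. The sketch below is only an illustration: `mutate_keep_attrs()` is a hypothetical helper, not a dplyr function, and it assumes the custom attributes stay valid after the mutate (which may not hold for attributes that describe rows or columns).

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical helper (not part of dplyr): run mutate() and then restore
# any data-frame attributes that are missing from the result.
mutate_keep_attrs <- function(.data, ...) {
  before <- attributes(.data)
  out <- mutate(.data, ...)
  # Restore only attributes the result no longer has, so the names,
  # row.names, class and groups produced by mutate() are left untouched.
  lost <- setdiff(names(before), names(attributes(out)))
  for (nm in lost) {
    attr(out, nm) <- before[[nm]]
  }
  out
}

# Set the attribute directly on the grouped data so the input is
# guaranteed to carry it.
grouped <- group_by(mtcars, cyl)
attr(grouped, "test") <- "foo"

res <- mutate_keep_attrs(grouped, new_col = 1)
attr(res, "test")
#> [1] "foo"
```

Because the helper only fills in what is missing, it behaves the same whether or not the installed dplyr version drops the attribute.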
Hi all,
I see the same issue with a simple mutate() call, where attributes actually get dropped; see for reference: https://github.com/ellessenne/comorbidity/issues/51
Would this be fixed by https://github.com/tidyverse/dplyr/pull/6102?
Thanks!
Alessandro
I find that mutate also drops attributes whenever there is a function call inside mutate, whereas simple arithmetic works.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(labelled)
mtcars %>% set_variable_labels(cyl="Cylinders") %>% pull(cyl) %>% attributes() # Fine
#> $label
#> [1] "Cylinders"
mtcars %>% set_variable_labels(cyl="Cylinders") %>% mutate(cyl = cyl + 2) %>% pull(cyl) %>% attributes() # Fine
#> $label
#> [1] "Cylinders"
mtcars %>% set_variable_labels(cyl="Cylinders") %>% mutate(cyl = max(cyl)) %>% pull(cyl) %>% attributes() # Disappears
#> NULL
Created on 2022-07-17 by the reprex package (v2.0.1)
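The difference between `cyl + 2` and `max(cyl)` above can be reproduced in base R without dplyr, which suggests it comes from what each expression returns rather than from mutate() itself. The vector `x` below is made up for illustration and stands in for a labelled column.

```r
# A plain numeric vector carrying a "label" attribute.
x <- structure(c(6, 4, 8), label = "Cylinders")

# Arithmetic copies the attributes of its operand to the result.
attributes(x + 2)
#> $label
#> [1] "Cylinders"

# max() returns a bare length-one value with no attributes.
attributes(max(x))
#> NULL

# If the label should survive, it can be reattached by hand.
y <- max(x)
attr(y, "label") <- attr(x, "label")
attributes(y)
#> $label
#> [1] "Cylinders"
```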
Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.2.1 (2022-06-23 ucrt)
#> os Windows 10 x64 (build 22000)
#> system x86_64, mingw32
#> ui RTerm
#> language (EN)
#> collate nb.utf8
#> ctype nb.utf8
#> tz Europe/Berlin
#> date 2022-07-17
#> pandoc 2.17.1.1 @ C:/Program Files/RStudio/bin/quarto/bin/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.3)
#> cli 3.3.0 2022-04-25 [1] CRAN (R 4.1.3)
#> crayon 1.5.1 2022-03-26 [1] CRAN (R 4.1.3)
#> DBI 1.1.3 2022-06-18 [1] CRAN (R 4.2.0)
#> digest 0.6.29 2021-12-01 [1] CRAN (R 4.1.2)
#> dplyr * 1.0.9 2022-04-28 [1] CRAN (R 4.2.1)
#> ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.0.5)
#> evaluate 0.15 2022-02-18 [1] CRAN (R 4.1.2)
#> fansi 1.0.3 2022-03-24 [1] CRAN (R 4.1.3)
#> forcats 0.5.1 2021-01-27 [1] CRAN (R 4.0.3)
#> fs 1.5.2 2021-12-08 [1] CRAN (R 4.1.2)
#> generics 0.1.3 2022-07-05 [1] CRAN (R 4.2.1)
#> glue 1.6.2 2022-02-24 [1] CRAN (R 4.1.2)
#> haven 2.5.0 2022-04-15 [1] CRAN (R 4.1.3)
#> highr 0.9 2021-04-16 [1] CRAN (R 4.0.5)
#> hms 1.1.1 2021-09-26 [1] CRAN (R 4.1.1)
#> htmltools 0.5.2 2021-08-25 [1] CRAN (R 4.1.1)
#> knitr 1.39 2022-04-26 [1] CRAN (R 4.1.3)
#> labelled * 2.9.1 2022-05-05 [1] CRAN (R 4.2.1)
#> lifecycle 1.0.1 2021-09-24 [1] CRAN (R 4.1.1)
#> magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.1.3)
#> pillar 1.7.0 2022-02-01 [1] CRAN (R 4.1.2)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.3)
#> purrr 0.3.4 2020-04-17 [1] CRAN (R 4.0.3)
#> R.cache 0.15.0 2021-04-30 [1] CRAN (R 4.2.1)
#> R.methodsS3 1.8.2 2022-06-13 [1] CRAN (R 4.2.0)
#> R.oo 1.25.0 2022-06-12 [1] CRAN (R 4.2.0)
#> R.utils 2.12.0 2022-06-28 [1] CRAN (R 4.2.1)
#> R6 2.5.1 2021-08-19 [1] CRAN (R 4.1.1)
#> reprex 2.0.1 2021-08-05 [1] CRAN (R 4.1.0)
#> rlang 1.0.4 2022-07-12 [1] CRAN (R 4.2.1)
#> rmarkdown 2.14 2022-04-25 [1] CRAN (R 4.1.3)
#> rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.0.3)
#> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.1.2)
#> stringi 1.7.8 2022-07-11 [1] CRAN (R 4.2.1)
#> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.0.3)
#> styler 1.7.0 2022-03-13 [1] CRAN (R 4.2.1)
#> tibble 3.1.7 2022-05-03 [1] CRAN (R 4.1.3)
#> tidyselect 1.1.2 2022-02-21 [1] CRAN (R 4.2.1)
#> utf8 1.2.2 2021-07-24 [1] CRAN (R 4.1.0)
#> vctrs 0.4.1 2022-04-13 [1] CRAN (R 4.1.3)
#> withr 2.5.0 2022-03-03 [1] CRAN (R 4.1.2)
#> xfun 0.31 2022-05-10 [1] CRAN (R 4.2.0)
#> yaml 2.3.5 2022-02-21 [1] CRAN (R 4.1.2)
#>
──────────────────────────────────────────────────────────────────────────────
library(dplyr, warn.conflicts = FALSE)
attr(mtcars, "test") <- "foo"
mtcars |>
mutate(new_col = 1) |>
attr("test")
#> [1] "foo"
mtcars |>
mutate(new_col = 1, .by = cyl) |>
attr("test")
#> [1] "foo"
Created on 2022-12-15 with reprex v2.0.2
Since this problem is resolved by .by, I'm going to close this issue. Fixing it via group_by() is surprisingly challenging because we have to forward attributes into a new data structure, whereas .by completely circumvents the problem because there's no intermediate data structure.
Maybe not the biggest deal, but this doesn't solve the problem in the original context I mentioned, where a user passes something grouped across an interface boundary.
As an author it'd be nice to be able to write functions that 'just work' within groups or on the whole dataframe as a single group, but things like this mean you have to add special handling code to cater for attributes and groups in combination, AND you also have to know this obscure problem exists to put that code in.
I think it'll continue to catch a small number of people out.
Unless you go full deprecation on group_by in favour of .by. :stuck_out_tongue_closed_eyes:
Does mutate have enough context to issue a warning it's about to drop the attributes?
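Pending an answer on whether mutate() itself could warn, a caller-side check is one option. The sketch below uses a hypothetical `check_grouped_attrs()` (not a dplyr function) that warns when grouped input carries attributes beyond the ones every grouped data frame has. The list of "standard" attribute names is an assumption and may need extending for other data frame subclasses.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical helper: names of attributes that are not part of the
# basic (grouped) data frame structure.
custom_attrs <- function(.data) {
  setdiff(names(attributes(.data)), c("names", "row.names", "class", "groups"))
}

# Hypothetical check: warn if grouped input has custom attributes that a
# later grouped mutate() might not carry over.
check_grouped_attrs <- function(.data) {
  extra <- custom_attrs(.data)
  if (is_grouped_df(.data) && length(extra) > 0) {
    warning(
      "grouped input has custom attributes that mutate() may drop: ",
      paste(extra, collapse = ", "),
      call. = FALSE
    )
  }
  invisible(.data)
}

grouped <- group_by(mtcars, cyl)
attr(grouped, "test") <- "foo"

custom_attrs(grouped)
#> [1] "test"
```

Calling `check_grouped_attrs(grouped)` at the top of a function then surfaces the problem at the interface boundary instead of several calls further down.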
In general, I think it's risky to rely on random attributes being magically passed along through any operation. We've done our best to support it in most places in dplyr, but for grouped mutates with random attributes, it doesn't feel to me like the benefit is worth the implementation cost, especially given that there's now an alternative available.
I doubt we can deprecate group_by(), but if .by is successful, I think there's a good chance we'll supersede it.