dplyr: data.frame attributes are preserved on `mutate()` but dropped on `group_by |> mutate`

Reprex:

library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#>     filter, lag
#> The following objects are masked from 'package:base':
#>
#>     intersect, setdiff, setequal, union
library(purrr)
attr(mtcars, "test") <- "foo"
mtcars_grouped <- group_by(mtcars, cyl)

mtcars |>
  mutate(new_col = 1) |>
  attributes() |>
  pluck("test")
#> [1] "foo"

mtcars_grouped |>
  mutate(new_col = 1) |>
  attributes() |>
  pluck("test")
#> NULL

mtcars_grouped |>
  mutate(across(where(is.complex), as.character)) |>
  attributes() |>
  pluck("test")
#> NULL
# did nothing but still lost attrs on the base data.frame

It would be better if no attributes were lost with mutate. It feels especially weird in the last case, where the mutate doesn't actually change anything.

However if group_by() |> mutate() must drop attributes, it's probably better that mutate does also. You can easily get tricked into thinking some code is going to work, but then it bombs when it accidentally gets passed some sticky groups. This happened to me today, 4 levels of package context up from where the mutate was.

Nov 29 '21 05:11 MilesMcBain

Hi all, I see the same issue with a simple mutate() call, where attributes actually get dropped; see for reference: https://github.com/ellessenne/comorbidity/issues/51 Would this be fixed by https://github.com/tidyverse/dplyr/pull/6102? Thanks!

Alessandro

Mar 02 '22 07:03 ellessenne

I find that mutate also drops attributes whenever there is a function call inside mutate, whereas simple arithmetic works.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(labelled)
mtcars %>% set_variable_labels(cyl="Cylinders") %>% pull(cyl) %>% attributes() # Fine
#> $label
#> [1] "Cylinders"
mtcars %>% set_variable_labels(cyl="Cylinders") %>% mutate(cyl = cyl + 2) %>% pull(cyl) %>% attributes() # Fine
#> $label
#> [1] "Cylinders"
mtcars %>% set_variable_labels(cyl="Cylinders") %>% mutate(cyl = max(cyl)) %>% pull(cyl) %>% attributes() # Disappears
#> NULL
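
A note on the example above: the difference appears to come from base R rather than from `mutate()` itself. Arithmetic operators copy attributes from their operands, while summary functions such as `max()` return a bare value. A minimal sketch, assuming a plain numeric vector `x` (a stand-in for the labelled `cyl` column, not taken from the thread):

```r
# `x` is an illustrative stand-in for a labelled column.
x <- structure(c(6, 4, 8), label = "Cylinders")

# Arithmetic keeps the attributes of its operand.
attributes(x + 2)
#> $label
#> [1] "Cylinders"

# max() returns a plain number, so there is nothing left to carry over.
attributes(max(x))
#> NULL

# One possible workaround: copy the label back by hand.
y <- max(x)
attr(y, "label") <- attr(x, "label")
attributes(y)
#> $label
#> [1] "Cylinders"
```

Under that reading, any expression that builds a fresh vector inside `mutate()` would lose column attributes unless they are restored explicitly.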

Created on 2022-07-17 by the reprex package (v2.0.1)

Session info

sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.1 (2022-06-23 ucrt)
#>  os       Windows 10 x64 (build 22000)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  nb.utf8
#>  ctype    nb.utf8
#>  tz       Europe/Berlin
#>  date     2022-07-17
#>  pandoc   2.17.1.1 @ C:/Program Files/RStudio/bin/quarto/bin/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.0.3)
#>  cli           3.3.0   2022-04-25 [1] CRAN (R 4.1.3)
#>  crayon        1.5.1   2022-03-26 [1] CRAN (R 4.1.3)
#>  DBI           1.1.3   2022-06-18 [1] CRAN (R 4.2.0)
#>  digest        0.6.29  2021-12-01 [1] CRAN (R 4.1.2)
#>  dplyr       * 1.0.9   2022-04-28 [1] CRAN (R 4.2.1)
#>  ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.0.5)
#>  evaluate      0.15    2022-02-18 [1] CRAN (R 4.1.2)
#>  fansi         1.0.3   2022-03-24 [1] CRAN (R 4.1.3)
#>  forcats       0.5.1   2021-01-27 [1] CRAN (R 4.0.3)
#>  fs            1.5.2   2021-12-08 [1] CRAN (R 4.1.2)
#>  generics      0.1.3   2022-07-05 [1] CRAN (R 4.2.1)
#>  glue          1.6.2   2022-02-24 [1] CRAN (R 4.1.2)
#>  haven         2.5.0   2022-04-15 [1] CRAN (R 4.1.3)
#>  highr         0.9     2021-04-16 [1] CRAN (R 4.0.5)
#>  hms           1.1.1   2021-09-26 [1] CRAN (R 4.1.1)
#>  htmltools     0.5.2   2021-08-25 [1] CRAN (R 4.1.1)
#>  knitr         1.39    2022-04-26 [1] CRAN (R 4.1.3)
#>  labelled    * 2.9.1   2022-05-05 [1] CRAN (R 4.2.1)
#>  lifecycle     1.0.1   2021-09-24 [1] CRAN (R 4.1.1)
#>  magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.1.3)
#>  pillar        1.7.0   2022-02-01 [1] CRAN (R 4.1.2)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.0.3)
#>  purrr         0.3.4   2020-04-17 [1] CRAN (R 4.0.3)
#>  R.cache       0.15.0  2021-04-30 [1] CRAN (R 4.2.1)
#>  R.methodsS3   1.8.2   2022-06-13 [1] CRAN (R 4.2.0)
#>  R.oo          1.25.0  2022-06-12 [1] CRAN (R 4.2.0)
#>  R.utils       2.12.0  2022-06-28 [1] CRAN (R 4.2.1)
#>  R6            2.5.1   2021-08-19 [1] CRAN (R 4.1.1)
#>  reprex        2.0.1   2021-08-05 [1] CRAN (R 4.1.0)
#>  rlang         1.0.4   2022-07-12 [1] CRAN (R 4.2.1)
#>  rmarkdown     2.14    2022-04-25 [1] CRAN (R 4.1.3)
#>  rstudioapi    0.13    2020-11-12 [1] CRAN (R 4.0.3)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.1.2)
#>  stringi       1.7.8   2022-07-11 [1] CRAN (R 4.2.1)
#>  stringr       1.4.0   2019-02-10 [1] CRAN (R 4.0.3)
#>  styler        1.7.0   2022-03-13 [1] CRAN (R 4.2.1)
#>  tibble        3.1.7   2022-05-03 [1] CRAN (R 4.1.3)
#>  tidyselect    1.1.2   2022-02-21 [1] CRAN (R 4.2.1)
#>  utf8          1.2.2   2021-07-24 [1] CRAN (R 4.1.0)
#>  vctrs         0.4.1   2022-04-13 [1] CRAN (R 4.1.3)
#>  withr         2.5.0   2022-03-03 [1] CRAN (R 4.1.2)
#>  xfun          0.31    2022-05-10 [1] CRAN (R 4.2.0)
#>  yaml          2.3.5   2022-02-21 [1] CRAN (R 4.1.2)
#> 
#> ──────────────────────────────────────────────────────────────────────────────

Jul 17 '22 12:07 sda030

library(dplyr, warn.conflicts = FALSE)
attr(mtcars, "test") <- "foo"

mtcars |>
  mutate(new_col = 1) |>
  attr("test")
#> [1] "foo"

mtcars |>
  mutate(new_col = 1, .by = cyl) |>
  attr("test")
#> [1] "foo"

Created on 2022-12-15 with reprex v2.0.2

Dec 15 '22 20:12 hadley

Since this problem is resolved by .by, I'm going to close this issue. Fixing it via group_by() is surprisingly challenging because we have to forward attributes into a new data structure, whereas .by completely circumvents the problem because there's no intermediate data structure.

Dec 15 '22 20:12 hadley

Maybe not the biggest deal, but this doesn't solve the problem in the original context I mentioned, where a user passes something grouped across an interface boundary.

As an author it'd be nice to be able to write functions that 'just work' within groups or on the whole dataframe as a single group, but things like this mean you have to add special handling code to cater for attributes and groups in combination, AND you also have to know this obscure problem exists to put that code in.
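
For illustration, the kind of special handling code described here might look like the sketch below. `with_custom_attrs()` is a hypothetical helper, not part of dplyr, and the list of "managed" attribute names is an assumption about what data frames, tibbles and grouped data frames maintain themselves.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical helper (not part of dplyr): run `f` on `data`, then copy back
# any custom attributes that went missing along the way.
with_custom_attrs <- function(data, f, ...) {
  # Assumed set of attributes that the data frame classes manage themselves.
  managed <- c("names", "row.names", "class", "groups")
  custom <- attributes(data)
  custom <- custom[setdiff(names(custom), managed)]

  out <- f(data, ...)

  # Restore only what is missing, so attributes set on purpose by `f` win.
  for (nm in setdiff(names(custom), names(attributes(out)))) {
    attr(out, nm) <- custom[[nm]]
  }
  out
}

# Simulate grouped input arriving across an interface boundary.
grouped <- group_by(mtcars, cyl)
attr(grouped, "test") <- "foo"

res <- with_custom_attrs(grouped, \(d) mutate(d, new_col = 1))

attr(res, "test")
#> [1] "foo"
is_grouped_df(res)
#> [1] TRUE
```

The helper works the same way on grouped and ungrouped input, but the author still has to know the problem exists in order to reach for it.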

I think it'll continue to catch a small number of people out.

Unless you go full deprecation on group_by in favour of .by. :stuck_out_tongue_closed_eyes:

Dec 15 '22 22:12 MilesMcBain

Does mutate have enough context to issue a warning it's about to drop the attributes?
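
Whether `mutate()` could do this internally is a question for the maintainers. As a userland approximation, a wrapper can compare attributes before and after the call and warn about anything that vanished. This is a sketch only: `custom_attr_names()` and `mutate_warn_attrs()` are hypothetical names, and the check runs after the fact rather than before the drop.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical helper: attribute names not managed by the data frame classes
# (the excluded names are an assumption, as is the helper itself).
custom_attr_names <- function(data) {
  setdiff(names(attributes(data)), c("names", "row.names", "class", "groups"))
}

# Hypothetical wrapper: call mutate(), then warn if custom attributes are gone.
mutate_warn_attrs <- function(data, ...) {
  out <- mutate(data, ...)
  lost <- setdiff(custom_attr_names(data), custom_attr_names(out))
  if (length(lost) > 0) {
    warning(
      "mutate() dropped attributes: ", paste(lost, collapse = ", "),
      call. = FALSE
    )
  }
  out
}

grouped <- group_by(mtcars, cyl)
attr(grouped, "test") <- "foo"

# Whether a warning is raised here depends on the installed dplyr version.
out <- mutate_warn_attrs(grouped, new_col = 1)
```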

Dec 15 '22 22:12 MilesMcBain

In general, I think it's risky to rely on random attributes being magically passed along through any operation. We've done our best to support it in most places in dplyr, but for grouped mutates with random attributes, it doesn't feel to me like the benefit is worth the implementation cost, especially given that there's now an alternative available.

I doubt we can deprecate group_by(), but if .by is successful, I think there's a good chance we'll supersede it.

Dec 15 '22 23:12 hadley
