arrow icon indicating copy to clipboard operation
arrow copied to clipboard

ARROW-11699: [R] Implement dplyr::across() for mutate()

Open thisisnic opened this issue 1 year ago • 2 comments

This PR introduces a partial implementation for dplyr::across() when called within dplyr::mutate().

arrow_table(iris) %>%
  mutate(across(starts_with("Sepal"), list(round, sqrt)))
#> Table (query)
#> Sepal.Length: double
#> Sepal.Width: double
#> Petal.Length: double
#> Petal.Width: double
#> Species: dictionary<values=string, indices=int8>
#> Sepal.Length_1: double (round(Sepal.Length, {ndigits=0, round_mode=HALF_TO_EVEN}))
#> Sepal.Length_2: double (sqrt_checked(Sepal.Length))
#> Sepal.Width_1: double (round(Sepal.Width, {ndigits=0, round_mode=HALF_TO_EVEN}))
#> Sepal.Width_2: double (sqrt_checked(Sepal.Width))
#> 
#> See $.data for the source Arrow object

I've opened a number of follow-up tickets for the tasks needed to be done to provide a more complete implementation:

thisisnic avatar Aug 02 '22 22:08 thisisnic

https://issues.apache.org/jira/browse/ARROW-11699

github-actions[bot] avatar Aug 02 '22 22:08 github-actions[bot]

:warning: Ticket has not been started in JIRA, please click 'Start Progress'.

github-actions[bot] avatar Aug 02 '22 22:08 github-actions[bot]

This PR now also implements the .names argument.

thisisnic avatar Aug 23 '22 08:08 thisisnic

This is so close! I'm definitely in favor of deferring filter() and summarise() to another PR (although it will be confusing if they are not implemented before the next release). I really like how you split up the functions in dplyr-across.R...made it easy to review!

The one thing I think this needs is a set of tests directly on expand_across(). I think you can very easily convert the existing tests to something like:

library(rlang, warn.conflicts = FALSE)
library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.

testthat::expect_identical(
  arrow:::expand_across(
    data.frame(dbl = double(), dbl2 = double()),
    rlang::quos(across(c(dbl, dbl2), round))
  ),
  list(
    dbl = quo(round(dbl)),
    dbl2 = quo(round(dbl2))
  )
)

Created on 2022-08-26 by the reprex package (v2.0.1)

(With apologies for not suggesting that in my first review! I think those tests would help to communicate exactly what expand_across() is doing and highlight the fact that you did a really good job separating the implementation into highly testable pieces.)

@paleolimbot Thanks for the review! Quick question for you; I want to keep in the tests which have across() inside of mutate() as it makes it a lot easier to see "how are we matching dplyr's functionality" which I imagine will come up if when there are bug reports about this. Any tips as to which functionality to include in the expand_across() tests vs. the mutate(across(...) tests? Or is this going to be a pain, and I should just swap them all over?

thisisnic avatar Aug 30 '22 12:08 thisisnic

I think you nailed it - the mutate() tests are about matching dplyr's functionality (e.g., they will fail if dplyr changes something and makes this code out of sync); the expand_across() tests are about expand_across(); and we need both.

I would personally just copy/paste all the places where across() appears in your mutate tests but expect_identical(expand_across(quo(across(...)), ...) instead. If you feel this is out of scope for this PR I am OK with that, too (but I think it's worth trying).

paleolimbot avatar Aug 30 '22 13:08 paleolimbot

I think you nailed it - the mutate() tests are about matching dplyr's functionality (e.g., they will fail if dplyr changes something and makes this code out of sync); the expand_across() tests are about expand_across(); and we need both.

I would personally just copy/paste all the places where across() appears in your mutate tests but expect_identical(expand_across(quo(across(...)), ...) instead. If you feel this is out of scope for this PR I am OK with that, too (but I think it's worth trying).

I absolutely agree with the argument that this is a big function and it'd be worth having tests for it, though also wondering if copying and pasting every single test might be a bit excessive - we don't necessarily test intermediate functions in some of our other dplyr verbs (though perhaps we should). I might just check with @nealrichardson or @jonkeane to see if they see any middle ground in what needs to be tested - if there isn't, I'm happy to copy and paste everything and approach it that way.

thisisnic avatar Aug 30 '22 13:08 thisisnic

Haven't looked at the PR lately but, as I understand it, expand_across() takes a list of quosures and returns a list of quosures. So you should be able to unit-test all of its behavior in isolation, and then you want one or two tests in mutate() that it does the right thing in practice (integration tests). Of course, doing everything in mutate() has its conveniences; you could make it easier to write the unit tests with a test helper, something like:

expect_across_equal <- function(actual, expected) {
  expect_identical(expand_across(actual), expected)
}

expect_across_equal(
  quos(lgl, across(is.numeric, ~ . * 5)), 
  quos(lgl, int * 5, dbl * 5)
)

nealrichardson avatar Aug 30 '22 15:08 nealrichardson

@paleolimbot I've updated the tests as per your and Neals' useful suggestions. Mind giving it another review now?

thisisnic avatar Aug 31 '22 08:08 thisisnic