purrr icon indicating copy to clipboard operation
purrr copied to clipboard

a tidy equivalent to `ave()` ?

Open moodymudskipper opened this issue 5 years ago • 5 comments

ave() is the base functional I use the most if I attach tidyverse, but it has several issues :

  • Its name, and default, are awkward, I was a bit intimidated of this function as a beginner
  • The doc is misleading, and only numeric input AND output are documented, so technically we "shouldn't" use it on characters and to return character for instance
  • The FUN argument must always be spelt explicitly as it comes after the dots
  • We can't use purrr lambdas
  • We can't pass additional arguments to dots, so many times we have to type a soul crushing (ok, maybe dramatizing a bit!) FUN = function(x) mean(x, na.rm = TRUE), when we'd really rather type mean, na.rm = TRUE
  • ave() can't output lists
  • Type stability (especially as ave() sometimes returns a "wrong type")
  • It could play well with vctrs
  • peformance could be improved it seems, given that Hadley's remarks from 2010 have not been considered as far as I can see

It might have been considered but since I found no relevant issues here it is.

Here's rough implementation (not type stable and not very efficient), to get my point across.

map_along <- function(.x, .along, .f, ...) {
  if (is.list(.along))
    .along <- interaction(.along) # as used by base::split.default
  .f <- rlang::as_function(.f)
  body <- as.call(c(quote(.f), quote(.x), eval(substitute(alist(...)))))
  fun <- rlang::new_function(alist(.x=), body)
  ave(.x, .along, FUN = fun)
}
map_along(c(1:4,NA),c(1,1,2,2,2), mean)
#> [1] 1.5 1.5  NA  NA  NA
map_along(c(1:4,NA),c(1,1,2,2,2), mean, na.rm = TRUE)
#> [1] 1.5 1.5 3.5 3.5 3.5
map_along(c(1:4,NA),c(1,1,2,2,2), ~mean(., na.rm = TRUE))
#> [1] 1.5 1.5 3.5 3.5 3.5

Created on 2020-01-13 by the reprex package (v0.3.0)

A neat additional feature would be to have a magical function or pronoun to access the relevant instance of .along in the loop, ave doesn't support this and I missed it a few times (helps with debugging too).

moodymudskipper avatar Jan 13 '20 11:01 moodymudskipper

probably something like this pattern (from experience with slide)

library(purrr)
library(vctrs)

attach(warpbreaks)

grouped_map_vec_impl <- function(.x, .g, .f, ..., .ptype, .constrain) {
  .f <- as_mapper(.f, ...)
  
  x_size <- vec_size(.x)
  
  if (x_size != vec_size(.g)) {
    stop("The size of `.x` and `.g` must be the same")
  }
  
  out <- vec_init(.ptype, n = x_size)
  
  info <- vec_group_loc(.g)
  locs <- info$loc
  
  slices <- vec_chop(.x, locs)
  n_slices <- vec_size(slices)
  
  for (i in seq_len(n_slices)) {
    loc <- locs[[i]]
    slice <- slices[[i]]
    
    res <- .f(slice, ...)
    
    if (.constrain) {
      if (vec_size(res) != 1L) {
        stop("All elements must be size 1")
      }
    } else {
      res <- list(res)
    }
    
    vec_slice(out, loc) <- res
  }
  
  out
}

# like map()
grouped_map <- function(.x, .g, .f, ...) {
  grouped_map_vec_impl(.x, .g, .f, ..., .ptype = list(), .constrain = FALSE)
}

grouped_map_vec_simplify <- function(.x, .g, .f, ...) {
  out <- grouped_map(.x, .g, .f, ...)
  
  # check all size 1
  if (!all(map_int(out, vec_size) == 1L)) {
    stop("All elements must be size 1")
  }
  
  vec_c(!!! out)
}

# like map_vec()
# - base for map_dbl() and friends
grouped_map_vec <- function(.x, .g, .f, ..., .ptype = NULL) {
  if (is.null(.ptype)) {
    out <- grouped_map_vec_simplify(.x, .g, .f, ...)
    return(out)
  }
  
  grouped_map_vec_impl(.x, .g, .f, ..., .ptype = .ptype, .constrain = TRUE)
}

grouped_map_dbl <- function(.x, .g, .f, ...) {
  grouped_map_vec(.x, .g, .f, ..., .ptype = double())
}
# like ave()
identical(
  grouped_map_dbl(breaks, wool, ~mean(.x)),
  ave(breaks, wool, FUN = mean)
)
#> [1] TRUE

# use a data frame for multiple groups
# this frees up the `...` for use with `.f`
identical(
  grouped_map_dbl(breaks, data.frame(wool = wool, tension = tension), mean, na.rm = TRUE),
  ave(breaks, wool, tension, FUN = mean)
)
#> [1] TRUE


# safe auto simplify with .ptype = NULL
grouped_map_vec(breaks, wool, mean)
#>  [1] 31.03704 31.03704 31.03704 31.03704 31.03704 31.03704 31.03704 31.03704
#>  [9] 31.03704 31.03704 31.03704 31.03704 31.03704 31.03704 31.03704 31.03704
#> [17] 31.03704 31.03704 31.03704 31.03704 31.03704 31.03704 31.03704 31.03704
#> [25] 31.03704 31.03704 31.03704 25.25926 25.25926 25.25926 25.25926 25.25926
#> [33] 25.25926 25.25926 25.25926 25.25926 25.25926 25.25926 25.25926 25.25926
#> [41] 25.25926 25.25926 25.25926 25.25926 25.25926 25.25926 25.25926 25.25926
#> [49] 25.25926 25.25926 25.25926 25.25926 25.25926 25.25926

grouped_map_vec(breaks, wool, class)
#>  [1] "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
#>  [8] "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
#> [15] "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
#> [22] "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
#> [29] "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
#> [36] "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
#> [43] "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
#> [50] "numeric" "numeric" "numeric" "numeric" "numeric"


# allows non-size 1 results for each call to `.f`
grouped_map(breaks, wool, ~c(mean(.x), median(.x)))[25:30]
#> [[1]]
#> [1] 31.03704 26.00000
#> 
#> [[2]]
#> [1] 31.03704 26.00000
#> 
#> [[3]]
#> [1] 31.03704 26.00000
#> 
#> [[4]]
#> [1] 25.25926 24.00000
#> 
#> [[5]]
#> [1] 25.25926 24.00000
#> 
#> [[6]]
#> [1] 25.25926 24.00000

# does not allow non-size 1 results
grouped_map_vec(breaks, wool, ~c(mean(.x), median(.x)), .ptype = double())
#> Error in grouped_map_vec_impl(.x, .g, .f, ..., .ptype = .ptype, .constrain = TRUE): All elements must be size 1

grouped_map_vec(breaks, wool, ~c(mean(.x), median(.x)), .ptype = NULL)
#> Error in grouped_map_vec_simplify(.x, .g, .f, ...): All elements must be size 1

grouped_map(1, 1:2, mean)
#> Error in grouped_map_vec_impl(.x, .g, .f, ..., .ptype = list(), .constrain = FALSE): The size of `.x` and `.g` must be the same

DavisVaughan avatar Jan 20 '20 17:01 DavisVaughan

Different from https://github.com/r-lib/funs/issues/42#issue-530151360 because the output size here is vec_size(.x) and the output size there is vec_size(.g)

DavisVaughan avatar Jan 20 '20 17:01 DavisVaughan

Fantastic! It would go well with a tapply equivalent, which might be closer to hadley's function you linked above.

moodymudskipper avatar Jan 20 '20 18:01 moodymudskipper

@moodymudskipper FYI this is an incorrect pattern:

body <- as.call(c(quote(.f), quote(.x), eval(substitute(alist(...)))))

  1. It evaluates dots in the wrong environment.
  2. It evaluates dots repeatedly on repeated invokations.

In general there is no correct way to quote dots expressions than with enquos(). But I would only make quosures when data masking is needed, and the quoting causes issue (2) to happen.

Here is how to solve these issues:

.f <- rlang::as_function(.f)
fn <- function(x) .f(x, ...)

The ... are looked up lexically using normal rules of evaluation and thus will only be evaluated once (results are stored in the promises).

lionel- avatar Aug 05 '20 11:08 lionel-

Thanks Lionel! I greatly appreciate you taking time to educate me on this. I never realised ... were looked up lexically like other variables.

moodymudskipper avatar Aug 05 '20 12:08 moodymudskipper

This just doesn't feel like a purrr function to me — it feels more something that belongs in vctrs or funs?

hadley avatar Aug 24 '22 14:08 hadley