tidytable
tidytable copied to clipboard
`summarise.()` do not support `data.frame` output when `.by` is specified
See the following example:
library(tidytable)
#>
#> Attaching package: 'tidytable'
#> The following object is masked from 'package:stats':
#>
#> dt
#> The following object is masked from 'package:base':
#>
#> %in%
dat <- tidytable(g = rep(c("A", "B"), 100)) |>
mutate.(val = rnorm(n.()))
fun <- function(x) {
data.frame(mean = mean(x), max = max(x))
}
dat |>
summarise.(fun(val), .by = g)
#> Error in `[.data.table`(~.df, , .(fun(val)), keyby = "g"): All items in j=list(...) should be atomic vectors or lists. If you are trying something like j=list(.SD,newcol=mean(colA)) then use := by group instead (much quicker), or cbind or merge afterwards.
dat |>
dplyr::group_by(g) |>
dplyr::summarise(fun(val), .groups = "drop")
#> # A tibble: 2 × 3
#> g mean max
#> <chr> <dbl> <dbl>
#> 1 A -0.00809 2.73
#> 2 B 0.230 2.47
Created on 2022-08-29 with reprex v2.0.2
I'm a bit surprised this works without .by
to be honest. I'll take a look and see what I can figure out.
@psychelzh unfortunately I won't be able to implement this in tidytable
. I can get it to work pretty easily by utilizing vctrs::df_list()
inside of data.table's summarize. However there is a pretty large performance loss when it is implemented.
Here are the benchmarks:
pacman::p_load(tidytable, data.table, vctrs)
setDTthreads(4)
set.seed(123)
random_strings <- c('OoVt', 'wCbu', 'cXxX', 'jdFu', 'MCRx',
'ukhz', 'ikce', 'PHyu', 'jpBY', 'nLQM')
data_size <- 10000000
df <- tidytable(a = as.double(sample(1:20, data_size, TRUE)),
b = as.double(sample(1:20, data_size, TRUE)),
c = sample(random_strings, data_size, TRUE),
d = sample(random_strings, data_size, TRUE))
bench::mark(
vctrs_df_list = df[, vctrs::df_list(avg_a = mean(a), med_b = median(b)), by = .(c, d)],
normal_list = df[, list(avg_a = mean(a), med_b = median(b)), by = .(c, d)],
check = FALSE, iterations = 11
) %>%
select(expression, min, median)
#> # A tidytable: 2 × 3
#> expression min median
#> <bch:expr> <bch:tm> <bch:tm>
#> 1 vctrs_df_list 675ms 681ms
#> 2 normal_list 534ms 550ms
As I dig into this more - it looks like data.table
's GForce optimizations are turned off if you don't use list()
.
Got it! How about implement it anyway, but document its possible performance loss?
Hmm I'm not sure I want to add a feature that could cause this type of performance loss. Especially since it's an edge case feature.
Maybe I could add an argument that allows you to opt-in to this? It would be different than dplyr, but at least the functionality would exist. Something like .unpack = TRUE
?
dat |>
summarise.(fun(val), .by = g, .unpack = TRUE)
It is reasonable. Opt-in is what I meant by my last comment.
All set.
library(tidytable, warn.conflicts = FALSE)
#> As of tidytable v0.9.0 dotless versions of functions are exported.
#> You can now use `arrange()`/`mutate()`/etc. directly.
df <- tidytable(group = c("a", "a", "b"), val = 1:3)
fun <- function(x) {
data.frame(mean = mean(x), max = max(x))
}
df %>%
summarise(fun(val), .by = group, .unpack = TRUE)
#> # A tidytable: 2 × 3
#> group mean max
#> <chr> <dbl> <int>
#> 1 a 1.5 2
#> 2 b 3 3