
Quadratic memory consumption in `vec_unchop()` for `list_of` prototype

Open · mgirlich opened this issue 4 years ago · 1 comment

I was wondering why the new implementation of `unnest_wider()` is so slow and memory-hungry for tibble columns. It turned out to be an issue with `vec_unchop()`:

library(vctrs)


make_list_of <- function(n) {
  df <- tibble::tibble(
    x = new_list_of(vec_chop(1:n), ptype = integer())
  )
  vec_chop(df)
}

df_list1 <- make_list_of(1e3)
df_list2 <- make_list_of(2e3)
df_list4 <- make_list_of(4e3)
df_list8 <- make_list_of(8e3)

ptype <- vec_ptype(df_list1[[1]])

bench::mark(
  df1 = vec_unchop(df_list1, ptype = ptype),
  df2 = vec_unchop(df_list2, ptype = ptype),
  df4 = vec_unchop(df_list4, ptype = ptype),
  df8 = vec_unchop(df_list8, ptype = ptype),
  check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 4 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 df1         60.76ms  64.61ms    15.3      15.4MB     21.0
#> 2 df2        149.47ms 156.88ms     6.42     61.3MB     17.7
#> 3 df4           428ms 452.35ms     2.21    244.6MB     12.2
#> 4 df8           1.32s    1.32s     0.758   977.4MB     17.4

Created on 2021-11-16 by the reprex package (v2.0.1)
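
To make the "quadratic" claim concrete, here is a small sketch that only re-uses the `mem_alloc` figures reported above (typed in by hand, not re-measured). It checks how much memory grows each time `n` doubles and estimates the exponent from a log-log fit. The variable names are illustrative.

```r
# mem_alloc values (in MB) copied by hand from the benchmark output above
n <- c(1e3, 2e3, 4e3, 8e3)
mem_mb <- c(15.4, 61.3, 244.6, 977.4)

# Growth factor each time n doubles:
# a factor near 2 would suggest linear growth, near 4 quadratic growth
growth <- mem_mb[-1] / mem_mb[-length(mem_mb)]
round(growth, 2)

# The slope of log(mem) against log(n) estimates the exponent k in mem ~ n^k
exponent <- unname(coef(lm(log(mem_mb) ~ log(n)))[2])
round(exponent, 2)
```

With these four data points the growth factors come out close to 4 and the fitted exponent close to 2, which is what quadratic memory consumption would look like.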

mgirlich · Nov 16 '21 07:11

I think this is mainly because df-assign has to proxy and restore the output container at every iteration, so recursive proxy/restore would really help here: https://github.com/r-lib/vctrs/issues/1107
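
As a rough illustration of why per-iteration proxy/restore would scale quadratically, here is a toy cost model. It is not how vctrs is implemented: `restore_cost()` is a hypothetical stand-in, and it assumes that proxying and restoring a container of `n` elements costs work proportional to `n`.

```r
# Toy cost model only; this is NOT vctrs internals.
# `restore_cost(n)` is a hypothetical stand-in for the work needed to
# proxy and restore an output container holding n elements (assumed O(n)).
restore_cost <- function(n) n

# Proxy/restore after every one of the n assignments: n * O(n) work in total
cost_restore_each_iteration <- function(n) {
  total <- 0
  for (i in seq_len(n)) {
    total <- total + restore_cost(n)
  }
  total
}

# Proxy once, do n unit-cost assignments, restore once: O(n) work in total
cost_restore_once <- function(n) {
  n + restore_cost(n)
}

n <- c(1e3, 2e3, 4e3, 8e3)
each <- vapply(n, cost_restore_each_iteration, numeric(1))
once <- vapply(n, cost_restore_once, numeric(1))

# Growth factor each time n doubles
each[-1] / each[-length(each)]  # 4 4 4
once[-1] / once[-length(once)]  # 2 2 2
```

Under that assumption, restoring at every iteration quadruples the work when `n` doubles, while a single proxy/restore around the whole loop only doubles it.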

Compare against just combining list-ofs, with no data frame involved:

library(vctrs)

make_list_of <- function(n) {
  new_list_of(as.list(1:n), ptype = integer())
}

df_list1 <- make_list_of(1e3)
df_list2 <- make_list_of(2e3)
df_list4 <- make_list_of(4e3)
df_list8 <- make_list_of(8e3)

ptype <- vec_ptype(df_list1[[1]])

bench::mark(
  df1 = vec_unchop(df_list1, ptype = ptype),
  df2 = vec_unchop(df_list2, ptype = ptype),
  df4 = vec_unchop(df_list4, ptype = ptype),
  df8 = vec_unchop(df_list8, ptype = ptype),
  check = FALSE
)
#> # A tibble: 4 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 df1        184.23µs 235.76µs     4172.    15.4KB     12.3
#> 2 df2        442.06µs 468.91µs     2115.    15.7KB     14.6
#> 3 df4        882.36µs 928.21µs     1071.    31.3KB     12.4
#> 4 df8          1.77ms   1.89ms      512.    62.6KB     14.9

Created on 2022-02-15 by the reprex package (v2.0.1)

DavisVaughan · Feb 15 '22 22:02