Performance issues with `vec_rbind()`?

Open: mgirlich opened this issue 3 years ago • 1 comment

When binding many one-row tibbles, vec_c() is 20% to 40% faster than vec_rbind(). I would have expected vec_rbind() to be the faster of the two, since row-binding seems to be its main purpose.

library(vctrs)

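# vec_chop() splits mtcars into 32 one-row data frames;
# vec_rep() repeats that list 1e3 and 1e4 times.
row_list1 <- vec_rep(vec_chop(mtcars), 1e3)
row_list10 <- vec_rep(vec_chop(mtcars), 10e3)
# Prototype supplied up front via .ptype in both calls below.
ptype <- vec_ptype(row_list1[[1]])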

bench::mark(
  vec_c1 = vec_c(!!!row_list1, .ptype = ptype),
  vec_rbind1 = vec_rbind(!!!row_list1, .ptype = ptype),
  check = TRUE,
  iterations = 3
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 vec_c1        151ms    217ms      4.70    8.47MB     6.26
#> 2 vec_rbind1    161ms    208ms      4.74    7.49MB     7.90

bench::mark(
  vec_c10 = vec_c(!!!row_list10, .ptype = ptype),
  vec_rbind10 = vec_rbind(!!!row_list10, .ptype = ptype),
  check = TRUE,
  iterations = 3
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#>   expression       min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 vec_c10        1.81s    2.04s     0.507    87.7MB     1.01
#> 2 vec_rbind10    2.65s    2.72s     0.364    71.8MB     1.34

Created on 2021-10-09 by the reprex package (v2.0.1)
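
Reading the medians off the two tables above gives a quick check on the size of the gap. This is only arithmetic on the reported numbers, not a new measurement, and the variable names are illustrative:

```r
# Median timings in seconds, copied from the bench::mark() output above.
median_s <- c(
  vec_c1 = 0.217, vec_rbind1 = 0.208,
  vec_c10 = 2.04, vec_rbind10 = 2.72
)

# Time taken by vec_rbind() relative to vec_c() for each list size.
ratio_1e3 <- median_s[["vec_rbind1"]] / median_s[["vec_c1"]]
ratio_1e4 <- median_s[["vec_rbind10"]] / median_s[["vec_c10"]]

round(c(ratio_1e3, ratio_1e4), 2)
# 0.96 for the shorter list, 1.33 for the longer one
```

On these medians the two functions are about level for the shorter list, and the gap only shows for the longer one, where vec_rbind() takes roughly a third longer. With three iterations and a GC in every iteration, both figures are noisy.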

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 4.1.0 (2021-05-18)
#>  os       macOS Big Sur 10.16         
#>  system   x86_64, darwin17.0          
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_US.UTF-8                 
#>  ctype    en_US.UTF-8                 
#>  tz       UTC                         
#>  date     2021-10-09                  
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version     date       lib source                            
#>  backports     1.2.1       2020-12-09 [1] CRAN (R 4.1.0)                    
#>  bench         1.1.1       2020-01-13 [1] CRAN (R 4.1.0)                    
#>  cli           3.0.1.9000  2021-10-07 [1] Github (r-lib/cli@2808311)        
#>  crayon        1.4.1       2021-02-08 [1] CRAN (R 4.1.0)                    
#>  digest        0.6.28      2021-09-23 [1] CRAN (R 4.1.0)                    
#>  ellipsis      0.3.2       2021-04-29 [1] CRAN (R 4.1.0)                    
#>  evaluate      0.14        2019-05-28 [1] CRAN (R 4.1.0)                    
#>  fansi         0.5.0       2021-05-25 [1] CRAN (R 4.1.0)                    
#>  fastmap       1.1.0       2021-01-25 [1] CRAN (R 4.1.0)                    
#>  fs            1.5.0       2020-07-31 [1] CRAN (R 4.1.0)                    
#>  glue          1.4.2       2020-08-27 [1] CRAN (R 4.1.0)                    
#>  highr         0.9         2021-04-16 [1] CRAN (R 4.1.0)                    
#>  htmltools     0.5.2       2021-08-25 [1] CRAN (R 4.1.0)                    
#>  knitr         1.36        2021-09-29 [1] CRAN (R 4.1.0)                    
#>  lifecycle     1.0.1       2021-09-24 [1] CRAN (R 4.1.0)                    
#>  magrittr      2.0.1       2020-11-17 [1] CRAN (R 4.1.0)                    
#>  pillar        1.6.3       2021-09-26 [1] CRAN (R 4.1.0)                    
#>  pkgconfig     2.0.3       2019-09-22 [1] CRAN (R 4.1.0)                    
#>  profmem       0.6.0       2020-12-13 [1] CRAN (R 4.1.0)                    
#>  purrr         0.3.4       2020-04-17 [1] CRAN (R 4.1.0)                    
#>  R.cache       0.15.0      2021-04-30 [1] CRAN (R 4.1.0)                    
#>  R.methodsS3   1.8.1       2020-08-26 [1] CRAN (R 4.1.0)                    
#>  R.oo          1.24.0      2020-08-26 [1] CRAN (R 4.1.0)                    
#>  R.utils       2.11.0      2021-09-26 [1] CRAN (R 4.1.0)                    
#>  reprex        2.0.1       2021-08-05 [1] CRAN (R 4.1.0)                    
#>  rlang         0.99.0.9000 2021-10-09 [1] Github (r-lib/rlang@d0dee64)      
#>  rmarkdown     2.11        2021-09-14 [1] CRAN (R 4.1.0)                    
#>  rstudioapi    0.13        2020-11-12 [1] CRAN (R 4.1.0)                    
#>  sessioninfo   1.1.1       2018-11-05 [1] CRAN (R 4.1.0)                    
#>  stringi       1.7.5       2021-10-04 [1] CRAN (R 4.1.0)                    
#>  stringr       1.4.0.9000  2021-08-23 [1] Github (tidyverse/stringr@6670a37)
#>  styler        1.6.2       2021-09-23 [1] CRAN (R 4.1.0)                    
#>  tibble        3.1.5       2021-09-30 [1] CRAN (R 4.1.0)                    
#>  utf8          1.2.2       2021-07-24 [1] CRAN (R 4.1.0)                    
#>  vctrs       * 0.3.8.9001  2021-10-09 [1] Github (r-lib/vctrs@199da1a)      
#>  withr         2.4.2       2021-04-18 [1] CRAN (R 4.1.0)                    
#>  xfun          0.26        2021-09-14 [1] CRAN (R 4.1.0)                    
#>  yaml          2.2.1       2020-02-01 [1] CRAN (R 4.1.0)                    
#> 
#> [1] /Library/Frameworks/R.framework/Versions/4.1/Resources/library

mgirlich, Oct 09 '21 06:10

Could this have to do with how names are handled?

For my own use case, I have many one-row tibbles, and I would like to call vec_rbind() internally in a package (cf. https://github.com/wlandau/crew/discussions/123). The package makes sure all the names are already consistent and correct, so I do not need any name checking or name repair. On my machine, the fastest supported name repair option is responsible for 50-60% of the execution time. It would be great to be able to disable name processing completely and cut out the overhead.

packageVersion("data.table")
#> [1] ‘1.14.8’
packageVersion("vctrs")
#> [1] ‘0.6.3’
result <- crew:::monad_tibble(crew::crew_eval(12))
list <- replicate(1e6, result, simplify = FALSE)
system.time(data.table::rbindlist(list, use.names = FALSE))
#>    user  system elapsed 
#>   0.924   0.014   0.940
system.time(vctrs::vec_rbind(list, .name_repair = "universal_quiet"))
#>    user  system elapsed 
#>   1.338   0.061   1.400
proffer::pprof(vctrs::vec_rbind(list, .name_repair = "universal_quiet"))
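
One detail that may be worth checking before reading too much into the profile: the first post splices its list into the call with `!!!`, whereas the calls above pass `list` as a single argument. If the aim is to bind the list elements row-wise, the spliced form is the one to time. A minimal sketch, assuming a hypothetical two-column stand-in for the crew result (the real object is not reproduced here):

```r
library(vctrs)

# Hypothetical stand-in for the one-row result used in the post above.
row <- data.frame(name = "task", seconds = 0.1)
rows <- replicate(1e3, row, simplify = FALSE)

# Splice the list so that each element is a separate input to vec_rbind(),
# mirroring the vec_rbind(!!!row_list1, ...) call in the first post.
spliced <- vec_rbind(!!!rows, .name_repair = "universal_quiet")

dim(spliced)
```

The result here has one row per list element and the two columns of the stand-in. Whether splicing changes the share of time spent in name repair would need to be measured on the real data.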

[Screenshot from 2023-09-20, 3:06 PM; image not available]

wlandau, Sep 20 '23 19:09