waldo
waldo copied to clipboard
Comparing large integer vectors slow
Problem
Comparing large (>1e6) integer vectors can be prohibitively slow after shuffling one vector.
RE: #116 I tried setting options(diffobj.max.diffs = 50)
, but this didn't help.
Reprex
v1 <- rep(1:50L, 1e6L / 50L)
v2 <- rep(1:50L, 1e6L / 50L)
v2 <- v2[sample.int(length(v2)]
bench::system_time(
waldo::compare(v1, v2)
)
# process real
# 4.87m 4.87m
Use Case
Integer indexes are commonly used in data.table
for RDBMS style relational modelling. As a result of this slow comparison for integer vectors, comparing such data.table
s is unreasonably slow (~5 minutes * number of columns).
This prevents adoption of testthat
>= 3.0 in packages using largeish tables. (EDIT: I guess this isn't technically true, since examples in tests should be small, but it would be nice to be able to compare large tables).
The problem also exists for base::data.frame
but since joins are so much slower the use case isn't as compelling.
Environment
R Under development (unstable) (2021-12-19 r81394)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.4 LTS
Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] fansi_1.0.3 utf8_1.2.2 crayon_1.5.1 waldo_0.4.0
[5] lifecycle_1.0.1 magrittr_2.0.3 pillar_1.7.0 bench_1.1.2
[9] rlang_1.0.2 cli_3.2.0 diffobj_0.3.5 vctrs_0.4.0
[13] ellipsis_0.3.2 rematch2_2.1.2 tools_4.2.0 glue_1.6.2
[17] compiler_4.2.0 pkgconfig_2.0.3 tibble_3.1.6
Note: Ubuntu is running in WSL2, but I also tested the code in R 4.1.3 on Windows 11 and the issue still exists.
it's probably preferable to delegate to all.equal for cases like this -- not sure how general a solution waldo can be expected to handle.
expect_true(all.equal(x, y, ...))
will leverage data.table's method & efficiency
@MichaelChirico waldo already uses identical()
to speedily detect the simple case where both elements are equal.
I reduced the problem to a smaller example which only takes around 4s:
v1 <- rep(1:50L, 1e5L / 50L)
v2 <- rep(1:50L, 1e5L / 50L)
v2 <- v2[sample.int(length(v2))]
waldo::compare(v1, v2)
Interestingly the bottleneck problem is not the comparison (which only takes 0.4s) but building up the the display of the differences, which suggests that the max.diffs
argument needs to be applied earlier.
I think this mostly reflects an edge case that is mostly unimportant for testing — you need to have both a very large vector and a very large number of differences:
v1 <- v2 <- rep(1:50L, 1e6L / 50L)
v2[1:1e4] <- v2[sample.int(1:1e4)]
bench::system_time(
waldo::compare(v1, v2)
)
#> process real
#> 1.84s 1.84s
Created on 2022-04-08 by the reprex package (v2.0.1)