waldo icon indicating copy to clipboard operation
waldo copied to clipboard

Comparing large integer vectors slow

Open ChristopherEeles opened this issue 2 years ago • 2 comments

Problem

Comparing large (>1e6) integer vectors can be prohibitively slow after shuffling one vector.

RE: #116 I tried setting options(diffobj.max.diffs = 50), but this didn't help.

Reprex

v1 <- rep(1:50L, 1e6L / 50L)
v2 <- rep(1:50L, 1e6L / 50L)

v2 <- v2[sample.int(length(v2)]

bench::system_time(
    waldo::compare(v1, v2)
)
# process    real 
# 4.87m   4.87m 

Use Case

Integer indexes are commonly used in data.table for RDBMS style relational modelling. As a result of this slow comparison for integer vectors, comparing such data.tables is unreasonably slow (~5 minutes * number of columns).

This prevents adoption of testthat >= 3.0 in packages using largeish tables. (EDIT: I guess this isn't technically true, since examples in tests should be small, but it would be nice to be able to compare large tables).

The problem also exists for base::data.frame but since joins are so much slower the use case isn't as compelling.

Environment

R Under development (unstable) (2021-12-19 r81394)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.4 LTS

Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] fansi_1.0.3     utf8_1.2.2      crayon_1.5.1    waldo_0.4.0    
 [5] lifecycle_1.0.1 magrittr_2.0.3  pillar_1.7.0    bench_1.1.2    
 [9] rlang_1.0.2     cli_3.2.0       diffobj_0.3.5   vctrs_0.4.0    
[13] ellipsis_0.3.2  rematch2_2.1.2  tools_4.2.0     glue_1.6.2     
[17] compiler_4.2.0  pkgconfig_2.0.3 tibble_3.1.6   

Note: Ubuntu is running in WSL2, but I also tested the code in R 4.1.3 on Windows 11 and the issue still exists.

ChristopherEeles avatar Apr 07 '22 22:04 ChristopherEeles

it's probably preferable to delegate to all.equal for cases like this -- not sure how general a solution waldo can be expected to handle.

expect_true(all.equal(x, y, ...))

will leverage data.table's method & efficiency

MichaelChirico avatar Apr 07 '22 23:04 MichaelChirico

@MichaelChirico waldo already uses identical() to speedily detect the simple case where both elements are equal.

I reduced the problem to a smaller example which only takes around 4s:

v1 <- rep(1:50L, 1e5L / 50L)
v2 <- rep(1:50L, 1e5L / 50L)
v2 <- v2[sample.int(length(v2))]
waldo::compare(v1, v2)

Interestingly the bottleneck problem is not the comparison (which only takes 0.4s) but building up the the display of the differences, which suggests that the max.diffs argument needs to be applied earlier.

I think this mostly reflects an edge case that is mostly unimportant for testing — you need to have both a very large vector and a very large number of differences:

v1 <- v2 <- rep(1:50L, 1e6L / 50L)
v2[1:1e4] <- v2[sample.int(1:1e4)]

bench::system_time(
  waldo::compare(v1, v2)
)
#> process    real 
#>   1.84s   1.84s

Created on 2022-04-08 by the reprex package (v2.0.1)

hadley avatar Apr 08 '22 11:04 hadley