Decide when parallelization takes place in cpp version
add_n_miss and add_prop_miss now have a .parallel argument, which was promoted up to R from the C++ side.
Eventually, I'd like to push it back down to the C++ side, but for now it gives us an opportunity to tune when the parallel version is used (the default is to use parallel when there are more than 10000 rows).
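Roughly, the dispatch on the R side looks like the sketch below. This is illustrative only: the stand-in counting functions are pure R with hypothetical names, while in the package the actual work happens in C++.
# Hypothetical pure-R stand-ins for the serial and parallel C++ workers
n_miss_row_serial   <- function(data) rowSums(is.na(data))
n_miss_row_parallel <- function(data) rowSums(is.na(data))

# Sketch of the dispatch rule: default to parallel above 10000 rows
add_n_miss_sketch <- function(data, .parallel = nrow(data) > 10000) {
  counter <- if (.parallel) n_miss_row_parallel else n_miss_row_serial
  dplyr::mutate(data, n_miss = counter(data))
}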
Some preliminary benchmarks:
library(microbenchmark)
library(dplyr)
library(naniar)

bench <- function(n = 10000, data = airquality) {
  d <- slice(data, sample(seq_len(nrow(data)), n, replace = TRUE))
  microbenchmark(
    par = add_n_miss(d, .parallel = TRUE),
    serial = add_n_miss(d, .parallel = FALSE)
  )
}
# not worth using parallel here
> bench(1000)
Unit: microseconds
expr min lq mean median uq max neval cld
par 50.063 58.5810 63.90178 61.3890 65.7140 171.812 100 b
serial 32.887 35.1995 39.28407 36.3645 37.4725 203.307 100 a
# break even between cost of parallel and performance gain
> bench(10000)
Unit: microseconds
expr min lq mean median uq max neval cld
par 70.867 86.7455 104.65175 102.6900 110.7485 259.294 100 b
serial 68.235 91.1535 94.73477 93.2545 94.3320 139.636 100 a
# then taking advantage of it
> bench(1e6)
Unit: milliseconds
expr min lq mean median uq max neval cld
par 2.323986 2.882439 5.809276 3.341341 4.243813 182.4460 100 a
serial 4.916264 6.102976 9.398288 6.766492 8.638943 195.2889 100 a
> bench(1e7)
Unit: milliseconds
expr min lq mean median uq max neval cld
par 28.24776 32.28037 42.57060 34.03642 45.83196 241.4553 100 a
serial 67.69803 75.96788 86.21276 79.41009 93.36936 284.4946 100 b
@njtierney the rule should not be too complex.
Awesome! :)
I'll let you play with this (perhaps varying the number of columns as well), but it looks like my initial 10000 was not too bad.
Not bad at all!
I've just run this with a function that generates a random matrix with a specified number of rows and columns.
First, I replicate your results on my machine.
library(microbenchmark)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(naniar)
bench <- function(n = 10000, data = airquality) {
  d <- slice(data, sample(seq_len(nrow(data)), n, replace = TRUE))
  microbenchmark(
    par = add_n_miss(d, .parallel = TRUE),
    serial = add_n_miss(d, .parallel = FALSE)
  )
}
bench(1000)
#> Unit: microseconds
#> expr min lq mean median uq max neval cld
#> par 52.942 57.9025 65.76265 61.558 65.9450 311.301 100 b
#> serial 29.481 31.1645 42.74648 33.295 34.9585 896.108 100 a
bench(10000)
#> Unit: microseconds
#> expr min lq mean median uq max neval cld
#> par 76.999 88.2325 100.4541 95.0185 104.3805 208.656 100 a
#> serial 82.775 90.5545 104.1372 105.4470 108.2400 205.606 100 a
bench(1e6)
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> par 1.661647 2.040611 5.532474 2.602328 4.120704 127.64148 100 a
#> serial 5.201891 5.940883 8.247077 7.173730 9.186424 22.58627 100 a
bench(1e7)
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> par 16.74651 19.26304 28.79824 21.69526 32.25247 183.3326 100 a
#> serial 64.51201 71.60433 86.03827 83.04596 91.60811 248.2516 100 b
So far so good.
Next, I make a function to generate data with a given number of rows and columns, and with missing values:
bench_rc <- function(col_size, row_size) {
  n_elements <- col_size * row_size
  raw <- rnorm(n_elements)
  # set about 30% of values to NA
  raw[sample(x = 1:n_elements, size = floor(n_elements / 3))] <- NA
  # row_size gives the number of rows, col_size the number of columns
  d <- matrix(data = raw,
              nrow = row_size,
              ncol = col_size) %>%
    tibble::as_tibble()
  microbenchmark(
    par = add_n_miss(d, .parallel = TRUE),
    serial = add_n_miss(d, .parallel = FALSE)
  )
}
bench_rc(col_size = 100,
row_size = 100)
#> Unit: microseconds
#> expr min lq mean median uq max neval cld
#> par 106.446 128.939 156.3139 139.728 172.5495 372.372 100 b
#> serial 95.549 114.790 127.0468 121.754 133.4420 222.849 100 a
bench_rc(col_size = 100,
row_size = 1000)
#> Unit: microseconds
#> expr min lq mean median uq max neval cld
#> par 843.265 1148.941 1327.127 1306.700 1467.245 2039.393 100 b
#> serial 756.137 862.329 1012.803 1005.898 1090.797 1845.178 100 a
bench_rc(col_size = 1000,
row_size = 100)
#> Unit: microseconds
#> expr min lq mean median uq max neval cld
#> par 369.979 507.6235 640.2205 608.1715 735.2590 1373.724 100 a
#> serial 660.678 709.8110 808.2027 772.8890 862.1445 1495.670 100 b
So for 1000 rows and 100 columns, serial is faster.
But for 100 rows and 1000 columns, parallel is faster.
This pattern carries through at larger sizes.
bench_rc(col_size = 1000,
row_size = 1000)
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> par 4.482235 5.430947 9.121563 6.558558 8.801652 41.68146 100 a
#> serial 6.632624 6.914668 7.993161 7.309024 7.968745 15.56544 100 a
bench_rc(col_size = 10000,
row_size = 1000)
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> par 19.86186 24.51354 33.79760 26.56344 31.87995 151.3237 100 a
#> serial 53.86446 63.93033 70.24344 68.46899 73.44370 108.0184 100 b
bench_rc(col_size = 1000,
row_size = 10000)
#> Unit: milliseconds
#>    expr      min        lq      mean    median        uq       max neval cld
#>     par 88.05169 118.87281 152.01282 126.62703 148.66614 418.81963   100   b
#>  serial 57.69362  69.94338  73.37348  72.58512  76.08338  96.75694   100   a
But it seems like there is a bit more to it than just the total element count...
Still, interesting! More to do.
Romain, do you think this behaviour makes sense?
I think what I need to do actually is set up a simulation study that goes through the number of columns and rows and compares them.
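Something like this rough sketch, reusing bench_rc() from above (the grid values here are just examples):
# Hypothetical simulation grid over row/column sizes; for each
# combination, run bench_rc() and keep the median timings.
library(purrr)

sim_grid <- expand.grid(
  col_size = c(10, 100, 1000),
  row_size = c(10, 100, 1000)
)

sim_results <- pmap_dfr(sim_grid, function(col_size, row_size) {
  bm <- bench_rc(col_size = col_size, row_size = row_size)
  summary(bm)[, c("expr", "median")] %>%
    mutate(col_size = col_size, row_size = row_size)
})
Then sim_results can be plotted to see where par overtakes serial.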
I need to revisit based on https://romain.rbind.io/blog/2017/10/24/fast-counting-na/ and the follow-up discussion on Twitter (that only affects numeric vectors though).
All bets are off when parallel code is involved; this is likely to be memory bound.
OK cool!
By "all bets are off", do you mean that we should probably just pick a set size threshold for deciding when to use the parallel code?
I'm keen to roll these cpp changes out into master this week. It seems like there is a threshold of around 10K: if n_row x n_col is above about 10K, then parallel is usually faster, but not always... any thoughts on this?
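i.e. something like this (just a sketch of the rule being discussed, not the actual cpp dispatch):
# Hypothetical helper encoding the element-count threshold
should_parallelise <- function(data, threshold = 10000) {
  (nrow(data) * ncol(data)) > threshold
}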
I think that in the interim I'm going to roll in @jimhester's rowMeans solution, just so I can get a nice speed boost until we work out the mechanics of how to get the cpp version working smoothly.
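For reference, my understanding of that interim approach (the exact implementation may differ) is base R's vectorised missingness counts:
# Sketch of the rowSums()/rowMeans() approach; function names are illustrative
add_n_miss_base <- function(data) {
  dplyr::mutate(data, n_miss = rowSums(is.na(data)))
}

add_prop_miss_base <- function(data) {
  dplyr::mutate(data, prop_miss = rowMeans(is.na(data)))
}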
It'd be great, @romainfrancois, if we can try and get this merged into master by the end of March.
Do you have any thoughts on when to decide to use the parallel feature? At the moment it looks like if there are more than 10K elements (row x col), then parallel is usually faster?
Yep, that seemed to be the threshold. I need to use isnan directly instead of Rcpp's NumericVector::is_na; that should be faster in all cases.
OK awesome!
Will have a 👀 this week.
Hi @romainfrancois
I was wondering if you think we can wrap this up for the 0.4.0 release, which I am aiming for July 9ish; otherwise we can aim for the 0.5.0 release in August. I understand if you have other things that take priority, so I'm happy to take things from here if you like.
I'm going to retire this, as at the moment I don't have the scope to manage a cpp branch of naniar. A lot of great things came out of these discussions though, so thank you, @romainfrancois :)