Decide when parallelization takes place in cpp version
add_n_miss and add_prop_miss now have a .parallel argument, which was promoted up to R from the C++ side.
Eventually, I'd like to push it back down to the C++ side, but for now it gives us an opportunity to tune when the parallel version is used (the default is to use parallel when there are more than 10000 rows).
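Roughly, the dispatch on the R side looks like the sketch below. This is illustrative only: the stand-in counting functions are pure R with hypothetical names, while in the package the actual work happens in C++.
# Hypothetical pure-R stand-ins for the serial and parallel C++ workers
n_miss_row_serial   <- function(data) rowSums(is.na(data))
n_miss_row_parallel <- function(data) rowSums(is.na(data))

# Sketch of the dispatch rule: default to parallel above 10000 rows
add_n_miss_sketch <- function(data, .parallel = nrow(data) > 10000) {
  counter <- if (.parallel) n_miss_row_parallel else n_miss_row_serial
  dplyr::mutate(data, n_miss = counter(data))
}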
Some preliminary benchmarks:
library(microbenchmark)
library(dplyr)
library(naniar)

bench <- function(n = 10000, data = airquality) {
  d <- slice(data, sample(seq_len(nrow(data)), n, replace = TRUE))
  microbenchmark(
    par = add_n_miss(d, .parallel = TRUE),
    serial = add_n_miss(d, .parallel = FALSE)
  )
}
# not worth using parallel here
> bench(1000)
Unit: microseconds
expr min lq mean median uq max neval cld
par 50.063 58.5810 63.90178 61.3890 65.7140 171.812 100 b
serial 32.887 35.1995 39.28407 36.3645 37.4725 203.307 100 a
# break even between cost of parallel and performance gain
> bench(10000)
Unit: microseconds
expr min lq mean median uq max neval cld
par 70.867 86.7455 104.65175 102.6900 110.7485 259.294 100 b
serial 68.235 91.1535 94.73477 93.2545 94.3320 139.636 100 a
# then taking advantage of it
> bench(1e6)
Unit: milliseconds
expr min lq mean median uq max neval cld
par 2.323986 2.882439 5.809276 3.341341 4.243813 182.4460 100 a
serial 4.916264 6.102976 9.398288 6.766492 8.638943 195.2889 100 a
> bench(1e7)
Unit: milliseconds
expr min lq mean median uq max neval cld
par 28.24776 32.28037 42.57060 34.03642 45.83196 241.4553 100 a
serial 67.69803 75.96788 86.21276 79.41009 93.36936 284.4946 100 b
@njtierney the rule should not be too complex.
Awesome! :)
I'll let you play with this (perhaps varying the number of columns as well), but it looks like my initial 10000 was not too bad.
Not bad at all!
I've just run this with a function that generates a random matrix with a specified number of rows and columns.
First, I replicate your results on my machine.
library(microbenchmark)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(naniar)
bench <- function(n = 10000, data = airquality) {
  d <- slice(data, sample(seq_len(nrow(data)), n, replace = TRUE))
  microbenchmark(
    par = add_n_miss(d, .parallel = TRUE),
    serial = add_n_miss(d, .parallel = FALSE)
  )
}
bench(1000)
#> Unit: microseconds
#> expr min lq mean median uq max neval cld
#> par 52.942 57.9025 65.76265 61.558 65.9450 311.301 100 b
#> serial 29.481 31.1645 42.74648 33.295 34.9585 896.108 100 a
bench(10000)
#> Unit: microseconds
#> expr min lq mean median uq max neval cld
#> par 76.999 88.2325 100.4541 95.0185 104.3805 208.656 100 a
#> serial 82.775 90.5545 104.1372 105.4470 108.2400 205.606 100 a
bench(1e6)
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> par 1.661647 2.040611 5.532474 2.602328 4.120704 127.64148 100 a
#> serial 5.201891 5.940883 8.247077 7.173730 9.186424 22.58627 100 a
bench(1e7)
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> par 16.74651 19.26304 28.79824 21.69526 32.25247 183.3326 100 a
#> serial 64.51201 71.60433 86.03827 83.04596 91.60811 248.2516 100 b
So far so good.
Next, I make a function to generate data with a given number of rows and columns, and with missing values:
bench_rc <- function(col_size, row_size) {
  n_elements <- col_size * row_size
  raw <- rnorm(n_elements)
  # set about 30% of values to NA
  raw[sample(x = 1:n_elements, size = floor(n_elements / 3))] <- NA
  # row_size gives the number of rows, col_size the number of columns
  d <- matrix(data = raw,
              nrow = row_size,
              ncol = col_size) %>%
    tibble::as_tibble()
  microbenchmark(
    par = add_n_miss(d, .parallel = TRUE),
    serial = add_n_miss(d, .parallel = FALSE)
  )
}
bench_rc(col_size = 100,
row_size = 100)
#> Unit: microseconds
#> expr min lq mean median uq max neval cld
#> par 106.446 128.939 156.3139 139.728 172.5495 372.372 100 b
#> serial 95.549 114.790 127.0468 121.754 133.4420 222.849 100 a
bench_rc(col_size = 100,
row_size = 1000)
#> Unit: microseconds
#> expr min lq mean median uq max neval cld
#> par 843.265 1148.941 1327.127 1306.700 1467.245 2039.393 100 b
#> serial 756.137 862.329 1012.803 1005.898 1090.797 1845.178 100 a
bench_rc(col_size = 1000,
row_size = 100)
#> Unit: microseconds
#> expr min lq mean median uq max neval cld
#> par 369.979 507.6235 640.2205 608.1715 735.2590 1373.724 100 a
#> serial 660.678 709.8110 808.2027 772.8890 862.1445 1495.670 100 b
So for 1000 rows and 100 columns, serial is faster.
But for 100 rows and 1000 columns, parallel is faster.
This pattern carries through at larger sizes.
bench_rc(col_size = 1000,
row_size = 1000)
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> par 4.482235 5.430947 9.121563 6.558558 8.801652 41.68146 100 a
#> serial 6.632624 6.914668 7.993161 7.309024 7.968745 15.56544 100 a
bench_rc(col_size = 10000,
row_size = 1000)
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> par 19.86186 24.51354 33.79760 26.56344 31.87995 151.3237 100 a
#> serial 53.86446 63.93033 70.24344 68.46899 73.44370 108.0184 100 b
bench_rc(col_size = 1000,
row_size = 10000)
#> Unit: milliseconds
#>    expr      min        lq      mean    median        uq       max neval cld
#>     par 88.05169 118.87281 152.01282 126.62703 148.66614 418.81963   100   b
#>  serial 57.69362  69.94338  73.37348  72.58512  76.08338  96.75694   100   a
But it seems like there is a bit more to it than just the total element count...
Still, interesting! More to do.
Romain, do you think this behaviour makes sense?
I think what I need to do actually is set up a simulation study that goes through the number of columns and rows and compares them.
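Something like this rough sketch, reusing bench_rc() from above (the grid values here are just examples):
# Hypothetical simulation grid over row/column sizes; for each
# combination, run bench_rc() and keep the median timings.
library(purrr)

sim_grid <- expand.grid(
  col_size = c(10, 100, 1000),
  row_size = c(10, 100, 1000)
)

sim_results <- pmap_dfr(sim_grid, function(col_size, row_size) {
  bm <- bench_rc(col_size = col_size, row_size = row_size)
  summary(bm)[, c("expr", "median")] %>%
    mutate(col_size = col_size, row_size = row_size)
})
Then sim_results can be plotted to see where par overtakes serial.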
I need to revisit based on https://romain.rbind.io/blog/2017/10/24/fast-counting-na/ and the follow-up discussion on Twitter (that only affects numeric vectors though).
All bets are off when parallel code is involved; this is likely to be memory bound.
OK cool!
By "all bets are off", do you mean that we should probably just pick a set size threshold for deciding when to use the parallel code?
I'm keen to roll these cpp changes out into master this week. It seems like there is a threshold of around 10K: if n_row x n_col is above about 10K, then parallel is usually faster, but not always... any thoughts on this?
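i.e. something like this (just a sketch of the rule being discussed, not the actual cpp dispatch):
# Hypothetical helper encoding the element-count threshold
should_parallelise <- function(data, threshold = 10000) {
  (nrow(data) * ncol(data)) > threshold
}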
I think that in the interim I'm going to roll in @jimhester's rowMeans solution, just so I can get a nice speed boost until we work out the mechanics of how to get the cpp version working smoothly.
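For reference, my understanding of that interim approach (the exact implementation may differ) is base R's vectorised missingness counts:
# Sketch of the rowSums()/rowMeans() approach; function names are illustrative
add_n_miss_base <- function(data) {
  dplyr::mutate(data, n_miss = rowSums(is.na(data)))
}

add_prop_miss_base <- function(data) {
  dplyr::mutate(data, prop_miss = rowMeans(is.na(data)))
}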
It'd be great, @romainfrancois, if we can try and get this merged into master by the end of March.
Do you have any thoughts on when to decide to use the parallel feature? At the moment it looks like if there are more than 10K elements (row x col), then parallel is usually faster?
Yep, that seemed to be the threshold. I need to use isnan directly instead of Rcpp's NumericVector::is_na; that should be faster in all cases.
OK awesome!
Will have a 👀 this week.
Hi @romainfrancois
I was wondering if you think we can wrap this up for the 0.4.0 release, which I am aiming for July 9ish; otherwise we can aim for the 0.5.0 release in August. I understand if you have other things that take priority, so I'm happy to take things from here if you like.
I'm going to retire this, as at the moment I don't have the scope to manage a cpp branch of naniar. A lot of great things came out of these discussions though, so thank you, @romainfrancois :)