Modifying large matrix field is slow in R6

Open johanneswaage opened this issue 4 years ago • 4 comments

A large-ish matrix is part of an R6 object as a public or private field, and an update_matrix method (called modify in the reprex below) allows updating a single cell of the matrix. However, this operation is very slow compared to updating a normal matrix object outside R6: on the order of milliseconds instead of nanoseconds. Reprex (for this example, the matrix is called NBA and has customers x features):

library(bench)
library(ggplot2)
library(tidyr)

# Create NBA matrix, sparse
no_customers <- 1e6
no_features <- 30

NBA_matrix <-
  matrix(
    sample(c(rep(0, 1000), 1), size = no_customers * no_features, replace = TRUE),
    nrow = no_customers,
    ncol = no_features
  )

# Create NBA_like R6 object with matrix
library(R6)
NBA_lite <- R6Class("NBA_lite", class = FALSE, portable = FALSE, cloneable = FALSE, 
              public = list(
                mm = NULL,
                initialize = function(input_matrix) self$mm <- input_matrix,
                get_matrix = function() self$mm,
                modify = function(row, col, value) self$mm[row,col] <- value
              )
)
new_NBA_lite <- NBA_lite$new(input_matrix = NBA_matrix)

# Benchmark modifying single value, matrix vs R6 field
results <- bench::mark(matrix     = NBA_matrix[234123, 10] <- 2,
                       R6_method  = new_NBA_lite$modify(row = 234123, col = 10, value = 2),
                       R6_field   = new_NBA_lite$mm[234123, 10] <- 2)

expression                                               median   n_gc
NBA_matrix[234123, 10] <- 2                               779ns      0
new_NBA_lite$modify(row = 234123, col = 10, value = 2)    126ms      1
new_NBA_lite$mm[234123, 10] <- 2                          127ms      1

I'm wondering about this overhead of > 100ms. Looking at the R6 performance vignette, there seems to be something else at play here. Does garbage collection add this overhead, and is it necessary?

Thanks in advance

johanneswaage avatar Jan 29 '20 21:01 johanneswaage

Since you're using portable = FALSE, you can actually speed it up by using mm[row,col] <<- value instead of self$mm[row,col] <- value.

For example, here's a modified version of your code:

library(bench)

# Create NBA matrix, sparse
no_customers <- 1e6
no_features <- 30

NBA_matrix <-
  matrix(
    sample(c(rep(0, 1000), 1), size = no_customers * no_features, replace = TRUE),
    nrow = no_customers,
    ncol = no_features
  )

# Create NBA_like R6 object with matrix
library(R6)
NBA_lite <- R6Class("NBA_lite", class = FALSE, portable = FALSE, cloneable = FALSE, 
              public = list(
                mm = NULL,
                initialize = function(input_matrix) self$mm <- input_matrix,
                get_matrix = function() self$mm,
                modify = function(row, col, value) self$mm[row,col] <- value,
                modify2 = function(row, col, value) mm[row,col] <<- value
              )
)
new_NBA_lite <- NBA_lite$new(input_matrix = NBA_matrix)

results <- bench::mark(matrix     = NBA_matrix[234123, 10] <- 2,
                       R6_method  = new_NBA_lite$modify(row = 234123, col = 10, value = 2),
                       R6_method2 = new_NBA_lite$modify2(row = 234123, col = 10, value = 2),
                       R6_field   = new_NBA_lite$mm[234123, 10] <- 2)
results

Here's the result:

# A tibble: 4 x 13
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 matrix        689ns    816ns 1117403.      229MB      0   10000     0     8.95ms
2 R6_method   36.32ms  41.57ms      24.2     229MB     40.3     3     5    124.1ms
3 R6_method2   1.27µs   1.66µs  574883.         0B      0   10000     0    17.39ms
4 R6_field     37.5ms  43.55ms      22.1     229MB     27.7     4     5   180.79ms
# … with 4 more variables: result <list>, memory <list>, time <list>, gc <list>

This speeds things up because of how the <- and <<- operators work in R.

When you do something like self$x <- y, that actually gets turned into something like this:

`*tmp*` <- self
self <- "$<-"(`*tmp*`, "x", value = y)
rm(`*tmp*`)

This creates `*tmp*`, which initially points to the same object in memory as self. With indexed assignment into a field, as in self$mm[row, col] <- value, the matrix is fetched through `*tmp*` before being modified, so R treats it as having more than one reference; it makes a copy of the whole matrix and modifies the copy. The old version then needs to be GC'd (garbage collected) later, and that takes time. On the other hand, when you use mm[row, col] <<- value, R modifies mm directly in place, without making a copy. See here for more info about subset assignment: https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Subset-assignment
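
As a rough illustration of the expansion described above, the following sketch writes out by hand what R does, going by the R Language Definition, for a nested replacement such as self$x[2] <- 10. The environment self and the field x are illustrative stand-ins here, not an actual R6 object.

```r
# Stand-in for an R6 object: an environment with one field
self <- new.env()
self$x <- c(1, 2, 3)

# Hand-written equivalent of: self$x[2] <- 10
`*tmp*` <- self
self <- `$<-`(`*tmp*`, "x", value = `[<-`(`*tmp*`$x, 2, value = 10))
rm(`*tmp*`)

# The field now holds the modified vector
self$x
```

Note that the inner `[<-` call receives a vector fetched via `*tmp*`$x rather than the variable itself, which appears to be where the extra reference comes from.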

Here's another example that illustrates:

local({
  self <- environment()
  x <- 0
  y <- 0
  bench::mark(
    self$x <- x + 1,
    y <<- y + 1,
    iterations = 1e5
  )
})
#> # A tibble: 2 x 13
#>   expression        min median `itr/sec` mem_alloc `gc/sec`  n_itr  n_gc total_time
#>   <bch:expr>      <bch> <bch:>     <dbl> <bch:byt>    <dbl>  <int> <dbl>   <bch:tm>
#> 1 self$x <- x + 1 490ns  595ns  1457224.        0B     14.6  99999     1     68.6ms
#> 2 y <<- y + 1     130ns  145ns  5800715.        0B      0   100000     0     17.2ms
#> # … with 4 more variables: result <list>, memory <list>, time <list>, gc <list>

I think that, in your case, the performance penalty is even greater since you're using double subset assignment, with both $ and [.

If there's only a single reference to a vector and you change a value inside of it, R will modify the data structure in place. However, if there are multiple references to it, R needs to make a copy when doing the assignment. The `*tmp*` mechanism adds a reference to the object, so R has to make a copy. The copies take time to create, and they take time to GC later on.
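
The copy-on-modify rule mentioned above can be seen with plain vectors. This is a minimal sketch with made-up variable names; the tracemem() call is optional and only does something on R builds with memory profiling enabled, which is assumed here rather than guaranteed.

```r
a <- c(1, 2, 3)

# Optionally ask R to report when `a` gets duplicated
traced <- capabilities("profmem")
if (traced) tracemem(a)

b <- a        # `a` and `b` now refer to the same vector
b[1] <- 10    # two references exist, so R copies before modifying

a             # unchanged
b             # modified copy

if (traced) untracemem(a)
```

With a large object, that copy is the expensive step, both when it is made and when the old version is collected.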

I'm leaving this issue open so that it reminds us to document this behavior.

wch avatar Feb 07 '20 16:02 wch

Hi Winston, Thank you very much for providing this insight into both R6 and R subset assignment in general - it works well on my end in a larger system. You might just have vindicated using R in production for my company ;) Best regards,

johanneswaage avatar Feb 10 '20 13:02 johanneswaage

@johanneswaage Great to hear!

wch avatar Feb 10 '20 20:02 wch

For future reference, here's a comparison of:

  • Using self vs. <<- in the assignment
  • Setting a single value vs. using subset (indexed) assignment

The short story is that there's a very small cost to using self when assigning to a single value, but when doing subset assignment, it can be expensive if the object is large (and a copy of the whole thing needs to be made).

new_obj <- function() {
  self <- environment()
  scalar <- 0
  vector <- 1:1e6
  
  list(
    set_scalar_noself = function(x) {
      scalar <<- x
    },
    set_scalar_self = function(x) {
      self$scalar <- x
    },
    set_vector_noself = function(i, x) {
      vector[i] <<- x
    },
    set_vector_self = function(i, x) {
      self$vector[i] <- x
    }
  )
}

obj <- new_obj()
microbenchmark::microbenchmark(
  obj$set_scalar_noself(12345),
  obj$set_scalar_self(12345),
  obj$set_vector_noself(1000, 12345),
  obj$set_vector_self(1000, 12345)
)
#> Unit: nanoseconds
#>                                expr    min        lq       mean  median        uq      max neval
#>        obj$set_scalar_noself(12345)    520     669.5    1771.58     955    2608.0    11003   100
#>          obj$set_scalar_self(12345)    977    1309.5    2727.64    1990    3440.5     8831   100
#>  obj$set_vector_noself(1000, 12345)   1058    1321.0    2591.27    2040    3004.5    11394   100
#>    obj$set_vector_self(1000, 12345) 991117 1326866.5 3412537.55 2477184 4884621.5 10733476   100

wch avatar Mar 16 '21 15:03 wch