vctrs icon indicating copy to clipboard operation
vctrs copied to clipboard

Bypass vec_proxy if it cannot be implemented in constant time?

Open mjskay opened this issue 2 years ago • 3 comments

I've been working on a "random variable" datatype in posterior. This datatype is currently a zero-length list that stores a multidimensional array as an attribute; this attribute represents a multidimensional random array. This somewhat odd container format is the best I've been able to come up with so far that lets us hide internals from users and be compatible with tibbles and data frames and such. The first element of the internal array indexes draws from the distribution; this element is largely hidden from the user; basically [ and [[ are implemented so that slicing operations like x[i,j] become things like x[,i,j] internally. This makes it so that we can implement fast operations on these arrays, turning arithmetic on random variables into array operations and whatnot.

One drawback of this format is that making a proxy of this datatype is not a constant-time operation, as it has to be split into a list of a bunch of arrays along the second index of this internal array. I'll get back to that in a moment.

Anyway, using some nice stuff in tibble and pillar and whatnot we can also put these rvars into tibble and get nice output:

library(posterior)   # devtools::install_github("stan-dev/posterior")
#> This is posterior version 1.0.0
#> 
#> Attaching package: 'posterior'
#> The following objects are masked from 'package:stats':
#> 
#>     mad, sd, var
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

m = 1:50
x = rvar_rng(rnorm, 50, m)  # 50 Normal(m, 1) random variables
x
#> rvar<4000>[50] mean ± sd:
#>  [1]  1.0 ± 0.99   1.9 ± 0.99   3.0 ± 1.00   4.0 ± 0.99   5.0 ± 0.99 
#>  [6]  6.0 ± 1.00   7.0 ± 1.00   8.0 ± 1.00   9.0 ± 0.99  10.0 ± 1.00 
#> [11] 11.0 ± 0.99  12.0 ± 1.01  13.0 ± 0.99  14.0 ± 1.00  15.0 ± 1.02 
#> [16] 16.0 ± 1.01  17.0 ± 0.99  18.0 ± 1.02  19.0 ± 1.00  20.0 ± 1.01 
#> [21] 21.0 ± 1.00  22.0 ± 1.02  23.0 ± 0.97  24.0 ± 0.99  25.0 ± 0.99 
#> [26] 26.0 ± 1.01  27.0 ± 1.02  28.0 ± 1.00  29.0 ± 0.99  30.0 ± 1.00 
#> [31] 31.0 ± 1.00  32.0 ± 0.98  33.0 ± 1.03  34.0 ± 1.01  35.0 ± 0.99 
#> [36] 36.0 ± 1.00  37.0 ± 1.01  38.0 ± 1.00  39.0 ± 0.98  40.0 ± 0.99 
#> [41] 41.0 ± 1.03  42.0 ± 1.00  43.0 ± 0.98  44.0 ± 1.00  45.0 ± 1.01 
#> [46] 46.0 ± 1.00  47.0 ± 1.01  48.0 ± 0.99  49.0 ± 1.01  50.0 ± 1.01

df = tibble(m, x)
df
#> # A tibble: 50 x 2
#>        m            x
#>    <int>       <rvar>
#>  1     1   1.0 ± 0.99
#>  2     2   1.9 ± 0.99
#>  3     3   3.0 ± 1.00
#>  4     4   4.0 ± 0.99
#>  5     5   5.0 ± 0.99
#>  6     6   6.0 ± 1.00
#>  7     7   7.0 ± 1.00
#>  8     8   8.0 ± 1.00
#>  9     9   9.0 ± 0.99
#> 10    10  10.0 ± 1.00
#> # ... with 40 more rows

Created on 2021-07-11 by the reprex package (v2.0.0)

That all has been working quite well (thanks for everyone's work on this ecosystem btw!). However, I discovered recently that some operations on rvars are quite slow. I first discovered that dplyr::group_split() and base split() are both surprisingly slow and memory hungry. However, this is only true when the rvar is stored in a tibble --- notice the first expression below (tibble) is slow and the second (data.frame) fast:

bench::mark(split(df, df$m), split(as.data.frame(df), df$m), check = FALSE)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#>   expression                          min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                     <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 split(df, df$m)                606.78ms 606.78ms      1.65  466.25MB     21.4
#> 2 split(as.data.frame(df), df$m)   5.21ms   5.77ms    162.      3.13MB     22.0

I realized this all came back to vec_slice(): for most data structures, vec_slice(x, i) for scalar i should be a constant-time operation, yet if vec_proxy() is not constant-time it will be lower-bounded by the performance of vec_proxy(). Throw that into any algorithm that calls vec_slice() multiple times and things can blow up quickly.

We can see the runtime of a single slice is proportional to vector size, which makes sense since vec_proxy()'s runtime is as well:

bench::press(n = seq(1, 500, by = 50), {
  x = rvar_rng(rnorm, n)
  bench::mark(vctrs::vec_slice(x, 1))
})
#> # A tibble: 10 x 7
#>    expression                 n      min   median `itr/sec` mem_alloc `gc/sec`
#>    <bch:expr>             <dbl> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#>  1 vctrs::vec_slice(x, 1)     1  583.2us  632.5us   1497.     709.5KB     10.6
#>  2 vctrs::vec_slice(x, 1)    51   10.3ms   10.9ms     90.8     11.1MB     28.4
#>  3 vctrs::vec_slice(x, 1)   101   24.2ms   25.6ms     39.6     21.7MB    211. 
#>  4 vctrs::vec_slice(x, 1)   151     45ms     45ms     22.2     32.4MB    244. 
#>  5 vctrs::vec_slice(x, 1)   201   57.3ms   59.3ms     16.6     43.1MB     18.5
#>  6 vctrs::vec_slice(x, 1)   251     76ms   81.5ms      9.67    53.8MB     21.3
#>  7 vctrs::vec_slice(x, 1)   301     94ms   95.3ms     10.5     64.5MB     14.0
#>  8 vctrs::vec_slice(x, 1)   351  113.3ms  144.1ms      7.02    75.2MB     19.3
#>  9 vctrs::vec_slice(x, 1)   401  133.2ms    137ms      7.13    85.9MB     17.8
#> 10 vctrs::vec_slice(x, 1)   451  158.6ms  159.1ms      5.71    96.6MB     17.1

I've implemented slicing for rvars manually, so when [ is called directly (presumably by data.frames but not by tibbles), the operation is constant time, as expected:

bench::press(n = seq(1, 500, by = 50), {
  x = rvar_rng(rnorm, n)
  bench::mark(x[1])
})
#> # A tibble: 10 x 7
#>    expression     n      min   median `itr/sec` mem_alloc `gc/sec`
#>    <bch:expr> <dbl> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#>  1 x[1]           1   64.9us   70.8us    13102.   155.7KB     16.8
#>  2 x[1]          51   65.9us   71.3us    13387.    62.6KB     17.1
#>  3 x[1]         101   64.7us   71.7us    13389.    62.6KB     16.7
#>  4 x[1]         151   66.3us   71.1us    13213.    62.6KB     18.9
#>  5 x[1]         201   65.7us     71us    13478.    62.6KB     21.2
#>  6 x[1]         251   65.7us   71.6us    13390.    62.6KB     27.8
#>  7 x[1]         301   65.8us   71.2us    13368.    62.6KB     30.2
#>  8 x[1]         351   65.4us   72.6us    13303.    62.6KB     49.1
#>  9 x[1]         401   64.2us   68.1us    13952.    62.6KB     32.4
#> 10 x[1]         451   64.9us   69.3us    13614.    62.6KB     23.4

I guess I am wondering what the solution is. The simple thing for me would be if vec_slice() were generic I could overload it.

But the bigger question is, is the expectation that vec_proxy() be a constant time operation? I don't know what else is using it so I don't know what else to check. Perhaps it would help if all operations that use vec_proxy() in a way that worsens their time complexity if it is not constant to be made generic and listed somewhere.

Anyway, any help is appreciated. Thanks!

mjskay avatar Jul 11 '21 07:07 mjskay

The simple thing for me would be if vec_slice() were generic I could overload it.

This is not currently on the agenda for vctrs because the goal is to be generic through proxies. We do fall back to [ for classes that don't implement vctrs methods though. But this fallback mechanism has limitations and inconsistencies that we don't plan to fix.

is the expectation that vec_proxy() be a constant time operation?

I think so. For your data structure it seems like a more efficient proxy from the point of view of vctrs would be an array where the first dimension indexes observations rather than samples.

I wonder if we could add a customisation point for specifying which index represents observations. This way you wouldn't have to transform your data. Do you think that would be feasible @DavisVaughan?

lionel- avatar Jul 12 '21 08:07 lionel-

I think so. For your data structure it seems like a more efficient proxy from the point of view of vctrs would be an array where the first dimension indexes observations rather than samples.

Makes sense. Yeah, if the draws-first format weren't already commonly used in lots of packages we could be more flexible on that part of the format.

I wonder if we could add a customisation point for specifying which index represents observations. This way you wouldn't have to transform your data

If that's possible that would be great!

Something else that just occurred to me as a stop gap I can implement in the meantime is to cache rvar proxies. At least then if vec_slice is used repeatedly on the same object in some algorithm (which is what I assume is happening with split) the cost of the proxy is only incurred once. It wouldn't solve all the potential issues but might help.

mjskay avatar Jul 13 '21 08:07 mjskay

In case it is helpful to others encountering this problem, I used a simple caching approach in https://github.com/stan-dev/posterior/commit/c5036ffdc73fdb05da7199096fb1848c5586e6d3 to improve runtimes on algorithms that call vec_slice() repeatedly. There are still disadvantages compared to a constant-time proxy (namely, singular slicing operations are still O(n) instead of O(1) and this increases memory usage of the object by about 2x when used with vctrs functions), but in my informal tests it drastically improved the speed of things like dplyr::group_split() and tidyr::pivot_wider() (by about 10x).

e.g. before:

library(posterior)
library(dplyr)

make_df = function() {
  m = 1:50
  x = rvar_rng(rnorm, 50, m)  # 50 Normal(m, 1) random variables
  tibble(m, x)
}

bench::mark(
  tibble = {df = make_df(); split(df, df$m)}, 
  data.frame = {df = make_df(); split(as.data.frame(df), df$m)},
  check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 tibble      638.7ms  638.7ms      1.57   480.6MB     20.4
#> 2 data.frame   27.1ms   31.8ms     28.9     16.9MB     23.4

Created on 2021-07-13 by the reprex package (v2.0.0)

After:

library(posterior)
library(dplyr)

make_df = function() {
  m = 1:50
  x = rvar_rng(rnorm, 50, m)  # 50 Normal(m, 1) random variables
  tibble(m, x)
}

bench::mark(
  tibble = {df = make_df(); split(df, df$m)}, 
  data.frame = {df = make_df(); split(as.data.frame(df), df$m)},
  check = FALSE
)
#> # A tibble: 2 x 13
#>   expression    min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 tibble     38.2ms 39.3ms      25.2    15.3MB     5.04   
#> 2 data.frame 21.3ms 23.3ms      42.2    10.7MB     7.45

Created on 2021-07-13 by the reprex package (v2.0.0)

Any further help still appreciated of course, especially if it were possible for a constant-time proxy to be handled here, e.g. using @lionel-'s suggestion.

mjskay avatar Jul 13 '21 21:07 mjskay