vctrs
vctrs copied to clipboard
Bypass vec_proxy if it cannot be implemented in constant time?
I've been working on a "random variable" datatype in posterior. This datatype is currently a zero-length list that stores a multidimensional array as an attribute; this attribute represents a multidimensional random array. This somewhat odd container format is the best I've been able to come up with so far that lets us hide internals from users and be compatible with tibbles and data frames and such. The first element of the internal array indexes draws from the distribution; this element is largely hidden from the user; basically [
and [[
are implemented so that slicing operations like x[i,j]
become things like x[,i,j]
internally. This makes it so that we can implement fast operations on these arrays, turning arithmetic on random variables into array operations and whatnot.
One drawback of this format is that making a proxy of this datatype is not a constant-time operation, as it has to be split into a list of a bunch of arrays along the second index of this internal array. I'll get back to that in a moment.
Anyway, using some nice stuff in tibble
and pillar
and whatnot we can also put these rvar
s into tibble and get nice output:
library(posterior) # devtools::install_github("stan-dev/posterior")
#> This is posterior version 1.0.0
#>
#> Attaching package: 'posterior'
#> The following objects are masked from 'package:stats':
#>
#> mad, sd, var
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
m = 1:50
x = rvar_rng(rnorm, 50, m) # 50 Normal(m, 1) random variables
x
#> rvar<4000>[50] mean ± sd:
#> [1] 1.0 ± 0.99 1.9 ± 0.99 3.0 ± 1.00 4.0 ± 0.99 5.0 ± 0.99
#> [6] 6.0 ± 1.00 7.0 ± 1.00 8.0 ± 1.00 9.0 ± 0.99 10.0 ± 1.00
#> [11] 11.0 ± 0.99 12.0 ± 1.01 13.0 ± 0.99 14.0 ± 1.00 15.0 ± 1.02
#> [16] 16.0 ± 1.01 17.0 ± 0.99 18.0 ± 1.02 19.0 ± 1.00 20.0 ± 1.01
#> [21] 21.0 ± 1.00 22.0 ± 1.02 23.0 ± 0.97 24.0 ± 0.99 25.0 ± 0.99
#> [26] 26.0 ± 1.01 27.0 ± 1.02 28.0 ± 1.00 29.0 ± 0.99 30.0 ± 1.00
#> [31] 31.0 ± 1.00 32.0 ± 0.98 33.0 ± 1.03 34.0 ± 1.01 35.0 ± 0.99
#> [36] 36.0 ± 1.00 37.0 ± 1.01 38.0 ± 1.00 39.0 ± 0.98 40.0 ± 0.99
#> [41] 41.0 ± 1.03 42.0 ± 1.00 43.0 ± 0.98 44.0 ± 1.00 45.0 ± 1.01
#> [46] 46.0 ± 1.00 47.0 ± 1.01 48.0 ± 0.99 49.0 ± 1.01 50.0 ± 1.01
df = tibble(m, x)
df
#> # A tibble: 50 x 2
#> m x
#> <int> <rvar>
#> 1 1 1.0 ± 0.99
#> 2 2 1.9 ± 0.99
#> 3 3 3.0 ± 1.00
#> 4 4 4.0 ± 0.99
#> 5 5 5.0 ± 0.99
#> 6 6 6.0 ± 1.00
#> 7 7 7.0 ± 1.00
#> 8 8 8.0 ± 1.00
#> 9 9 9.0 ± 0.99
#> 10 10 10.0 ± 1.00
#> # ... with 40 more rows
Created on 2021-07-11 by the reprex package (v2.0.0)
That all has been working quite well (thanks for everyone's work on this ecosystem btw!). However, I discovered recently that some operations on rvar
s are quite slow. I first discovered that dplyr::group_split()
and base split()
are both surprisingly slow and memory hungry. However, this is only true when the rvar
is stored in a tibble --- notice the first expression below (tibble) is slow and the second (data.frame) fast:
bench::mark(split(df, df$m), split(as.data.frame(df), df$m), check = FALSE)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 split(df, df$m) 606.78ms 606.78ms 1.65 466.25MB 21.4
#> 2 split(as.data.frame(df), df$m) 5.21ms 5.77ms 162. 3.13MB 22.0
I realized this all came back to vec_slice()
: for most data structures, vec_slice(x, i)
for scalar i
should be a constant-time operation, yet if vec_proxy()
is not constant-time it will be lower-bounded by the performance of vec_proxy()
. Throw that into any algorithm that calls vec_slice()
multiple times and things can blow up quickly.
We can see the runtime of a single slice is proportional to vector size, which makes sense since vec_proxy()
's runtime is as well:
bench::press(n = seq(1, 500, by = 50), {
x = rvar_rng(rnorm, n)
bench::mark(vctrs::vec_slice(x, 1))
})
#> # A tibble: 10 x 7
#> expression n min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <dbl> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 vctrs::vec_slice(x, 1) 1 583.2us 632.5us 1497. 709.5KB 10.6
#> 2 vctrs::vec_slice(x, 1) 51 10.3ms 10.9ms 90.8 11.1MB 28.4
#> 3 vctrs::vec_slice(x, 1) 101 24.2ms 25.6ms 39.6 21.7MB 211.
#> 4 vctrs::vec_slice(x, 1) 151 45ms 45ms 22.2 32.4MB 244.
#> 5 vctrs::vec_slice(x, 1) 201 57.3ms 59.3ms 16.6 43.1MB 18.5
#> 6 vctrs::vec_slice(x, 1) 251 76ms 81.5ms 9.67 53.8MB 21.3
#> 7 vctrs::vec_slice(x, 1) 301 94ms 95.3ms 10.5 64.5MB 14.0
#> 8 vctrs::vec_slice(x, 1) 351 113.3ms 144.1ms 7.02 75.2MB 19.3
#> 9 vctrs::vec_slice(x, 1) 401 133.2ms 137ms 7.13 85.9MB 17.8
#> 10 vctrs::vec_slice(x, 1) 451 158.6ms 159.1ms 5.71 96.6MB 17.1
I've implemented slicing for rvar
s manually, so when [
is called directly (presumably by data.frame
s but not by tibble
s), the operation is constant time, as expected:
bench::press(n = seq(1, 500, by = 50), {
x = rvar_rng(rnorm, n)
bench::mark(x[1])
})
#> # A tibble: 10 x 7
#> expression n min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <dbl> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 x[1] 1 64.9us 70.8us 13102. 155.7KB 16.8
#> 2 x[1] 51 65.9us 71.3us 13387. 62.6KB 17.1
#> 3 x[1] 101 64.7us 71.7us 13389. 62.6KB 16.7
#> 4 x[1] 151 66.3us 71.1us 13213. 62.6KB 18.9
#> 5 x[1] 201 65.7us 71us 13478. 62.6KB 21.2
#> 6 x[1] 251 65.7us 71.6us 13390. 62.6KB 27.8
#> 7 x[1] 301 65.8us 71.2us 13368. 62.6KB 30.2
#> 8 x[1] 351 65.4us 72.6us 13303. 62.6KB 49.1
#> 9 x[1] 401 64.2us 68.1us 13952. 62.6KB 32.4
#> 10 x[1] 451 64.9us 69.3us 13614. 62.6KB 23.4
I guess I am wondering what the solution is. The simple thing for me would be if vec_slice()
were generic I could overload it.
But the bigger question is, is the expectation that vec_proxy()
be a constant time operation? I don't know what else is using it so I don't know what else to check. Perhaps it would help if all operations that use vec_proxy()
in a way that worsens their time complexity if it is not constant to be made generic and listed somewhere.
Anyway, any help is appreciated. Thanks!
The simple thing for me would be if vec_slice() were generic I could overload it.
This is not currently on the agenda for vctrs because the goal is to be generic through proxies. We do fall back to [
for classes that don't implement vctrs methods though. But this fallback mechanism has limitations and inconsistencies that we don't plan to fix.
is the expectation that vec_proxy() be a constant time operation?
I think so. For your data structure it seems like a more efficient proxy from the point of view of vctrs would be an array where the first dimension indexes observations rather than samples.
I wonder if we could add a customisation point for specifying which index represents observations. This way you wouldn't have to transform your data. Do you think that would be feasible @DavisVaughan?
I think so. For your data structure it seems like a more efficient proxy from the point of view of vctrs would be an array where the first dimension indexes observations rather than samples.
Makes sense. Yeah, if the draws-first format weren't already commonly used in lots of packages we could be more flexible on that part of the format.
I wonder if we could add a customisation point for specifying which index represents observations. This way you wouldn't have to transform your data
If that's possible that would be great!
Something else that just occurred to me as a stop gap I can implement in the meantime is to cache rvar proxies. At least then if vec_slice is used repeatedly on the same object in some algorithm (which is what I assume is happening with split) the cost of the proxy is only incurred once. It wouldn't solve all the potential issues but might help.
In case it is helpful to others encountering this problem, I used a simple caching approach in https://github.com/stan-dev/posterior/commit/c5036ffdc73fdb05da7199096fb1848c5586e6d3 to improve runtimes on algorithms that call vec_slice()
repeatedly. There are still disadvantages compared to a constant-time proxy (namely, singular slicing operations are still O(n) instead of O(1) and this increases memory usage of the object by about 2x when used with vctrs functions), but in my informal tests it drastically improved the speed of things like dplyr::group_split()
and tidyr::pivot_wider()
(by about 10x).
e.g. before:
library(posterior)
library(dplyr)
make_df = function() {
m = 1:50
x = rvar_rng(rnorm, 50, m) # 50 Normal(m, 1) random variables
tibble(m, x)
}
bench::mark(
tibble = {df = make_df(); split(df, df$m)},
data.frame = {df = make_df(); split(as.data.frame(df), df$m)},
check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 tibble 638.7ms 638.7ms 1.57 480.6MB 20.4
#> 2 data.frame 27.1ms 31.8ms 28.9 16.9MB 23.4
Created on 2021-07-13 by the reprex package (v2.0.0)
After:
library(posterior)
library(dplyr)
make_df = function() {
m = 1:50
x = rvar_rng(rnorm, 50, m) # 50 Normal(m, 1) random variables
tibble(m, x)
}
bench::mark(
tibble = {df = make_df(); split(df, df$m)},
data.frame = {df = make_df(); split(as.data.frame(df), df$m)},
check = FALSE
)
#> # A tibble: 2 x 13
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:> <bch:> <dbl> <bch:byt> <dbl>
#> 1 tibble 38.2ms 39.3ms 25.2 15.3MB 5.04
#> 2 data.frame 21.3ms 23.3ms 42.2 10.7MB 7.45
Created on 2021-07-13 by the reprex package (v2.0.0)
Any further help still appreciated of course, especially if it were possible for a constant-time proxy to be handled here, e.g. using @lionel-'s suggestion.