Add dispatcher that returns function pointer
I found this functionality useful while writing a noise library and wanted to know whether you would consider adding something like it. My use case: I have many noise functions, each with its own multiversion attribute. They are composed at runtime and called in series to fill an array with noise. Going through the dispatcher on every function call adds too much overhead, so I want to call the dispatcher once at construction time and keep the resulting function pointer, skipping the dispatcher during computation.
I don't intend for this to be merged as-is; it was just easier to show the idea through a PR.
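The use case above (resolve each stage once at construction, then call the stored pointers in series) can be sketched in plain Rust without the macro. Everything here is hypothetical and for illustration only: the `Stage` alias, the `resolve_*` functions standing in for a dispatcher, the `Pipeline` type, and the stand-in "wide" version are not part of multiversion or of the noise library.

```rust
/// Hypothetical stage signature: each stage rewrites the buffer in place.
type Stage = fn(&mut [f32]);

fn offset_generic(buf: &mut [f32]) {
    for v in buf.iter_mut() {
        *v += 1.0;
    }
}

fn scale_generic(buf: &mut [f32]) {
    for v in buf.iter_mut() {
        *v *= 2.0;
    }
}

/// Stand-in for a feature-specific build of `scale_generic`. A real
/// multiversioned function would be compiled with extra target features;
/// this one just forwards so the sketch stays portable.
#[allow(dead_code)]
fn scale_wide_stand_in(buf: &mut [f32]) {
    scale_generic(buf)
}

/// Stand-in dispatcher: decides which version to use, once.
fn resolve_offset() -> Stage {
    offset_generic
}

/// Stand-in dispatcher with a runtime CPU feature check on x86_64.
fn resolve_scale() -> Stage {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx2") {
            return scale_wide_stand_in;
        }
    }
    scale_generic
}

/// Holds already-resolved function pointers, so filling a buffer
/// never goes back through a dispatcher.
pub struct Pipeline {
    stages: Vec<Stage>,
}

impl Pipeline {
    /// Dispatch happens here, once per stage.
    pub fn new() -> Self {
        Pipeline {
            stages: vec![resolve_offset(), resolve_scale()],
        }
    }

    /// Runs every stage in order through plain function-pointer calls.
    pub fn fill(&self, buf: &mut [f32]) {
        for &stage in &self.stages {
            stage(buf);
        }
    }
}
```

Calling `Pipeline::new()` once and then `fill` repeatedly keeps the feature detection out of the hot path, which is the behavior the proposed pointer-returning dispatcher is meant to enable.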
I could see something like this being useful. The only dispatcher overhead vs this, however, should be an atomic load. Is there really that much overhead?
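To make the "atomic load per call" cost concrete, here is a minimal sketch of a cached-pointer dispatcher next to a resolver that hands the pointer back to the caller. It is an assumption about the general shape of such a dispatcher, not multiversion's actual generated code; `AddFn`, `resolve`, `CACHED`, `add_one_dispatched`, and `add_one_pointer` are all hypothetical names.

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

/// Hypothetical signature of the dispatched function.
type AddFn = fn(f32) -> f32;

fn add_one_generic(x: f32) -> f32 {
    x + 1.0
}

/// Stand-in for feature detection: a real dispatcher would choose
/// among several feature-specific versions here.
fn resolve() -> AddFn {
    add_one_generic
}

/// Cache for the chosen version; null means "not resolved yet".
static CACHED: AtomicPtr<()> = AtomicPtr::new(ptr::null_mut());

/// Cached-pointer dispatch: every call pays one atomic load plus an
/// indirect call, and the first call also pays for `resolve`.
pub fn add_one_dispatched(x: f32) -> f32 {
    let mut p = CACHED.load(Ordering::Relaxed);
    if p.is_null() {
        p = resolve() as *mut ();
        CACHED.store(p, Ordering::Relaxed);
    }
    // SAFETY: `p` is non-null and was produced from an `AddFn` above.
    let f: AddFn = unsafe { std::mem::transmute::<*mut (), AddFn>(p) };
    f(x)
}

/// Pointer-returning variant: the caller resolves once and then makes
/// plain indirect calls with no per-call atomic load.
pub fn add_one_pointer() -> AddFn {
    resolve()
}
```

In a loop, the second form lets the caller hoist the resolution out manually, whereas the first relies on the compiler to prove the load can be hoisted.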
#![feature(portable_simd)]
use std::simd::prelude::*;

use multiversion::multiversion;

#[multiversion(targets = "simd", dispatcher = "indirect")]
pub fn indirect_add(res: Simd<f32, 8>) -> Simd<f32, 8> {
    res + Simd::splat(1.0)
}

#[multiversion(targets = "simd")]
pub fn benchmark_indirect() {
    let mut res = Simd::<f32, 8>::splat(0.0);
    for _ in 0..1000 {
        res = indirect_add(res);
        res = indirect_add(res);
        res = indirect_add(res);
        res = indirect_add(res);
        res = indirect_add(res);
        res = indirect_add(res);
        res = indirect_add(res);
        res = indirect_add(res);
        res = indirect_add(res);
    }
}

#[multiversion(targets = "simd", dispatcher = "pointer")]
pub fn pointer_add(res: Simd<f32, 8>) -> Simd<f32, 8> {
    res + Simd::splat(1.0)
}

#[multiversion(targets = "simd")]
pub fn benchmark_pointer() {
    let add_function = pointer_add();
    let mut res = Simd::<f32, 8>::splat(0.0);
    for _ in 0..1000 {
        unsafe {
            res = (add_function)(res);
            res = (add_function)(res);
            res = (add_function)(res);
            res = (add_function)(res);
            res = (add_function)(res);
            res = (add_function)(res);
            res = (add_function)(res);
            res = (add_function)(res);
            res = (add_function)(res);
        }
    }
}
indirect/lib            time:   [36.351 µs 36.420 µs 36.494 µs]
                        change: [-3.5538% -3.0421% -2.4988%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) low mild
  5 (5.00%) high mild
  1 (1.00%) high severe

pointer/lib             time:   [23.154 µs 23.259 µs 23.360 µs]
                        change: [-2.4580% -1.5319% -0.1121%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
Does this look correct? I don't know enough to say why, but I assumed it had to do with atomics requiring a trip to the closest shared cache on each call. The gap grows as more calls are added to the chain, but it goes away if you reduce it to a single call per loop iteration. Maybe it's some compiler behavior, I don't know...
Another concern I ran into again while writing the benchmark: the indirect dispatcher doesn't allow const generics. I probably removed that restriction without knowing why it was there, but I think it made my function interfaces much nicer.
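One plausible reason for that restriction, stated here as an assumption and not as the crate's documented rationale: a dispatcher that caches its choice in a `static` cannot be generic, because a `static` declared inside a generic function cannot refer to that function's generic parameters. A resolver that simply returns the pointer needs no static, so it can be instantiated per const parameter. A minimal sketch with hypothetical names (`add_one_impl`, `add_one_pointer`):

```rust
/// Hypothetical const-generic function standing in for a multiversioned one.
fn add_one_impl<const N: usize>(v: [f32; N]) -> [f32; N] {
    v.map(|x| x + 1.0)
}

/// Pointer-returning resolver: nothing is cached in a static, so each
/// value of `N` gets its own instantiation and its own pointer.
/// A real resolver would select a feature-specific version here.
pub fn add_one_pointer<const N: usize>() -> fn([f32; N]) -> [f32; N] {
    add_one_impl::<N>
}

// By contrast, a cached dispatcher would need something like
//     static CACHED: AtomicPtr<()> = ...;
// inside the generic function. That single static would be shared by
// every `N`, and its type or initializer cannot mention `N`.
```

With this shape the caller writes `let f = add_one_pointer::<8>();` once and reuses `f`, which matches the interface the pointer dispatcher exposes in the benchmark.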