
Add dispatcher that returns function pointer

Open fmcgg opened this issue 1 year ago • 3 comments

I found this functionality useful when creating a noise library and would like to know whether you would consider adding something like it. The use case: I have many noise functions, each with its own multiversion attribute. They are composed at runtime and called in series to fill an array with noise. Going through the dispatcher on every function call adds too much overhead, so I instead want to call the dispatcher once at construction time and keep the returned function pointer, so I can skip the dispatcher while computing.

No intention to merge this; it's just easier to show through a PR.
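A rough sketch of that use case, using only the standard library. Every name here (`NoiseFn`, `resolve_add_one`, `resolve_double`, `Pipeline`) is hypothetical and stands in for what a pointer-returning dispatcher might hand back; it is not multiversion's API. The "wide" version is a placeholder for a function compiled with extra target features.

```rust
// Hypothetical stand-ins; none of these names come from multiversion.
type NoiseFn = fn(&mut [f32]);

fn add_one_scalar(buf: &mut [f32]) {
    for x in buf.iter_mut() {
        *x += 1.0;
    }
}

// Placeholder for a version that would be compiled with wider target features.
fn add_one_wide(buf: &mut [f32]) {
    for chunk in buf.chunks_mut(8) {
        for x in chunk {
            *x += 1.0;
        }
    }
}

fn double_scalar(buf: &mut [f32]) {
    for x in buf.iter_mut() {
        *x *= 2.0;
    }
}

// Stand-in for the dispatcher: detect features once and return a pointer.
fn resolve_add_one() -> NoiseFn {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx2") {
            return add_one_wide;
        }
    }
    add_one_scalar
}

// Only one version of this stage exists in the sketch.
fn resolve_double() -> NoiseFn {
    double_scalar
}

// The pipeline resolves every stage at construction time.
struct Pipeline {
    stages: Vec<NoiseFn>,
}

impl Pipeline {
    fn new() -> Self {
        Pipeline {
            stages: vec![resolve_add_one(), resolve_double()],
        }
    }

    // No feature detection or cached-pointer lookup happens here,
    // only calls through the stored pointers.
    fn fill(&self, buf: &mut [f32]) {
        for stage in &self.stages {
            stage(buf);
        }
    }
}
```

Constructing a `Pipeline` once and calling `fill` repeatedly pays the selection cost only in `new`.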

fmcgg avatar Sep 14 '24 15:09 fmcgg

I could see something like this being useful. The only dispatcher overhead vs this, however, should be an atomic load. Is there really that much overhead?
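For reference, a minimal sketch of what "an atomic load per call" could look like. This is an assumption about how a cached indirect dispatcher is commonly written, not the code multiversion actually generates; `AddFn`, `detect`, `CACHED`, and `add_dispatched` are made-up names.

```rust
use std::sync::atomic::{AtomicPtr, Ordering};

type AddFn = fn(f32) -> f32;

fn add_baseline(x: f32) -> f32 {
    x + 1.0
}

// Stand-in for runtime feature detection; here it always picks the baseline.
fn detect() -> AddFn {
    add_baseline
}

// The selected function pointer, cached after the first call.
static CACHED: AtomicPtr<()> = AtomicPtr::new(std::ptr::null_mut());

pub fn add_dispatched(x: f32) -> f32 {
    // Every call pays for this load and the null check.
    let mut ptr = CACHED.load(Ordering::Relaxed);
    if ptr.is_null() {
        ptr = detect() as *mut ();
        CACHED.store(ptr, Ordering::Relaxed);
    }
    // SAFETY: `ptr` is non-null and was produced from an `AddFn` above.
    let f: AddFn = unsafe { std::mem::transmute::<*mut (), AddFn>(ptr) };
    f(x)
}
```

Calling `detect()` once and keeping the returned `AddFn` skips the load and the branch on later calls; the call through the pointer itself remains in both cases.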

calebzulawski avatar Sep 14 '24 16:09 calebzulawski

#![feature(portable_simd)]

use std::simd::prelude::*;

use multiversion::multiversion;

#[multiversion(targets = "simd", dispatcher = "indirect")]
pub fn indirect_add(res: Simd<f32, 8>) -> Simd<f32, 8> {
    res + Simd::splat(1.0)
}

#[multiversion(targets = "simd")]
pub fn benchmark_indirect() {
    let mut res = Simd::<f32, 8>::splat(0.0);
    for _ in 0..1000 {
        res = indirect_add(res);
        res = indirect_add(res);
        res = indirect_add(res);
        res = indirect_add(res);
        res = indirect_add(res);
        res = indirect_add(res);
        res = indirect_add(res);
        res = indirect_add(res);
        res = indirect_add(res);
    }
}

#[multiversion(targets = "simd", dispatcher = "pointer")]
pub fn pointer_add(res: Simd<f32, 8>) -> Simd<f32, 8> {
    res + Simd::splat(1.0)
}

#[multiversion(targets = "simd")]
pub fn benchmark_pointer() {
    // Run the dispatcher once and keep the returned function pointer.
    let add_function = pointer_add();
    let mut res = Simd::<f32, 8>::splat(0.0);
    for _ in 0..1000 {
        unsafe {
            res = (add_function)(res);
            res = (add_function)(res);
            res = (add_function)(res);
            res = (add_function)(res);
            res = (add_function)(res);
            res = (add_function)(res);
            res = (add_function)(res);
            res = (add_function)(res);
            res = (add_function)(res);
        }
    }
}

indirect/lib            time:   [36.351 µs 36.420 µs 36.494 µs]
                        change: [-3.5538% -3.0421% -2.4988%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) low mild
  5 (5.00%) high mild
  1 (1.00%) high severe

pointer/lib             time:   [23.154 µs 23.259 µs 23.360 µs]
                        change: [-2.4580% -1.5319% -0.1121%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

Does this look correct? I don't know enough to say why, but I assumed it had to do with atomics requiring a trip to the closest shared cache on each call. The gap grows the more calls there are in the chain, but it goes away if you reduce it to a single call per loop iteration, so maybe it's some compiler optimization, I don't know...

fmcgg avatar Sep 19 '24 13:09 fmcgg

Another concern I ran into again while writing the benchmark: indirect calls don't let you use const generics. I probably removed that check without knowing why it was there, but it made my function interfaces much nicer, I think.
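A guess at why the restriction exists: an indirect dispatcher that caches its choice in a `static` cannot make that static depend on the function's generic parameters, since Rust does not allow statics to use generics from an enclosing item. Returning a function pointer avoids the shared static, so each instantiation can be resolved separately. A small sketch with hypothetical names (`add_one`, `resolve_add_one`), not taken from the PR:

```rust
// A const-generic function; each N is a separate monomorphized instance.
fn add_one<const N: usize>(mut v: [f32; N]) -> [f32; N] {
    for x in v.iter_mut() {
        *x += 1.0;
    }
    v
}

// Hypothetical resolver: returns a pointer to the instance for this N.
// A real dispatcher would choose between feature-specific versions here.
fn resolve_add_one<const N: usize>() -> fn([f32; N]) -> [f32; N] {
    add_one::<N>
}
```

The caller picks `N` when resolving, for example `resolve_add_one::<8>()`, and then reuses that pointer.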

fmcgg avatar Sep 19 '24 13:09 fmcgg