
Optimization after refactoring to AD in 0.30 - example failing

Open pavhofman opened this issue 3 years ago • 10 comments

Hi, thanks a lot for the great library.

I tried running the example in optimization.rs in 0.30.6 and it does not optimize; the output p keeps the initial value. With 0.29.1, its example (i.e. before the refactoring from Number to AD) works OK.

Thanks a lot,

Pavel.

pavhofman avatar Apr 01 '21 08:04 pavhofman

@pavhofman Thanks for the good point! It is caused by the operation rule of AD. In the example, the quad function is given as follows:

fn quad(x: &Vec<f64>, n: Vec<AD>) -> Option<Vec<AD>> {
    Some(
        x.clone().into_iter()
            .map(|t| AD::from(t))
            .map(|t| t.pow(n[0]))
            .collect()
    )
}

See AD::from(t). AD::from is the function with the signature f64 -> AD0. Then, in the next step, t.pow(n[0]) represents AD0.pow(AD0 or AD1 or AD2). But any arithmetic operation between AD enums always returns the lower order of the two operands. Thus, t.pow(n[0]) returns the AD0 type, and AD0 has zero derivative. Therefore, we can't optimize with the above function.
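For illustration, here is the demotion rule in isolation (a minimal sketch; the x()/dx() accessors are the same ones used later in this thread, and the values are arbitrary):

use peroxide::fuga::*;

fn main() {
    // AD0 carries no derivative slot; AD1 carries one first derivative.
    let demoted = AD0(2.0).pow(AD1(3.0, 1.0));     // AD0.pow(AD1) -> AD0: the derivative is dropped
    let kept = AD1(2.0, 0f64).pow(AD1(3.0, 1.0));  // AD1.pow(AD1) -> AD1: the derivative survives
    println!("{}", demoted.x());  // 8, but as AD0 there is no derivative left to read
    println!("{}", kept.dx());    // 8 * ln(2), i.e. d/dn of 2^n at n = 3
}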

But we can optimize with the function below!

fn quad(x: &Vec<f64>, n: Vec<AD>) -> Option<Vec<AD>> {
    Some(
        x.clone().into_iter()
            .map(|t| AD1(t, 0f64))
            .map(|t| t.pow(n[0]))
            .collect()
    )
}

Because of the AD1 type, it works! (I'll update this example in the next version.)

BTW, which syntax seems better? If you choose the second one or suggest any other idea, I'll implement it in the next version.

// First one - Current Version (`0.30.6`)
fn quad(x: &Vec<f64>, n: Vec<AD>) -> Option<Vec<AD>> {
    Some(
        x.clone().into_iter()
            .map(|t| AD1(t, 0f64))
            .map(|t| t.pow(n[0]))
            .collect()
    )
}
// Second one - Candidate of future version
fn quad(x: &Vec<f64>, n: Vec<AD>) -> Option<Vec<AD>> {
    Some(
        x.clone().into_iter()
            .map(|t| t.to_ad1())
            .map(|t| t.pow(n[0]))
            .collect()
    )
}
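
For reference, the second candidate could be provided as a small extension trait on f64. A rough sketch (this trait is hypothetical, not part of the current API):

use peroxide::fuga::*;

// Hypothetical extension trait: lets an f64 lift itself to AD1
// with a zero derivative seed, as in the second candidate above.
trait ToAD {
    fn to_ad1(self) -> AD;
}

impl ToAD for f64 {
    fn to_ad1(self) -> AD {
        AD1(self, 0f64)
    }
}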

Axect avatar Apr 01 '21 16:04 Axect

Hi, thank you very much for the very fast fix. I will try it tomorrow.

As for the choices - I would probably prefer the t.to_ad1() version, as it does not require any extra constant in the API.

I am a newbie to Rust. Today I tried hard (and always failed) to write a closure that would define the optimized function in one place - to be used both for calculating the measured or precise values and for running the optimization - t.pow(n[0]) in your example. So far, all of my optimization uses (in Octave/Matlab) have also involved other variables, which means a closure instead of a function (the same in Octave).

Please, would it be possible to consider an option or API for using one function/closure both for this:

// Noise to image
let mut y = x.fmap(|t| t.powi(2));

and this:

fn quad(x: &Vec<f64>, n: Vec<AD>) -> Option<Vec<AD>> {
    Some(
        x.clone().into_iter()
            .map(|t| AD1(t, 0f64))
            .map(|t| t.pow(n[0])) // <---- here
            .collect()
    )
}

Just the part inside the iteration. Often the optimized functions are very long, and having them in two places is inconvenient and error-prone.

Thanks a lot for your great effort.

With regards,

Pavel.

pavhofman avatar Apr 01 '21 19:04 pavhofman

Thanks for the great suggestion. I have also felt the inconvenience of optimization in peroxide. In fact, defining the optimized function in one place can be achieved with a generic function.

#[macro_use]
extern crate peroxide;
use peroxide::fuga::*;

fn main() {
    // ...

    // Noise to image
    let mut y = x.fmap(|t| pow_fn(t, &vec![2f64]));  // <---- use pow_fn

    // ...

    // Optimizer setting
    let mut opt = Optimizer::new(data, pow_optim);

    // ...
}

// One generic function (`Real` = `f64` or `AD`)
fn pow_fn<T: Real>(x: T, n: &Vec<T>) -> T {
    x.pow(n[0])
}

// Boilerplate to use the optimizer
fn pow_optim(x: &Vec<f64>, n: Vec<AD>) -> Option<Vec<AD>> {
    Some(
        x.clone().into_iter()
            .map(|t| AD1(t, 0f64))
            .map(|t| pow_fn(t, &n)) // <---- use pow_fn
            .collect()
    )
}

But it seems uncomfortable because of the boilerplate. So, there should be an improvement of the API. Maybe the next code seems more comfortable.

// Candidate
fn main() {
    // ...

    // Optimizer setting
    let mut opt = Optimizer::new(data, pow_fn_optim);
}

#[optimize]
fn pow_fn<T: Real>(x: T, n: &Vec<T>) -> T {
    x.pow(n[0])
}

#[optimize] automatically generates fn pow_fn_optim(x: &Vec<f64>, n: Vec<AD>) -> Option<Vec<AD>> via a procedural macro. In this case, you should use FUNCTIONNAME_optim to construct the optimizer.

I'll implement the above functionality in the next version (maybe 0.31.0).

Thank you once again for the great suggestions! :smile:

Axect avatar Apr 01 '21 22:04 Axect

Hi, thanks a lot for your great support.

The updated example works very well now. I am just afraid the move to AD has introduced some extra wrapping in the critical path, which affects performance. This is a comparison of your example in 0.29 and 0.30 for the 100 points as in the example (time measured by the measure_time crate, just for the optimization run + printout):

{
    print_time!("{:?}", "Regression 0.30");
    let p = opt.set_init_param(n_init)
        .set_max_iter(200)
        .set_method(LevenbergMarquardt)
        .optimize();
    p.print();                  // Optimized parameter
    opt.get_error().print();    // Optimized RMSE
}
target/release/test_rust 
[2.0002]
101.4596320642305
"Regression 0.30" took 4.10994ms

target/release/test_rust 
[1.9994]
113.37364941459128
"Regression 0.29" took 2.861277ms

And for 10,000 points (let mut x = seq(0., 99., 0.01f64);):

target/release/test_rust 
[2.0000]
100.30175592699732
"Regression 0.30" took 325.714667ms

target/release/test_rust 
[2.0001]
100.5700492713708
"Regression 0.29" took 264.950832ms

Please, is there any way to do something about the performance regression? I am afraid it is a major slowdown, especially for the shorter runs. My application is embedded/soft real-time and every ms counts, hence the choice of Rust :-)

Thank you very much for your suggestion of the generic function. IIUC there are no generic closures in Rust. The problem is that all my optimization functions require additional (fixed, non-optimized) parameters, which I do not know how to pass without using a closure instead of a function.

I am sorry for posting so many questions :-) Thanks a lot for your patience.

Best regards,

Pavel.

pavhofman avatar Apr 02 '21 11:04 pavhofman

Maybe there are four ways to achieve high performance.

  1. Reduce the number of iterations: In this case, 50 max iters are enough.

  2. Reduce the number of wrappings: As you mentioned, the extra wrapping makes things slower. So, I'll implement direct operations between f64 and AD soon. For the time being, the code below is faster.

// Current version
fn quad(x: &Vec<f64>, n: Vec<AD>) -> Option<Vec<AD>> {
    Some(
        x.clone().into_iter()
            .map(|t| pow_temp(t, n[0]))
            .collect()
    )
}

#[inline]
fn pow_temp(x: f64, y: AD) -> AD {
    AD1(x.powf(y.x()), x.powf(y.x()) * x.ln() * y.dx())
}

For the next version, you will be able to use the code below.

// Future version
fn quad(x: &Vec<f64>, n: Vec<AD>) -> Option<Vec<AD>> {
    Some(
        x.clone().into_iter()
            .map(|t| t.pow(n[0])) // <-- Direct computation
            .collect()
    )
}

  3. Change the structure of AD: Until 0.28.2, peroxide used proc_macro to generate AD structures. It guaranteed perfect runtime performance, but it required an egregious compile time. Thus, after 0.30.0, I changed the structure of AD to enums. Which do you prefer? It's helpful to refer to #38.

  4. JAAF (Jacobian As A Function): In the current version, Optimizer has to obtain the jacobian at every step. Although it is obtained by automatic differentiation, it decreases performance. If the jacobian can be pre-calculated at the first step into a semi-symbolic form via an algebraic expression evaluation engine (e.g. fasteval), then a huge performance improvement is expected. I plan to implement this in the future. (A rough sketch of the idea follows this list.)
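
Just to show the flavor of JAAF, a minimal sketch with fasteval (the expression, names, and the way this would plug into Optimizer are all undecided):

use std::collections::BTreeMap;

fn main() -> Result<(), fasteval::Error> {
    // Pre-derived jacobian entry for f(x, n) = x^n, i.e. df/dn = x^n * ln(x),
    // evaluated from an expression string instead of re-running AD at every step.
    let mut ns: BTreeMap<String, f64> = BTreeMap::new();
    ns.insert("x".to_string(), 3.0);
    ns.insert("n".to_string(), 2.0);
    let dfdn = fasteval::ez_eval("x^n * log(e(), x)", &mut ns)?;
    println!("{}", dfdn); // ≈ 9 * ln(3)
    Ok(())
}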

Thanks for the great benchmark. The results have alarmed me. I will think about performance more seriously.

Axect avatar Apr 02 '21 12:04 Axect

Thanks! Regarding the jacobian - would it be possible to supply the jacobian matrix externally, just like the optimized function, i.e. as a closure? From what I have seen, most engines allow that - e.g. optimset('dfdp', fDfdp) in Octave, gradient in optimization-engine https://github.com/alphaville/optimization-engine/blob/master/examples/panoc_ex1.rs#L30, etc. Automatic calculation as the default, but with an option of a user-provided function/closure.
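
Roughly something like this - just a sketch of a possible API; set_jacobian and its signature are made up, not part of peroxide:

// Hypothetical API sketch - `set_jacobian` does not exist in peroxide today.
// AD stays the default; the user may optionally supply the jacobian as a closure.
let mut opt = Optimizer::new(data, quad)
    .set_init_param(n_init)
    .set_max_iter(50)
    .set_method(LevenbergMarquardt)
    .set_jacobian(|x: &Vec<f64>, n: &Vec<f64>| -> Matrix {
        // For f(x) = x^n: df/dn = x^n * ln(x); one row per point, one column per parameter
        let col: Vec<f64> = x.iter().map(|&t| t.powf(n[0]) * t.ln()).collect();
        matrix(col, x.len(), 1, Col)
    });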

From my POV, the compilation time is not critical but runtime is. I cross-compile on a powerful PC and run on an embedded ARM. Other users may have it reversed, but probably not.

I think I found a way to make the quad function a closure and use it for both creating measured values and running the optimization. The one-time conversion from AD when generating y is no overhead.

    let gain = 5.0;

    let quad_closure = |x: &Vec<f64>, n: Vec<AD>| -> Option<Vec<AD>> {
        Some(
            x.clone().into_iter()
                .map(|t| AD1(t, 0f64))
                .map(|t| gain * t.pow(n[0]))
                .collect()
        )
    };
    // Noise to image
    let n = vec![AD0(2.0)];
    let mut y = quad_closure(&x, n).unwrap().into_iter().map(|t| t.x()).collect();
    // ...
    let mut opt = Optimizer::new(data, quad_closure);

pavhofman avatar Apr 02 '21 12:04 pavhofman

Actually, I compared the re-compilation times of my example in v. 0.28.1 and v. 0.30.6 and do not notice any difference. The crate itself is compiled only once; no problem waiting a bit, IMO.

Finished release [optimized] target(s) in 1.52s
real	0m1,651s
user	0m4,328s
sys	0m0,205s
Finished release [optimized] target(s) in 1.49s
real	0m1,616s
user	0m4,195s
sys	0m0,230s

pavhofman avatar Apr 02 '21 13:04 pavhofman

As for the iteration runs - some engines have an optional parameter (with some tiny default) for a minimum change ratio of the parameters between iterations (e.g. less than 0.1%, or even as a vector with a value for each parameter), and if an iteration produces a change in all parameters below their minimum change ratio, it stops. Typically the convergence is much faster than 50 runs - below 20 in your example. Perhaps that could be a way towards a major speedup. Estimating the required number of iterations upfront is difficult, IMO.
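
Roughly what I mean, as a sketch (the helper name and its place in the API are made up):

// Sketch of the proposed stop condition (hypothetical helper, not part of peroxide):
// stop when every parameter changed by less than `tol` (relative) in one iteration.
fn below_min_change(prev: &[f64], curr: &[f64], tol: f64) -> bool {
    prev.iter()
        .zip(curr.iter())
        .all(|(p, c)| (c - p).abs() <= tol * p.abs().max(f64::EPSILON))
}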

Sorry for taking so much of your time :-)

pavhofman avatar Apr 02 '21 13:04 pavhofman

I have reread my posts - it sounds like I am just requesting features instead of sending patches. I have just started with Rust, but I can try adding e.g. the minimum change option and send a PR for review/rework, if you agree. Please just let me know when your planned changes are done so that I can start with the latest code. Thanks a lot, I very much appreciate your kind help and effort.

pavhofman avatar Apr 02 '21 19:04 pavhofman

> Thanks! Regarding the jacobian - would it be possible to supply the jacobian matrix externally, just like the optimized function, i.e. as a closure? From what I have seen, most engines allow that - e.g. optimset('dfdp', fDfdp) in Octave, gradient in optimization-engine https://github.com/alphaville/optimization-engine/blob/master/examples/panoc_ex1.rs#L30, etc. Automatic calculation as the default, but with an option of a user-provided function/closure.

Good suggestion! For future version, manual gradient function(or closure) can be allowed. (Unfortunately, in current version, it's impossible)

> Actually, I compared the re-compilation times of my example in v. 0.28.1 and v. 0.30.6 and do not notice any difference. The crate itself is compiled only once; no problem waiting a bit, IMO.
>
> Finished release [optimized] target(s) in 1.52s
> real	0m1,651s
> user	0m4,328s
> sys	0m0,205s
> Finished release [optimized] target(s) in 1.49s
> real	0m1,616s
> user	0m4,195s
> sys	0m0,230s

That is because of the implementation. Until 0.29, optimization depended on Number, not AD. So, in 0.28.1, although AD exists, optimization does not depend on it. Of course, simple enums on the stack are also fast, so maybe the gap is not significant. Until Rust's compile time improves, I think it's okay to keep the status quo.

> As for the iteration runs - some engines have an optional parameter (with some tiny default) for a minimum change ratio of the parameters between iterations (e.g. less than 0.1%, or even as a vector with a value for each parameter), and if an iteration produces a change in all parameters below their minimum change ratio, it stops. Typically the convergence is much faster than 50 runs - below 20 in your example. Perhaps that could be a way towards a major speedup. Estimating the required number of iterations upfront is difficult, IMO.

Oh... I forgot to implement a stop condition. I already implemented that feature in ode, which has a similar API. That is a good point!

> Sorry for taking so much of your time :-)

Don't be - these discussions are also meaningful to me :smile:

> I have reread my posts - it sounds like I am just requesting features instead of sending patches. I have just started with Rust, but I can try adding e.g. the minimum change option and send a PR for review/rework, if you agree. Please just let me know when your planned changes are done so that I can start with the latest code. Thanks a lot, I very much appreciate your kind help and effort.

Just requesting features is also helpful enough for peroxide. Of course, PRs are always welcome! Once I finish the modification of operations between f64 and AD, I'll leave a comment on this thread.

Thanks for your great interest & suggestions! :+1:

Axect avatar Apr 03 '21 12:04 Axect