Femi
Flamegraphs from profiling attached. I haven't studied profiling or flamegraphs enough to build a firm foundation yet, so I can't provide any insights. Each profile was run for 1 min. [Linfa_pls.zip](https://github.com/rust-ml/linfa/files/10018806/Linfa_pls.zip)
[par_azip](https://docs.rs/ndarray/latest/ndarray/macro.par_azip.html) can be used in place of `zip` in our code, as this is where a significant amount of time is spent. Regression-Nipals-5feats-100_000 would benefit from the above; see the screenshots...
Our `param_guard` code:

```rust
/// Performs checking step and calls `fit` on the checked hyperparameters. If checking failed, the
/// checking error is converted to the original error type of...
```
Hmm, did I interpret the flamegraph wrong? I thought that since it was at the top, most of the CPU time was spent there. Or I guess it's possible that it's...
Flamegraphs from profiling attached. Each profile was run for 1 min. [Linfa_linear.zip](https://github.com/rust-ml/linfa/files/10018808/Linfa_linear.zip)
Did a quick review. The 10-feats GLM with 100_000 samples spends most of its CPU time at this step, `ndarray::zip::Zip::for_each`, which is called twice. Both calls share an ancestor of...
Also, if we use the rayon backend, we could look into using [par_azip](https://docs.rs/ndarray/latest/ndarray/macro.par_azip.html) instead of `azip`; it is the parallel version.
I think calling multiplication less often would likely be the bigger performance boost. I can locally test enabling the `matrixmultiply-threading` feature. I didn't profile the BLAS version. Is this desirable...
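For reference, enabling that locally should just be a feature flag on the ndarray dependency (version number here is assumed, adjust to whatever linfa pins):

```toml
[dependencies]
# matrixmultiply-threading lets ndarray's pure-Rust matmul use multiple threads
ndarray = { version = "0.15", features = ["matrixmultiply-threading"] }
```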
Okay, I'll run 2 more: one with BLAS and one with the `matrixmultiply-threading` feature.
The BLAS flamegraph is a little different than the other.