
Parallel Matrix Operations

Open aujxn opened this issue 2 years ago • 3 comments

Currently only sparse-by-sparse products are parallel, in the smmp module. Converting the current sparse-by-dense products to use ndarray::parallel should be straightforward. Here is an implementation of par_csr_mulacc_dense_colmaj that gives a significant speedup on my machine:

pub fn par_csr_mulacc_dense_colmaj<'a, N, A, B, I, Iptr>(
    lhs: CsMatViewI<A, I, Iptr>,
    rhs: ArrayView<B, Ix2>,
    mut out: ArrayViewMut<'a, N, Ix2>,
) where
    A: Send + Sync,
    B: Send + Sync,
    N: 'a + crate::MulAcc<A, B> + Send + Sync,
    I: 'a + SpIndex,
    Iptr: 'a + SpIndex,
{
    assert_eq!(lhs.cols(), rhs.shape()[0], "Dimension mismatch");
    assert_eq!(lhs.rows(), out.shape()[0], "Dimension mismatch");
    assert_eq!(rhs.shape()[1], out.shape()[1], "Dimension mismatch");
    assert!(lhs.is_csr(), "Storage mismatch");

    // Parallelize over the columns of `rhs` and `out`: each output column is
    // written by exactly one thread, so no synchronization is needed.
    let axis1 = Axis(1);
    ndarray::Zip::from(out.axis_iter_mut(axis1))
        .and(rhs.axis_iter(axis1))
        .par_for_each(|mut ocol, rcol| {
            for (orow, lrow) in lhs.outer_iterator().enumerate() {
                let oval = &mut ocol[[orow]];
                for (rrow, lval) in lrow.iter() {
                    let rval = &rcol[[rrow]];
                    oval.mul_acc(lval, rval);
                }
            }
        });
}
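
For reference, here is a minimal usage sketch. It assumes the function ends up exported from sprs (the sprs::prod path and a multi_thread feature gate are assumptions, not the current API) and that ndarray is built with its rayon feature:

use ndarray::{Array2, ShapeBuilder};
use sprs::CsMat;

fn main() {
    // Small operands: a 4x4 identity matrix times a 4x2 dense matrix.
    let lhs: CsMat<f64> = CsMat::eye(4);
    let rhs = Array2::<f64>::ones((4, 2));
    // Column-major, zero-initialized output; the function accumulates into it.
    let mut out = Array2::<f64>::zeros((4, 2).f());
    par_csr_mulacc_dense_colmaj(lhs.view(), rhs.view(), out.view_mut());
    assert_eq!(out, rhs);
}

The only external requirement is enabling ndarray's rayon feature in Cargo.toml, e.g. ndarray = { version = "0.15", features = ["rayon"] } (version number illustrative).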

The only changes here are the parallel iterator, enabling the rayon feature for ndarray, and adding the Send and Sync trait bounds to the data types inside the matrices. My concern is that adding Send + Sync will cause these trait requirements to be propagated unnecessarily in many places.

Looking at the impl Mul for CsMatBase and CsMatBase, I see that Sync + Send is required whether or not multi_thread is enabled. Is it okay to propagate these trait requirements all the way up to many of the trait impls for CsMatBase, and then use conditional feature compilation only on the lowest-level functions in the prod module? Conditionally compiling at all the higher-level implementations sounds like it would get nasty very quickly, especially as more parallel features get added.
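
A sketch of what that dispatch could look like, keeping a single entry point in prod and gating only its body (the seq_ name is hypothetical shorthand for the existing serial loop; this shows the shape of the idea, not the exact sprs API):

pub fn csr_mulacc_dense_colmaj<'a, N, A, B, I, Iptr>(
    lhs: CsMatViewI<A, I, Iptr>,
    rhs: ArrayView<B, Ix2>,
    out: ArrayViewMut<'a, N, Ix2>,
) where
    A: Send + Sync,
    B: Send + Sync,
    N: 'a + crate::MulAcc<A, B> + Send + Sync,
    I: 'a + SpIndex,
    Iptr: 'a + SpIndex,
{
    // Exactly one of these branches is compiled in.
    #[cfg(feature = "multi_thread")]
    par_csr_mulacc_dense_colmaj(lhs, rhs, out);

    #[cfg(not(feature = "multi_thread"))]
    seq_csr_mulacc_dense_colmaj(lhs, rhs, out);
}

Note the trade-off: keeping one signature means the Send + Sync bounds apply even when multi_thread is disabled, which is exactly the propagation concern above.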

aujxn · Apr 22 '22 21:04

This would be perfect as a PR! I see no problem adding the Send + Sync bounds; this is for a matrix element, which it would be quite surprising to see as anything else. We will just have to add the trait bound where necessary; maybe add these bounds to CsMatBase?
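
One hedged option for keeping the bounds in a single place, rather than repeating Send + Sync on every impl, would be a small marker trait (purely hypothetical, not part of sprs):

/// Hypothetical marker for element types usable by the parallel kernels.
pub trait ParallelElement: Send + Sync {}

/// Blanket impl: any Send + Sync type qualifies automatically.
impl<T: Send + Sync> ParallelElement for T {}

Bounding CsMatBase (or the higher-level impls) on such a trait would keep the requirement in one spot if it ever needs to change.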

mulimoen · Apr 23 '22 06:04

This implementation isn't ideal, but it should be better for many sparse-by-dense products. In certain cases, such as when the right-hand side is a dense vector, it is actually still sequential. To optimize further it might be necessary to make CsMatBase::outer_iterator a parallel iterator, but I don't know whether it is possible to zip parallel iterators from sprs with parallel iterators from ndarray using their Zip combinator. I don't have any experience implementing parallel iterators, but I'll look into what it would take.
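
For the dense-vector case specifically, one workaround that avoids implementing a parallel iterator is to parallelize over the rows of the CSR matrix with rayon, since each output entry depends on exactly one sparse row. A minimal sketch (the function name is made up, and unlike the mulacc routines it allocates a fresh output instead of accumulating):

use ndarray::{Array1, ArrayView1};
use rayon::prelude::*;
use sprs::CsMatView;

pub fn par_csr_dense_vec(lhs: CsMatView<f64>, rhs: ArrayView1<f64>) -> Array1<f64> {
    assert_eq!(lhs.cols(), rhs.len(), "Dimension mismatch");
    assert!(lhs.is_csr(), "Storage mismatch");
    let out: Vec<f64> = (0..lhs.rows())
        .into_par_iter()
        .map(|row| {
            // Each output entry reads one sparse row plus the shared rhs,
            // so rows can be computed independently.
            lhs.outer_view(row)
                .unwrap()
                .iter()
                .map(|(col, &val)| val * rhs[col])
                .sum()
        })
        .collect();
    Array1::from(out)
}

Zipping something like this with ndarray's parallel producers through their Zip combinator would still need sprs to expose a rayon-compatible producer for outer_iterator, which is the open question.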

It will probably also be worth building some more benchmarks to compare the difference, but I'm not sure whether Criterion can compare benchmarks across different conditional-compilation directives.
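
Criterion itself is unaware of feature flags, but one possible workflow (an assumption, not something sprs ships) is to run the same benchmark once per configuration and compare saved baselines: cargo bench -- --save-baseline serial without the feature, then cargo bench --features multi_thread -- --save-baseline parallel, and diff the two with Criterion's --baseline flag or the critcmp tool. A minimal benchmark sketch, assuming the existing Mul impl for &CsMat times &Array2:

use criterion::{criterion_group, criterion_main, Criterion};
use ndarray::Array2;
use sprs::CsMat;

fn bench_csr_x_dense(c: &mut Criterion) {
    // Identity matrix as a cheap stand-in for a realistic sparse operand.
    let lhs: CsMat<f64> = CsMat::eye(5_000);
    let rhs = Array2::<f64>::ones((5_000, 32));
    c.bench_function("csr_x_dense", |b| b.iter(|| &lhs * &rhs));
}

criterion_group!(benches, bench_csr_x_dense);
criterion_main!(benches);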

aujxn · Apr 26 '22 20:04

I had need of a parallel {CSR matrix} x {dense vector} multiplication, so I coded it using rayon. It was a nice speedup.

I'm not sufficiently experienced with Rust to know how to go about integrating this into sprs, but then again, the code is so tiny that I think it wouldn't be difficult to just rewrite it in the right way to integrate with what you've already done.

Pointer: https://github.com/chetmurthy/qrusty/blob/f7e47f3a2e9a3d6bec9599de5076e91690bc66c0/qrusty/src/accel.rs#L336

Disclaimer: I really don't have a good idea of how to work with and extend traits, so I didn't even try to make this compatible with the existing traits, but I know that that is something that needs doing.

Disclaimer #2: the code isn't as efficient as it could be: it could assign the results directly to their final locations, but instead it concatenates a bunch of fragment vectors.

chetmurthy · Jul 22 '22 17:07