
How to optimize DaCe SpMV? (unoptimized version 20x slower than SciPy sparse dot)

Open learning-chip opened this issue 3 years ago • 9 comments

Problem description

In the DaCe paper, it is stated that DaCe SpMV is as fast as MKL:

We observe similar results in SpMV, which is more complicated to optimize due to its irregular memory access characteristics. SDFGs are on par with MKL (99.9% performance) on CPU, and are successfully vectorized on GPUs.

However, I found it to be ~20x slower than scipy.sparse.csr_matrix.dot: 1.4 s vs. 60 ms for the problem size used in the DaCe paper.

  • Full code to reproduce: https://gist.github.com/learning-chip/1ef56f6ea707b063c3177e9f143f0905
  • The SpMV code was taken from https://github.com/spcl/dace/blob/v0.10.8/samples/simple/spmv.py
  • DaCe version: 0.10.8
  • Hardware: Intel Xeon 8180 CPU

It is probably because I did not apply any transformations to optimize performance, but I could not find further details about SpMV optimization in the paper, apart from this short paragraph:

Using explicit dataflow is beneficial when defining nontrivial data accesses. Fig. 4 depicts a full implementation of Sparse Matrix-Vector multiplication (SpMV). In the implementation, the access x[A_col[j]] is translated into an indirect access subgraph (see Appendix F) that can be identified and used in transformations.

I also tried this sample code from the paper, which differs slightly from the GitHub version, but got TypeError: dtype must be a DaCe type, got __map_8_b0 at runtime.

# From DaCe paper, Fig. 4
@dace.program
def spmv(A_row: dace.uint32[H + 1], A_col: dace.uint32[nnz],
         A_val: dace.float32[nnz], x: dace.float32[W],
         b: dace.float32[H]):
    for i in dace.map[0:H]:
        for j in dace.map[A_row[i]:A_row[i + 1]]:
            with dace.tasklet:
                a << A_val[j]
                in_x << x[A_col[j]]
                out >> b(1, dace.sum)[i]
                out = a * in_x

Environment

Conda environment.yml is

name: dace
dependencies:
  - python=3.7.5
  - pip
  - pip:
    - jupyterlab
    - scipy==1.6.2
    - dace==0.10.8

learning-chip avatar May 13 '21 08:05 learning-chip

For reference, the SciPy sparse dot code is extremely simple:

template <class I, class T>
void csr_matvec(const I n_row,
                const I n_col,
                const I Ap[],
                const I Aj[],
                const T Ax[],
                const T Xx[],
                      T Yx[])
{
    for(I i = 0; i < n_row; i++){
        T sum = Yx[i];
        for(I jj = Ap[i]; jj < Ap[i+1]; jj++){
            sum += Ax[jj] * Xx[Aj[jj]];
        }
        Yx[i] = sum;
    }
}

Taken from https://github.com/scipy/scipy/blob/v1.6.2/scipy/sparse/sparsetools/csr.h#L1119-L1135
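For readers more comfortable with Python, the C++ kernel above can be mirrored line-for-line in pure Python (illustrative only; the variable names follow SciPy's, and the tiny matrix below is made up for demonstration):

```python
def csr_matvec(n_row, Ap, Aj, Ax, Xx, Yx):
    """Pure-Python mirror of SciPy's csr_matvec: Yx += A @ Xx for a CSR matrix."""
    for i in range(n_row):
        acc = Yx[i]
        for jj in range(Ap[i], Ap[i + 1]):
            acc += Ax[jj] * Xx[Aj[jj]]
        Yx[i] = acc

# Tiny example: A = [[1, 0, 2],
#                    [0, 3, 0]]
Ap = [0, 2, 3]          # row pointers
Aj = [0, 2, 1]          # column indices
Ax = [1.0, 2.0, 3.0]    # nonzero values
x = [1.0, 1.0, 1.0]
y = [0.0, 0.0]
csr_matvec(2, Ap, Aj, Ax, x, y)
print(y)  # [3.0, 3.0]
```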

learning-chip avatar May 13 '21 09:05 learning-chip

@learning-chip Thanks for providing the reproducible notebook. It made everything easier 👍

First of all, what you were timing includes both compilation and running time. This has been fixed since the latest release (the fix should be part of the upcoming v0.10.9, which also includes Python 3.9 support). If you run the code through Binder with the latest master branch, the time goes down to 16.4 ms, though the first run will be slower because it compiles the code once.

If you want to use the current 0.10.8 release, you can also precompile the SDFG yourself:

cspmv = spmv.compile()
%time cspmv(A_row=A_row, A_col=A_col, A_val=A_val, x=x, b=b, H=H, W=W, nnz=nnz)
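More generally, any JIT-style workflow can be timed with a small harness that discards warm-up (compilation) runs. This is a generic sketch, not part of the DaCe API; time_kernel is a hypothetical helper name:

```python
import time

def time_kernel(fn, *args, warmup=1, repeats=10, **kwargs):
    """Time a callable, discarding warm-up runs (e.g., JIT/SDFG compilation).

    Returns the best (minimum) wall-clock time over `repeats` measured calls.
    """
    for _ in range(warmup):
        fn(*args, **kwargs)  # warm-up: triggers compilation, fills caches
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args, **kwargs)
        best = min(best, time.perf_counter() - t0)
    return best

# Hypothetical usage with the compiled SDFG from above:
# best_s = time_kernel(cspmv, A_row=A_row, A_col=A_col, A_val=A_val,
#                      x=x, b=b, H=H, W=W, nnz=nnz)
```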

The second part of the question is the transformations. If I recall correctly, the set of transformations we applied was tiling the internal map and using vectorization to improve the runtime.

tbennun avatar May 13 '21 10:05 tbennun

If you run the code through Binder with the latest master branch the time goes down to 16.4ms

Thanks, it now runs as fast as SciPy on my machine.

the set of transformations we applied was tiling the internal map and using vectorization to improve the runtime.

Anything beyond loop tiling and SIMD? Since SpMV is a typical memory-bound kernel, can DaCe insert software-prefetch instructions such as _mm_prefetch on x86 CPUs? Or is that left to the C++ compiler, which must recognize prefetchable patterns on its own?

learning-chip avatar May 13 '21 11:05 learning-chip

Great question! We don't provide this transformation as part of our standard set, but it's possible to write a simple transformation that gets the memory access patterns from the memlets and inserts a prefetch tasklet. We would be happy to accept any contribution as a PR :)

tbennun avatar May 13 '21 12:05 tbennun

it's possible to write a simple transformation that gets the memory access patterns from the memlets and inserts a prefetch tasklet.

Thanks, will come back if I have any updates 👍

learning-chip avatar May 13 '21 13:05 learning-chip

Oh, I forgot to ask: where can I find the exact code for the loop-tiling and SIMD transformations used by the SpMV benchmark in the DaCe paper? Those seem to be non-trivial optimizations, since tiling and SIMD for sparse operations are not as obvious as they are for GEMM.

learning-chip avatar May 13 '21 13:05 learning-chip

Hi @learning-chip, while I did not contribute to the original paper you mentioned, I suggest you try the standard tiling and vectorization transformations, which can be found [here] and [here]. They should be general enough to handle dynamic maps as well. You might also have a look at [SubgraphFusion] for loop/operator fusion. You can apply transformations either in the GUI or via code, as described in this tutorial. However, I do not know exactly what was done in the original paper (@tbennun) to achieve the reported results.

sancierra avatar May 24 '21 08:05 sancierra

Hello @learning-chip. We do have the optimized SpMV SDFG. Unfortunately, it is based on a quite old version of the SDFG specification, and it is not valid according to the current one. Therefore, I attempted a quick update so that we can at least showcase the optimizations that we did:

[Image: optimized_spmv_cpu SDFG]

The optimizations are basically two:

  • Vectorization of the columns map (for j in dace.map[A_row[i]:A_row[i+1]]), as @sancierra suggested. In the left inner map, we perform vector_length (e.g., 4) indirections of x at a time and store them in a temporary vector, which is then multiplied by A_val[j:j+vector_length]; the result is reduced and added to b. The right map handles the remainder in case A_row[i+1]-A_row[i] is not evenly divisible by vector_length.
  • Output local storage for b in the form of b_tmp, so that the threads do not add directly to b and potentially degrade performance through cache invalidations.
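For intuition, the chunked inner loop described in the first bullet can be sketched in plain Python. VECTOR_LENGTH = 4 and the function name are illustrative; the real SDFG emits SIMD code rather than this scalar loop:

```python
VECTOR_LENGTH = 4  # illustrative; the actual SDFG uses a configurable vector width

def spmv_row_chunked(row_start, row_end, A_col, A_val, x):
    """One row of SpMV processed in VECTOR_LENGTH-sized chunks plus a remainder,
    mirroring the left/right inner maps of the optimized SDFG."""
    b_tmp = 0.0  # thread-local accumulator (the b_tmp output local storage)
    j = row_start
    # "Left" map: full chunks -- gather VECTOR_LENGTH elements of x, multiply, reduce.
    while j + VECTOR_LENGTH <= row_end:
        gathered = [x[A_col[j + k]] for k in range(VECTOR_LENGTH)]
        b_tmp += sum(A_val[j + k] * gathered[k] for k in range(VECTOR_LENGTH))
        j += VECTOR_LENGTH
    # "Right" map: remainder when the row length is not a multiple of VECTOR_LENGTH.
    while j < row_end:
        b_tmp += A_val[j] * x[A_col[j]]
        j += 1
    return b_tmp

# A made-up row with 5 nonzeros (one full chunk of 4, one remainder element):
result = spmv_row_chunked(0, 5, [0, 1, 2, 0, 1],
                          [1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 1.0, 1.0])
print(result)  # 15.0
```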

I am attaching the (updated) SDFG and a script to run it below (in the zip file); however, there are a couple of caveats:

First, it will not validate as-is because we have changed how edges with write-conflict resolution work. These edges used to have a property called wcr_identity, which effectively initialized the data they pointed to at the start of the current scope. This feature was used to initialize b_tmp to zero at the start of every iteration of the row map (for i in dace.map[0:H]). Since wcr_identity was removed (for reasons beyond the scope of this discussion), b_tmp is not initialized and garbage values are returned. A temporary quick fix is to make a very simple change to the generated code:

            #pragma omp parallel for
            for (auto i = 0; i < H; i += 1) {
                unsigned int rowptr;
                unsigned int rowend;
                float b_tmp; // <---- SET THIS TO ZERO (float b_tmp = 0;)

Second, there might be some performance regression (I am not sure, though) due to the use of atomics in the edges with write-conflict resolution. However, it is easy to disable them if needed.

I hope this helps!

optimized_spmv_cpu.zip

alexnick83 avatar May 24 '21 21:05 alexnick83

@sancierra @alexnick83 Wonderful, thanks for the detailed replies! Let me take a closer look.

learning-chip avatar May 25 '21 02:05 learning-chip