How to optimize DaCe SpMV? (unoptimized version 20x slower than SciPy sparse dot)
Problem description
In the DaCe paper, it is stated that DaCe SpMV is as fast as MKL:
We observe similar results in SpMV, which is more complicated to optimize due to its irregular memory access characteristics. SDFGs are on par with MKL (99.9% performance) on CPU, and are successfully vectorized on GPUs.
However, I found it to be 20x slower than scipy.sparse.csr_matrix.dot: 1.4 s vs. 60 ms for the problem size used in the DaCe paper.
- Full code to reproduce: https://gist.github.com/learning-chip/1ef56f6ea707b063c3177e9f143f0905
- The SpMV code was taken from https://github.com/spcl/dace/blob/v0.10.8/samples/simple/spmv.py
- DaCe version: 0.10.8
- Hardware: Intel Xeon 8180 CPU
This is probably because I did not apply any transformations to optimize performance, but I could not find anything more about SpMV optimization in the paper beyond this short paragraph:
Using explicit dataflow is beneficial when defining nontrivial data accesses. Fig. 4 depicts a full implementation of Sparse Matrix-Vector multiplication (SpMV). In the implementation, the access x[A_col[j]] is translated into an indirect access subgraph (see Appendix F) that can be identified and used in transformations.
I also tried the sample code from the paper, which is slightly different from the GitHub version, but got TypeError: dtype must be a DaCe type, got __map_8_b0 at runtime.
# From DaCe paper fig. 4
import dace

# H, W, nnz are assumed to be dace symbols, as in the GitHub sample
H, W, nnz = dace.symbol('H'), dace.symbol('W'), dace.symbol('nnz')

@dace.program
def spmv(A_row: dace.uint32[H + 1], A_col: dace.uint32[nnz],
         A_val: dace.float32[nnz], x: dace.float32[W],
         b: dace.float32[H]):
    for i in dace.map[0:H]:
        for j in dace.map[A_row[i]:A_row[i + 1]]:
            with dace.tasklet:
                a << A_val[j]
                in_x << x[A_col[j]]
                out >> b(1, dace.sum)[i]
                out = a * in_x
Environment
Conda environment.yml is
name: dace
dependencies:
  - python=3.7.5
  - pip
  - pip:
      - jupyterlab
      - scipy==1.6.2
      - dace==0.10.8
For reference, the SciPy sparse dot code is extremely simple:
template <class I, class T>
void csr_matvec(const I n_row,
                const I n_col,
                const I Ap[],
                const I Aj[],
                const T Ax[],
                const T Xx[],
                      T Yx[])
{
    for(I i = 0; i < n_row; i++){
        T sum = Yx[i];
        for(I jj = Ap[i]; jj < Ap[i+1]; jj++){
            sum += Ax[jj] * Xx[Aj[jj]];
        }
        Yx[i] = sum;
    }
}
Taken from https://github.com/scipy/scipy/blob/v1.6.2/scipy/sparse/sparsetools/csr.h#L1119-L1135
@learning-chip Thanks for providing the reproducible notebook. It made everything easier 👍
First of all, what you were timing includes both compilation and running time. This has been fixed since the latest release (the fix should be part of the upcoming v0.10.9, which also includes Python 3.9 support). If you run the code through Binder with the latest master branch, the time goes down to 16.4 ms, but the first run will be slower since it compiles the code once.
If you want to use the current 0.10.8 release, you can also precompile the SDFG yourself:
cspmv = spmv.compile()  # compile once ahead of time, so timing excludes compilation
%time cspmv(A_row=A_row, A_col=A_col, A_val=A_val, x=x, b=b, H=H, W=W, nnz=nnz)
The second part of the question is the transformations. If I recall correctly, the set of transformations we applied was tiling the internal map and using vectorization to improve the runtime.
If you run the code through Binder with the latest master branch the time goes down to 16.4ms
Thanks, it now runs as fast as SciPy on my machine.
the set of transformations we applied was tiling the internal map and using vectorization to improve the runtime.
Anything beyond loop tiling and SIMD? Since SpMV is a typical memory-bound kernel, can DaCe insert software prefetch instructions such as _mm_prefetch for x86 CPU? Or is it left to the C++ compiler to recognize prefetch-able patterns?
Great question! We don't provide this transformation as part of our standard set, but it's possible to write a simple transformation that gets the memory access patterns from the memlets and inserts a prefetch tasklet. We would be happy to accept any contribution as a PR :)
it's possible to write a simple transformation that gets the memory access patterns from the memlets and inserts a prefetch tasklet.
Thanks, will come back if I have any updates 👍
Oh I forgot to ask -- where can I find the exact code for loop tiling and SIMD transformations, which were used by the SpMV benchmark in the DaCe paper? Those seem to be non-trivial optimizations, as tiling and SIMD for sparse operations are not as obvious as for GEMM.
Hi @learning-chip, while I did not contribute to the original paper you mentioned, I guess you should try the standard tiling and vectorization transformations, which can be found [here] and [here]. They should be general enough for dynamic maps as well. You might also have a look at [SubgraphFusion] for loop/operator fusion. You can apply transformations either in the GUI or via code, as stated in this tutorial (see the sketch below). I do not know what exactly they did in the OG paper (@tbennun) to achieve said results, however.
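For illustration, here is a minimal sketch of applying such transformations programmatically rather than through the GUI. It assumes the standard MapTiling and Vectorization classes from dace.transformation.dataflow and the spmv program defined above; the option values are arbitrary examples, and whether these transformations apply cleanly to the dynamic-range inner map may depend on the DaCe version.

import dace
from dace.transformation.dataflow import MapTiling, Vectorization

# Lower the @dace.program to an SDFG so transformations can be applied
sdfg = spmv.to_sdfg()

# Tile a map (tile size 128 is an arbitrary example value)
sdfg.apply_transformations(MapTiling, options={'tile_sizes': (128,)})

# Vectorize with an example vector length of 4
sdfg.apply_transformations(Vectorization, options={'vector_len': 4})

# Compile the transformed SDFG and call it like the original program
cspmv = sdfg.compile()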
Hello @learning-chip. We do have the optimized SpMV SDFG. Unfortunately, it is based on a quite old version of the SDFG specification, and it is not valid according to the current one. Therefore, I attempted a quick update so that we can at least showcase the optimizations that we did:

The optimizations are basically two (a plain-NumPy sketch of the combined scheme follows the list):
- Vectorization of the columns map (for j in dace.map[A_row[i]:A_row[i+1]]), as @sancierra suggested. On the left inner map, we perform vector_length (e.g., 4) indirections of x at a time and store them in a temporary vector, which is then multiplied by A_val[j:j+vector_length], while the result is reduced down and added to b. The right map handles the remainder in case A_row[i+1] - A_row[i] does not divide evenly by vector_length.
- Output local storage for b in the form of b_tmp, so that the threads don't add directly to b and potentially cause low performance due to cache invalidations.
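To make the scheme concrete, here is a plain-NumPy sketch of the per-row computation described above. It is only an illustration of the idea, not the SDFG itself; the function name and the vector_length default are made up for the example.

import numpy as np

def spmv_row_scheme(A_row, A_col, A_val, x, vector_length=4):
    # Per-row scheme of the optimized SDFG: process vector_length nonzeros
    # at a time (left map), handle the remainder separately (right map),
    # and accumulate into a per-row temporary (b_tmp) instead of updating
    # b repeatedly.
    H = len(A_row) - 1
    b = np.zeros(H, dtype=A_val.dtype)
    for i in range(H):                                   # row map (parallel in the SDFG)
        start, end = A_row[i], A_row[i + 1]
        b_tmp = A_val.dtype.type(0)                      # output local storage
        vec_end = start + (end - start) // vector_length * vector_length
        for j in range(start, vec_end, vector_length):   # left, vectorized map
            gathered = x[A_col[j:j + vector_length]]     # indirections of x
            b_tmp += np.dot(A_val[j:j + vector_length], gathered)
        for j in range(vec_end, end):                    # right, remainder map
            b_tmp += A_val[j] * x[A_col[j]]
        b[i] = b_tmp
    return b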
I am attaching the (updated) SDFG and a script to run it below (in the zip file); however, there are a couple of caveats:
First, it will not validate as-is because we have changed how edges with write-conflict resolution work. These edges used to have a property called wcr_identity that effectively initialized the data they were pointing to at the start of the current scope.
This feature was used to initialize b_tmp to zero at the start of every iteration of the row map (for i in dace.map[0:H]). Since wcr_identity was removed (for reasons that might be out of this discussion's scope), b_tmp is not initialized and garbage values are returned. A temporary quick fix is to make a very simple change to the generated code:
#pragma omp parallel for
for (auto i = 0; i < H; i += 1) {
    unsigned int rowptr;
    unsigned int rowend;
    float b_tmp; // <---- SET THIS TO ZERO (float b_tmp = 0;)
Second, there might be some performance regression (I am not sure, though) due to the use of atomics in the edges with write-conflict resolution. However, it is easy to disable if needed.
I hope this helps!
@sancierra @alexnick83 Wonderful, thanks for the detailed replies! Let me take a closer look.