DFTK.jl
Threading discussion
Our threading is a bit all over the place right now.
Linalg threading is automatically taken care of, and can be improved by installing MKL.
I think FFT threading sucks more than it should, and I'm not sure what we can do about that, except maybe do benchmarks, compare to a C version, and report upstream. A good thing to try is to use MKL FFT (see https://github.com/JuliaMath/FFTW.jl#mkl) and compare performance. If it improves, it means that FFTW's (= julia's) threading sucks.
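For reference, a minimal sketch of what switching the FFT backend and the thread counts looks like (assuming a recent FFTW.jl that exposes set_provider!; a Julia restart is needed after changing the provider):

using FFTW, LinearAlgebra

# switch FFTW.jl to the MKL backend (takes effect after restarting Julia)
FFTW.set_provider!("mkl")

# thread counts for the two libraries, independent of Julia threads
FFTW.set_num_threads(4)
BLAS.set_num_threads(4)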
Already if we do the above two well, I believe we should scale relatively well. Julia threading by itself should be unnecessary (and even possibly counterproductive). Maybe just on a few big array operations (densities and potentials): to be determined by profiling once we take care of FFT threading. For those we should either do the loop ourselves or use a package such as https://github.com/Jutho/Strided.jl.
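For those pointwise operations something like Strided.jl's @strided is probably enough. Minimal sketch (array names are made up):

using Strided

ρ = randn(120, 75, 75)   # a density-sized array
V = similar(ρ)
# a plain broadcast runs on one thread; @strided splits it over Threads.nthreads()
@strided V .= log.(abs.(ρ) .+ 1)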
As a guess, I would say we can use that to get pretty good intra-node scaling. With this plus MPI for kpoints, I would be surprised if we can't get >50% efficiency on relatively large computations (eg 10 nodes each with 16 cores), which should be enough to tackle relatively big problems.
So I did some very stupid benchmarking: 8x1x1 supercell of silicon, Kerker mixing, Ecut 30, no kpoints. There is some variation in the number of hamiltonian applies, but not enough to affect the results by more than 10%. I don't use julia threads, only FFTW and BLAS.
With standard FFTW and OpenBLAS: 1 thread 26 s, 2 threads 19.5 s. With MKL FFT and BLAS: 1 thread 22.5 s, 2 threads 17.3 s.
So MKL is slightly faster, but no significant impact on threading scaling.
So it looks like FFTs just don't scale. Looking at individual timings, even pure linalg functions do not scale perfectly (although better than FFTs). This is pretty surprising to me. So either my laptop sucks, or threading sucks. :-(
Based on this, it seems julia-level threading might be the way to go after all? If julia and BLAS threading are not active at the same time (which I think is already the case right now), we can just set FFT threads to 1, and julia and BLAS threads to n. Essentially the only thing missing is a @threads on the bands loop in compute_partial_density (see the sketch below). Then we benchmark and take a look.
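A minimal sketch of that pattern (not the actual compute_partial_density; DFTK's transforms are stood in by plain FFTW plans): thread over the bands, give each thread its own scratch buffer and partial density, and sum at the end so there is no race on the accumulator.

using FFTW, LinearAlgebra

function partial_density_threaded(ψ_bands, occupation, fft_size)
    nt = Threads.nthreads()
    ρ_locals = [zeros(fft_size...) for _ = 1:nt]              # one accumulator per thread
    bufs     = [zeros(ComplexF64, fft_size...) for _ = 1:nt]  # one scratch grid per thread
    plans    = [plan_bfft!(bufs[t]) for t = 1:nt]             # unnormalized inverse FFT
    Threads.@threads for n = 1:length(ψ_bands)
        t = Threads.threadid()
        bufs[t] .= ψ_bands[n]             # band already on the dense grid in this toy version
        mul!(bufs[t], plans[t], bufs[t])  # Fourier -> real space
        ρ_locals[t] .+= occupation[n] .* abs2.(bufs[t])
    end
    sum(ρ_locals)                         # reduce the per-thread partial densities
end

# toy usage
fft_size = (20, 20, 20)
ψ_bands = [randn(ComplexF64, fft_size...) for _ = 1:8]
ρ = partial_density_threaded(ψ_bands, fill(2.0, 8), fft_size)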
Also we should look at kernels (GEMM, FFT) with sizes corresponding to test cases we care about and microbenchmark them to look at scalings.
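Something like this, per kernel and per thread count (sketch; the sizes are placeholders to be replaced by the ones from the test cases we care about):

using BenchmarkTools, FFTW, LinearAlgebra

FFTW.set_num_threads(parse(Int, get(ENV, "FFT_THREADS", "1")))
BLAS.set_num_threads(parse(Int, get(ENV, "BLAS_THREADS", "1")))

# 3D FFT on a typical dense grid
A = randn(ComplexF64, 120, 75, 75)
p = plan_fft!(A, flags=FFTW.MEASURE)
@btime mul!($A, $p, $A)

# tall-skinny GEMM of the shape showing up in orthogonalizations (sizes made up)
X = randn(ComplexF64, 25_000, 40)
Y = randn(ComplexF64, 40, 40)
@btime $X * $Y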
Julia threading is not much better here... I think it comes down to FFTs being bandwidth bound. Memory bandwidth doesn't scale on my CPU so nothing really makes any difference here. Results are likely to be different on a cluster.
Of course it's highly implementation- and machine-dependent, but basically FFTs have an arithmetic intensity of ~1 flop per byte, which is about the crossing point of the roofline model, so FFTs are bandwidth bound (or close to it). I don't know how the memory of the cluster is laid out, but basically there's no hope of scaling to the full 16 cores of a node, no matter what we do. So MPI is the way to go, to use many nodes. GPUs might be a good alternative.
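Back-of-envelope for a dense grid of the size used in the benchmarks below, just to make the roofline argument concrete (the ~3 passes over the data is a rough assumption):

# rough arithmetic intensity of a complex 3D FFT on a 120x75x75 grid
N     = 120 * 75 * 75            # grid points
flops = 5 * N * log2(N)          # textbook ~5 N log2(N) flops for a complex FFT
bytes = 2 * 16 * N * 3           # read + write of ComplexF64, assuming ~3 passes over the data
println("arithmetic intensity ≈ ", round(flops / bytes, digits=2), " flop/byte")
# ≈ 1 flop/byte, i.e. right around the roofline crossover on typical CPUs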
With 2 kpoints and MPI it's about the same: I get a roughly 30% speedup, but not more. So I would say that's a pretty strong indicator that this is what's going on.
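For context, the kpoint parallelization pattern being tested is essentially this (sketch with MPI.jl; the rank/kpoint bookkeeping is made up and the per-kpoint work is omitted):

using MPI
MPI.Init()
comm  = MPI.COMM_WORLD
rank  = MPI.Comm_rank(comm)
nproc = MPI.Comm_size(comm)

nkpt = 2
my_kpts = rank+1:nproc:nkpt          # round-robin distribution of kpoints over ranks

ρ = zeros(120, 75, 75)
for ik in my_kpts
    # accumulate this kpoint's contribution to ρ here (omitted)
end
MPI.Allreduce!(ρ, +, comm)           # sum the partial densities over all ranks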
If julia and BLAS threading are not active at the same time (which I think is already the case right now)
No you have to actively make sure about that.
I have the rigorous results for the pre-MPI levels of threading we have right now. The data mostly agrees with what you said. Basically n_julia = n_blas and n_fftw = 1 is usually the best strategy, with the exception being cases with very large grids and few bands. In some cases n_fftw = 2 actually hurts performance a lot.
No you have to actively make sure about that.
I mean that in the code both are not active at the same time. We thread on bands for the hamiltonian application (only the local potential and kinetic, the nonlocal is computed separately) and on kpoints for the densities. There are no BLAS calls inside these (at least no substantial ones).
Ah I see. Yes that's true.
I do get perfect scaling on an in-place FFT of the same size as the one I used in the tests. So possibly we are limited by copies, zero-filling and indirect indexing in the shuffling of the G vectors.
For sure that part is not parallel.
OK, it's most likely the shuffle. Also there's something just weird with julia threads on my machine. Can you run the following gist on the machines you have available, with
JULIA_NUM_THREADS=1 julia scratch.jl 1
JULIA_NUM_THREADS=2 julia scratch.jl 1
JULIA_NUM_THREADS=1 julia scratch.jl 2
? https://gist.github.com/antoine-levitt/47bee24d1e3120c8f7b76c44323cc32e
I'd be careful with the perm due to the closure issue in the Threads.@threads
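(For what it's worth, in the gist perm is a const global so it should be fine; the usual workarounds when a captured variable would otherwise be a non-const global or get boxed are a let block or passing the arrays as function arguments. Minimal sketch of the latter:)

using Random
function shuffle_into!(out, perm)        # arguments are local and concretely typed, so the
    Threads.@threads for i in eachindex(out)   # @threads closure captures them without boxing
        out[i] = perm[i]                 # dummy work, just to exercise the capture
    end
end
shuffle_into!(zeros(100), randperm(100))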
antoine@beta ~ $ JULIA_NUM_THREADS=1 julia scratch.jl 1
1 Julia threads
1 FFT threads
1 OP FFT (ms, nothread): 3.0656849999999998
1 OP FFT (ms, thread): 3.0829690000000003
1 IP FFT (ms, nothread): 3.090488
1 IP FFT (ms, thread): 3.097841
1 zero (ms, nothread): 0.22622699999999998
1 zero (ms, thread): 0.234516
1 shuff (ms, nothread): 2.405861
1 shuff (ms, thread): 2.458017
20 OP FFT (ms, nothread): 4.50236115
20 OP FFT (ms, thread): 4.49560955
20 IP FFT (ms, nothread): 4.5057681
20 IP FFT (ms, thread): 4.34425855
20 zero (ms, nothread): 0.9001383999999999
20 zero (ms, thread): 0.9011944500000001
20 shuff (ms, nothread): 3.9395521000000002
20 shuff (ms, thread): 3.99466725
antoine@beta ~ $ JULIA_NUM_THREADS=2 julia scratch.jl 1
2 Julia threads
1 FFT threads
1 OP FFT (ms, nothread): 3.092664
1 OP FFT (ms, thread): 3.310207
1 IP FFT (ms, nothread): 3.094226
1 IP FFT (ms, thread): 3.286885
1 zero (ms, nothread): 0.227385
1 zero (ms, thread): 0.24474
1 shuff (ms, nothread): 2.456639
1 shuff (ms, thread): 2.4547839999999996
20 OP FFT (ms, nothread): 4.4639671
20 OP FFT (ms, thread): 2.74513365
20 IP FFT (ms, nothread): 4.4874934
20 IP FFT (ms, thread): 2.7548755000000003
20 zero (ms, nothread): 0.9148937500000001
20 zero (ms, thread): 0.9689609499999999
20 shuff (ms, nothread): 3.77124165
20 shuff (ms, thread): 4.80794445
antoine@beta ~ $ JULIA_NUM_THREADS=1 julia scratch.jl 2
1 Julia threads
2 FFT threads
1 OP FFT (ms, nothread): 1.6778229999999998
1 OP FFT (ms, thread): 1.698907
1 IP FFT (ms, nothread): 1.64182
1 IP FFT (ms, thread): 1.662633
1 zero (ms, nothread): 0.223545
1 zero (ms, thread): 0.232068
1 shuff (ms, nothread): 2.39668
1 shuff (ms, thread): 2.454685
20 OP FFT (ms, nothread): 2.6543426500000002
20 OP FFT (ms, thread): 2.6570268
20 IP FFT (ms, nothread): 2.65212585
20 IP FFT (ms, thread): 2.6525905499999998
20 zero (ms, nothread): 0.9025374500000001
20 zero (ms, thread): 0.9006238999999999
20 shuff (ms, nothread): 3.8307086999999997
20 shuff (ms, thread): 3.86930885
Then again the shuffle is exaggerated here, since ours is ~16 times smaller
It's fine, it's a constant here
So, some conclusions:
- Doing one FFT a lot of times is ~30% faster than doing many FFTs a lot of times. Probably a memory locality issue.
- FFTs are ~5x slower than simply passing over the data
- Shuffles are about the same cost as FFTs
- FFT threading works relatively well. Both FFT and "band" threading work about equally well.
- Passing over the data doesn't scale at all; shuffling actually anti-scales.
Will update the gist with a more realistic 16x smaller shuffling, don't run the tests yet
https://gist.github.com/antoine-levitt/ad0a5c079c0c21d5b5c201bb3463341a
1 Julia threads
1 FFT threads
1 OP FFT (ms, nothread): 3.0981020000000004
1 OP FFT (ms, thread): 3.112112
1 IP FFT (ms, nothread): 3.070499
1 IP FFT (ms, thread): 3.087192
1 zero (ms, nothread): 0.225986
1 zero (ms, thread): 0.232553
1 shuff (ms, nothread): 0.064763
1 shuff (ms, thread): 0.06961500000000001
20 OP FFT (ms, nothread): 4.40295365
20 OP FFT (ms, thread): 4.4727388999999995
20 IP FFT (ms, nothread): 4.4169778
20 IP FFT (ms, thread): 4.39876445
20 zero (ms, nothread): 0.9007288
20 zero (ms, thread): 0.9101356999999999
20 shuff (ms, nothread): 0.42482214999999995
20 shuff (ms, thread): 0.4216371
2 Julia threads
1 FFT threads
1 OP FFT (ms, nothread): 3.0806509999999996
1 OP FFT (ms, thread): 3.2894069999999997
1 IP FFT (ms, nothread): 3.072821
1 IP FFT (ms, thread): 3.280761
1 zero (ms, nothread): 0.22847800000000001
1 zero (ms, thread): 0.24864800000000004
1 shuff (ms, nothread): 0.06420000000000001
1 shuff (ms, thread): 0.072241
20 OP FFT (ms, nothread): 4.47046475
20 OP FFT (ms, thread): 2.7399812999999997
20 IP FFT (ms, nothread): 4.45843685
20 IP FFT (ms, thread): 2.73897325
20 zero (ms, nothread): 0.89145825
20 zero (ms, thread): 1.00910495
20 shuff (ms, nothread): 0.4233373
20 shuff (ms, thread): 0.38536115
1 Julia threads
2 FFT threads
1 OP FFT (ms, nothread): 1.673157
1 OP FFT (ms, thread): 1.682847
1 IP FFT (ms, nothread): 1.657648
1 IP FFT (ms, thread): 1.674772
1 zero (ms, nothread): 0.221874
1 zero (ms, thread): 0.23099
1 shuff (ms, nothread): 0.064282
1 shuff (ms, thread): 0.069161
20 OP FFT (ms, nothread): 2.6773622
20 OP FFT (ms, thread): 2.68048485
20 IP FFT (ms, nothread): 2.66627665
20 IP FFT (ms, thread): 2.66890895
20 zero (ms, nothread): 0.9238886
20 zero (ms, thread): 0.9316450000000001
20 shuff (ms, nothread): 0.42073515000000006
20 shuff (ms, thread): 0.4213948
So that doesn't really explain it. Shuffles don't scale, but they are marginal compared to the FFT grid operations.
My desktop (openblas, fftw):
1 Julia threads
1 FFT threads
1 OP FFT (ms, nothread): 5.699495
1 OP FFT (ms, thread): 5.786136
1 IP FFT (ms, nothread): 5.621367
1 IP FFT (ms, thread): 5.42435
1 zero (ms, nothread): 0.283626
1 zero (ms, thread): 0.302891
1 shuff (ms, nothread): 2.2300530000000003
1 shuff (ms, thread): 2.215383
20 OP FFT (ms, nothread): 5.7073655500000005
20 OP FFT (ms, thread): 5.840356750000001
20 IP FFT (ms, nothread): 5.8854368
20 IP FFT (ms, thread): 5.727769149999999
20 zero (ms, nothread): 0.4712422
20 zero (ms, thread): 0.47405585000000006
20 shuff (ms, nothread): 2.53749955
20 shuff (ms, thread): 2.50997615
1 Julia threads
2 FFT threads
1 OP FFT (ms, nothread): 2.819764
1 OP FFT (ms, thread): 2.85293
1 IP FFT (ms, nothread): 2.836836
1 IP FFT (ms, thread): 2.851259
1 zero (ms, nothread): 0.264587
1 zero (ms, thread): 0.271665
1 shuff (ms, nothread): 2.156283
1 shuff (ms, thread): 2.1864250000000003
20 OP FFT (ms, nothread): 2.99389355
20 OP FFT (ms, thread): 3.0105041999999997
20 IP FFT (ms, nothread): 3.0007459
20 IP FFT (ms, thread): 3.0353887
20 zero (ms, nothread): 0.4753942
20 zero (ms, thread): 0.47681375000000004
20 shuff (ms, nothread): 2.5315598500000003
20 shuff (ms, thread): 2.56014645
2 Julia threads
1 FFT threads
1 OP FFT (ms, nothread): 5.426804000000001
1 OP FFT (ms, thread): 5.597681000000001
1 IP FFT (ms, nothread): 5.4196100000000005
1 IP FFT (ms, thread): 5.5905499999999995
1 zero (ms, nothread): 0.259654
1 zero (ms, thread): 0.271335
1 shuff (ms, nothread): 2.179247
1 shuff (ms, thread): 2.223623
20 OP FFT (ms, nothread): 5.7493665
20 OP FFT (ms, thread): 3.0053435
20 IP FFT (ms, nothread): 5.76326915
20 IP FFT (ms, thread): 3.0073784
20 zero (ms, nothread): 0.47232145
20 zero (ms, thread): 0.444852
20 shuff (ms, nothread): 2.5365754000000003
20 shuff (ms, thread): 2.489903
About similar then
Well, for me FFT threading sucks more.
If you run it on the cluster, can you also look at 1xN and Nx1 with N = 1, 2, 4, 8, 16? Set nreps to 32 rather than 20 to reduce noise.
It's not that bad though, almost a 2x speedup.
OK so: FFTs do thread relatively well, on my laptop as well as on the cluster. They are the limiting step in the benchmarks I'm running. Then why do I get such crappy scaling?! Next step to debug this (it's going to drive me mad) is modifying the minibenchmark script to better reflect the computations we do, and see if that scales.
Thanks for looking into this. I'm also surprised about my bad scaling results for practical examples.
I thought maybe it had something to do with locking, but I checked and that's not the case
Meet mini-DFTK:
using BenchmarkTools
using FFTW
using LinearAlgebra  # for mul!
using Random
fft_threads = isempty(ARGS) ? 1 : parse(Int, ARGS[1])
FFTW.set_num_threads(fft_threads)
println("$(Threads.nthreads()) Julia threads")
println("$fft_threads FFT threads")
N1 = 120
N2 = 75
N3 = 75
nrep_max = 51
grid_ratio_factor = 27
NG = div(N1*N2*N3, grid_ratio_factor)
const as = [randn(ComplexF64, N1, N2, N3) for n = 1:nrep_max]
const bs = [randn(ComplexF64, N1, N2, N3) for n = 1:nrep_max]
const cs = [randn(ComplexF64, NG) for n = 1:nrep_max]
const ds = [randn(ComplexF64, NG) for n = 1:nrep_max]
const pot = randn(N1, N2, N3)
const kin = randn(NG)
const op = plan_fft(as[1])                     # out-of-place plan (not used below)
const ip = plan_fft!(as[1])                    # in-place plan
const perm = Random.randperm(N1*N2*N3)[1:NG]   # random sphere -> dense grid mapping (the "shuffle")
# apply the kinetic + local-potential part of the Hamiltonian, one band per loop iteration
function bench_apply_threads(nreps)
    Threads.@threads for n = 1:nreps
        ψ = cs[n]
        Hψ = ds[n]
        ψ_real = as[n]
        fill!(ψ_real, 0)             # zero the dense grid
        ψ_real[perm] .= ψ            # scatter the G-sphere coefficients (the "shuffle")
        mul!(ψ_real, ip, ψ_real)     # in-place FFT to real space
        ψ_real .*= pot               # multiply by the local potential
        mul!(ψ_real, ip, ψ_real)     # second in-place FFT (forward plan reused; cost equivalent to the inverse)
        Hψ .= view(ψ_real, perm)     # gather back onto the sphere
        Hψ .+= kin .* ψ              # add the kinetic term
    end
end
# pure in-place FFTs of the same size, for comparison
function bench_ip_threads(nreps)
    Threads.@threads for n = 1:nreps
        mul!(as[n], ip, as[n])
    end
end
nreps = nrep_max
println("$nreps applies (ms) : ", 1000/nreps*@belapsed bench_apply_threads($nreps))
println("$nreps 2xFFTs (ms) : ", 2000/nreps*@belapsed bench_ip_threads($nreps))
I took the numbers from a 3x2x2 silicon supercell, Ecut 30. Btw, in this case the factor between the two grids is pretty big; using optimize_fft_grid reduces it to 22.
Using this I get
antoine@beta ~ $ JULIA_NUM_THREADS=1 julia scratch.jl; JULIA_NUM_THREADS=2 julia scratch.jl
1 Julia threads
1 FFT threads
51 applies (ms) : 17.745557
51 2xFFTs (ms) : 14.747430941176473
2 Julia threads
1 FFT threads
51 applies (ms) : 12.675170470588235
51 2xFFTs (ms) : 8.688853137254902
So the speedup factor going from 1 to 2 threads is ~1.4 for the apply, ~1.7 for the pure FFT, and ~1.25 for DFTK. I'm guessing single-core FFTs are essentially sitting on the fence of being bandwidth limited (as seen from the fact that the applies take a bit more time than the FFTs). Threading increases the FLOPs but not the bandwidth. DFTK is probably worse because there the memory is "cold", since other things have happened between iterations. Not really a lot we can do here... In any case, if you know threading experts, the above is a nice reproducer. With https://github.com/JuliaMolSim/DFTK.jl/pull/354 in, it's exactly representative of the kinetic+local apply.
Crazily enough, FFT normalizations have a non-negligible impact!
With scaled plans, speedup of apply is ~1.33, speedup of FFTs is ~1.5. That's much closer to DFTK results.
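i.e. a scaled plan does a separate normalization pass over the whole array, which is pure memory traffic, while doing the scaling "manually" lets it be fused into a broadcast that touches the data anyway. Minimal sketch (not the DFTK code):

using FFTW

a   = randn(ComplexF64, 120, 75, 75)
pot = randn(120, 75, 75)
p   = plan_fft!(a)

p_scaled = (1 / length(a)) * p   # AbstractFFTs ScaledPlan
a = p_scaled * a                 # transform, then an extra rescaling pass over the array

a = p * a                        # unnormalized transform...
a .*= pot ./ length(a)           # ...normalization fused into the potential multiply, no extra pass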
So we just do the scaling manually?
Done :-)