DFTK.jl
Threading discussion
Our threading is a bit all over the place right now.
Linalg threading is automatically taken care of, and can be improved by installing MKL.
I think FFT threading sucks more than it should, and I'm not sure what we can do about that, except maybe do benchmarks, compare to a C version, and report upstream. A good thing to try is to use MKL FFT (see https://github.com/JuliaMath/FFTW.jl#mkl) and compare performance. If it improves, it means that FFTW's (= julia's) threading sucks.
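For reference, a minimal sketch of what switching the FFT backend and the thread counts looks like (assuming a recent FFTW.jl that exposes set_provider!; a Julia restart is needed after changing the provider):

using FFTW, LinearAlgebra

# switch FFTW.jl to the MKL backend (takes effect after restarting Julia)
FFTW.set_provider!("mkl")

# thread counts for the two libraries, independent of Julia threads
FFTW.set_num_threads(4)
BLAS.set_num_threads(4)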
Already if we do the above two well, I believe we should scale relatively well. Julia threading by itself should be unnecessary (and even possibly counterproductive). Maybe just on a few big array operations (densities and potentials): to be determined by profiling once we take care of FFT threading. For those we should either do the loop ourselves or use a package such as https://github.com/Jutho/Strided.jl.
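For those pointwise operations something like Strided.jl's @strided is probably enough. Minimal sketch (array names are made up):

using Strided

ρ = randn(120, 75, 75)   # a density-sized array
V = similar(ρ)
# a plain broadcast runs on one thread; @strided splits it over Threads.nthreads()
@strided V .= log.(abs.(ρ) .+ 1)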
As a guess, I would say we can use that to get pretty good intra-node scaling. With this plus MPI for kpoints, I would be surprised if we can't get >50% efficiency on relatively large computations (eg 10 nodes each with 16 cores), which should be enough to tackle relatively big problems.
So I did some very stupid benchmarking: 8x1x1 supercell of silicon, Kerker mixing, Ecut 30, no kpoints. There is some variation in the number of hamiltonian applies, but not enough to affect the results by more than 10%. I don't use julia threads, only FFTW and BLAS.
With standard FFTW and OpenBLAS: 1 thread 26 s, 2 threads 19.5 s. With MKL FFT and BLAS: 1 thread 22.5 s, 2 threads 17.3 s.
So MKL is slightly faster, but no significant impact on threading scaling.
So it looks like FFTs just don't scale. Looking at individual timings, even pure linalg functions do not scale perfectly (although better than FFTs). This is pretty surprising to me. So either my laptop sucks, or threading sucks. :-(
Based on this, it seems julia-level threading might be the way to go after all? If julia and BLAS threading are not active at the same time (which I think is already the case right now), we can just set FFT threads to 1, and julia and BLAS threads to n. Essentially the only thing missing is a @threads on the bands loop in compute_partial_density (see the sketch below). Then we benchmark and take a look.
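A minimal sketch of that pattern (not the actual compute_partial_density; DFTK's transforms are stood in by plain FFTW plans): thread over the bands, give each thread its own scratch buffer and partial density, and sum at the end so there is no race on the accumulator.

using FFTW, LinearAlgebra

function partial_density_threaded(ψ_bands, occupation, fft_size)
    nt = Threads.nthreads()
    ρ_locals = [zeros(fft_size...) for _ = 1:nt]              # one accumulator per thread
    bufs     = [zeros(ComplexF64, fft_size...) for _ = 1:nt]  # one scratch grid per thread
    plans    = [plan_bfft!(bufs[t]) for t = 1:nt]             # unnormalized inverse FFT
    Threads.@threads for n = 1:length(ψ_bands)
        t = Threads.threadid()
        bufs[t] .= ψ_bands[n]             # band already on the dense grid in this toy version
        mul!(bufs[t], plans[t], bufs[t])  # Fourier -> real space
        ρ_locals[t] .+= occupation[n] .* abs2.(bufs[t])
    end
    sum(ρ_locals)                         # reduce the per-thread partial densities
end

# toy usage
fft_size = (20, 20, 20)
ψ_bands = [randn(ComplexF64, fft_size...) for _ = 1:8]
ρ = partial_density_threaded(ψ_bands, fill(2.0, 8), fft_size)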
Also we should look at kernels (GEMM, FFT) with sizes corresponding to test cases we care about and microbenchmark them to look at scalings.
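Something like this, per kernel and per thread count (sketch; the sizes are placeholders to be replaced by the ones from the test cases we care about):

using BenchmarkTools, FFTW, LinearAlgebra

FFTW.set_num_threads(parse(Int, get(ENV, "FFT_THREADS", "1")))
BLAS.set_num_threads(parse(Int, get(ENV, "BLAS_THREADS", "1")))

# 3D FFT on a typical dense grid
A = randn(ComplexF64, 120, 75, 75)
p = plan_fft!(A, flags=FFTW.MEASURE)
@btime mul!($A, $p, $A)

# tall-skinny GEMM of the shape showing up in orthogonalizations (sizes made up)
X = randn(ComplexF64, 25_000, 40)
Y = randn(ComplexF64, 40, 40)
@btime $X * $Y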
Julia threading is not much better here... I think it comes down to FFTs being bandwidth bound. Memory bandwidth doesn't scale on my CPU so nothing really makes any difference here. Results are likely to be different on a cluster.
Of course it's highly implementation- and machine-dependent, but basically FFTs have an arithmetic intensity of ~1 flop per byte, which is about the crossing point of the roofline model, so FFTs are bandwidth bound (or close to it). I don't know how the memory of the cluster is laid out, but basically there's no hope of scaling to the full 16 cores of a node, no matter what we do. So MPI is the way to go, to use many nodes. GPUs might be a good alternative.
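Back-of-envelope for a dense grid of the size used in the benchmarks below, just to make the roofline argument concrete (the ~3 passes over the data is a rough assumption):

# rough arithmetic intensity of a complex 3D FFT on a 120x75x75 grid
N     = 120 * 75 * 75            # grid points
flops = 5 * N * log2(N)          # textbook ~5 N log2(N) flops for a complex FFT
bytes = 2 * 16 * N * 3           # read + write of ComplexF64, assuming ~3 passes over the data
println("arithmetic intensity ≈ ", round(flops / bytes, digits=2), " flop/byte")
# ≈ 1 flop/byte, i.e. right around the roofline crossover on typical CPUs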
With 2 kpoints and MPI it's about the same: I get a roughly 30% speedup, but not more. So I would say that's a pretty strong indicator that this is what's going on.
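For context, the kpoint parallelization pattern being tested is essentially this (sketch with MPI.jl; the rank/kpoint bookkeeping is made up and the per-kpoint work is omitted):

using MPI
MPI.Init()
comm  = MPI.COMM_WORLD
rank  = MPI.Comm_rank(comm)
nproc = MPI.Comm_size(comm)

nkpt = 2
my_kpts = rank+1:nproc:nkpt          # round-robin distribution of kpoints over ranks

ρ = zeros(120, 75, 75)
for ik in my_kpts
    # accumulate this kpoint's contribution to ρ here (omitted)
end
MPI.Allreduce!(ρ, +, comm)           # sum the partial densities over all ranks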
If julia and BLAS threading are not active at the same time (which I think is already the case right now)
No you have to actively make sure about that.
I have the rigorous results for the pre-MPI levels of threading we have right now. The data mostly agrees with what you said. Basically n_julia = n_blas and n_fftw = 1 is usually the best strategy, with the exception being cases with very large grids and few bands. In some cases n_fftw = 2 actually hurts performance a lot.
No you have to actively make sure about that.
I mean that in the code both are not active at the same time. We thread on bands for the hamiltonian application (only the local potential and kinetic, the nonlocal is computed separately) and on kpoints for the densities. There are no BLAS calls inside these (at least no substantial ones).
Ah I see. Yes that's true.
I do get perfect scaling on an in-place FFT of the same size as the one I used in the tests. So possibly we are limited by copies, zero-filling and indirect indexing in the shuffling of the G vectors.
For sure that part is not parallel.
OK, it's most likely the shuffle. Also there's something just weird with julia threads on my machine. Can you run the following gist on the machines you have available, with
JULIA_NUM_THREADS=1 julia scratch.jl 1
JULIA_NUM_THREADS=2 julia scratch.jl 1
JULIA_NUM_THREADS=1 julia scratch.jl 2
? https://gist.github.com/antoine-levitt/47bee24d1e3120c8f7b76c44323cc32e
I'd be careful with the perm due to the closure issue in the Threads.@threads
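(For what it's worth, in the gist perm is a const global so it should be fine; the usual workarounds when a captured variable would otherwise be a non-const global or get boxed are a let block or passing the arrays as function arguments. Minimal sketch of the latter:)

using Random
function shuffle_into!(out, perm)        # arguments are local and concretely typed, so the
    Threads.@threads for i in eachindex(out)   # @threads closure captures them without boxing
        out[i] = perm[i]                 # dummy work, just to exercise the capture
    end
end
shuffle_into!(zeros(100), randperm(100))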
antoine@beta ~ $ JULIA_NUM_THREADS=1 julia scratch.jl 1
1 Julia threads
1 FFT threads
1 OP FFT (ms, nothread): 3.0656849999999998
1 OP FFT (ms, thread): 3.0829690000000003
1 IP FFT (ms, nothread): 3.090488
1 IP FFT (ms, thread): 3.097841
1 zero (ms, nothread): 0.22622699999999998
1 zero (ms, thread): 0.234516
1 shuff (ms, nothread): 2.405861
1 shuff (ms, thread): 2.458017
20 OP FFT (ms, nothread): 4.50236115
20 OP FFT (ms, thread): 4.49560955
20 IP FFT (ms, nothread): 4.5057681
20 IP FFT (ms, thread): 4.34425855
20 zero (ms, nothread): 0.9001383999999999
20 zero (ms, thread): 0.9011944500000001
20 shuff (ms, nothread): 3.9395521000000002
20 shuff (ms, thread): 3.99466725
antoine@beta ~ $ JULIA_NUM_THREADS=2 julia scratch.jl 1
2 Julia threads
1 FFT threads
1 OP FFT (ms, nothread): 3.092664
1 OP FFT (ms, thread): 3.310207
1 IP FFT (ms, nothread): 3.094226
1 IP FFT (ms, thread): 3.286885
1 zero (ms, nothread): 0.227385
1 zero (ms, thread): 0.24474
1 shuff (ms, nothread): 2.456639
1 shuff (ms, thread): 2.4547839999999996
20 OP FFT (ms, nothread): 4.4639671
20 OP FFT (ms, thread): 2.74513365
20 IP FFT (ms, nothread): 4.4874934
20 IP FFT (ms, thread): 2.7548755000000003
20 zero (ms, nothread): 0.9148937500000001
20 zero (ms, thread): 0.9689609499999999
20 shuff (ms, nothread): 3.77124165
20 shuff (ms, thread): 4.80794445
antoine@beta ~ $ JULIA_NUM_THREADS=1 julia scratch.jl 2
1 Julia threads
2 FFT threads
1 OP FFT (ms, nothread): 1.6778229999999998
1 OP FFT (ms, thread): 1.698907
1 IP FFT (ms, nothread): 1.64182
1 IP FFT (ms, thread): 1.662633
1 zero (ms, nothread): 0.223545
1 zero (ms, thread): 0.232068
1 shuff (ms, nothread): 2.39668
1 shuff (ms, thread): 2.454685
20 OP FFT (ms, nothread): 2.6543426500000002
20 OP FFT (ms, thread): 2.6570268
20 IP FFT (ms, nothread): 2.65212585
20 IP FFT (ms, thread): 2.6525905499999998
20 zero (ms, nothread): 0.9025374500000001
20 zero (ms, thread): 0.9006238999999999
20 shuff (ms, nothread): 3.8307086999999997
20 shuff (ms, thread): 3.86930885
Then again the shuffle is exaggerated here, since ours is ~16 times smaller
It's fine, it's a constant here
So, some conclusions:
- Doing one FFT a lot of times is ~30% faster than doing many FFTs a lot of times. Probably a memory locality issue.
- FFTs are ~5x slower than simply passing over the data
- Shuffles are about the same cost as FFTs
- FFT threading works relatively well. Both FFT and "band" threading work about equally well.
- Passing over the data doesn't scale at all; shuffling actually anti-scales.
Will update the gist with a more realistic 16x smaller shuffling, don't run the tests yet
https://gist.github.com/antoine-levitt/ad0a5c079c0c21d5b5c201bb3463341a
1 Julia threads
1 FFT threads
1 OP FFT (ms, nothread): 3.0981020000000004
1 OP FFT (ms, thread): 3.112112
1 IP FFT (ms, nothread): 3.070499
1 IP FFT (ms, thread): 3.087192
1 zero (ms, nothread): 0.225986
1 zero (ms, thread): 0.232553
1 shuff (ms, nothread): 0.064763
1 shuff (ms, thread): 0.06961500000000001
20 OP FFT (ms, nothread): 4.40295365
20 OP FFT (ms, thread): 4.4727388999999995
20 IP FFT (ms, nothread): 4.4169778
20 IP FFT (ms, thread): 4.39876445
20 zero (ms, nothread): 0.9007288
20 zero (ms, thread): 0.9101356999999999
20 shuff (ms, nothread): 0.42482214999999995
20 shuff (ms, thread): 0.4216371
2 Julia threads
1 FFT threads
1 OP FFT (ms, nothread): 3.0806509999999996
1 OP FFT (ms, thread): 3.2894069999999997
1 IP FFT (ms, nothread): 3.072821
1 IP FFT (ms, thread): 3.280761
1 zero (ms, nothread): 0.22847800000000001
1 zero (ms, thread): 0.24864800000000004
1 shuff (ms, nothread): 0.06420000000000001
1 shuff (ms, thread): 0.072241
20 OP FFT (ms, nothread): 4.47046475
20 OP FFT (ms, thread): 2.7399812999999997
20 IP FFT (ms, nothread): 4.45843685
20 IP FFT (ms, thread): 2.73897325
20 zero (ms, nothread): 0.89145825
20 zero (ms, thread): 1.00910495
20 shuff (ms, nothread): 0.4233373
20 shuff (ms, thread): 0.38536115
1 Julia threads
2 FFT threads
1 OP FFT (ms, nothread): 1.673157
1 OP FFT (ms, thread): 1.682847
1 IP FFT (ms, nothread): 1.657648
1 IP FFT (ms, thread): 1.674772
1 zero (ms, nothread): 0.221874
1 zero (ms, thread): 0.23099
1 shuff (ms, nothread): 0.064282
1 shuff (ms, thread): 0.069161
20 OP FFT (ms, nothread): 2.6773622
20 OP FFT (ms, thread): 2.68048485
20 IP FFT (ms, nothread): 2.66627665
20 IP FFT (ms, thread): 2.66890895
20 zero (ms, nothread): 0.9238886
20 zero (ms, thread): 0.9316450000000001
20 shuff (ms, nothread): 0.42073515000000006
20 shuff (ms, thread): 0.4213948
So that doesn't really explain it. Shuffles don't scale, but they are marginal compared to the FFT grid operations.
My desktop (openblas, fftw):
1 Julia threads
1 FFT threads
1 OP FFT (ms, nothread): 5.699495
1 OP FFT (ms, thread): 5.786136
1 IP FFT (ms, nothread): 5.621367
1 IP FFT (ms, thread): 5.42435
1 zero (ms, nothread): 0.283626
1 zero (ms, thread): 0.302891
1 shuff (ms, nothread): 2.2300530000000003
1 shuff (ms, thread): 2.215383
20 OP FFT (ms, nothread): 5.7073655500000005
20 OP FFT (ms, thread): 5.840356750000001
20 IP FFT (ms, nothread): 5.8854368
20 IP FFT (ms, thread): 5.727769149999999
20 zero (ms, nothread): 0.4712422
20 zero (ms, thread): 0.47405585000000006
20 shuff (ms, nothread): 2.53749955
20 shuff (ms, thread): 2.50997615
1 Julia threads
2 FFT threads
1 OP FFT (ms, nothread): 2.819764
1 OP FFT (ms, thread): 2.85293
1 IP FFT (ms, nothread): 2.836836
1 IP FFT (ms, thread): 2.851259
1 zero (ms, nothread): 0.264587
1 zero (ms, thread): 0.271665
1 shuff (ms, nothread): 2.156283
1 shuff (ms, thread): 2.1864250000000003
20 OP FFT (ms, nothread): 2.99389355
20 OP FFT (ms, thread): 3.0105041999999997
20 IP FFT (ms, nothread): 3.0007459
20 IP FFT (ms, thread): 3.0353887
20 zero (ms, nothread): 0.4753942
20 zero (ms, thread): 0.47681375000000004
20 shuff (ms, nothread): 2.5315598500000003
20 shuff (ms, thread): 2.56014645
2 Julia threads
1 FFT threads
1 OP FFT (ms, nothread): 5.426804000000001
1 OP FFT (ms, thread): 5.597681000000001
1 IP FFT (ms, nothread): 5.4196100000000005
1 IP FFT (ms, thread): 5.5905499999999995
1 zero (ms, nothread): 0.259654
1 zero (ms, thread): 0.271335
1 shuff (ms, nothread): 2.179247
1 shuff (ms, thread): 2.223623
20 OP FFT (ms, nothread): 5.7493665
20 OP FFT (ms, thread): 3.0053435
20 IP FFT (ms, nothread): 5.76326915
20 IP FFT (ms, thread): 3.0073784
20 zero (ms, nothread): 0.47232145
20 zero (ms, thread): 0.444852
20 shuff (ms, nothread): 2.5365754000000003
20 shuff (ms, thread): 2.489903
About similar then
Well, for me FFT threading sucks more.
If you run it on the cluster, can you also look at 1xN and Nx1 with N = 1, 2, 4, 8, 16? Set nreps to 32 rather than 20 to reduce noise.
It's not that bad though, almost a 2x speedup.
OK so: FFTs do thread relatively well, on my laptop as well as on the cluster. They are the limiting step in the benchmarks I'm running. Then why do I get such crappy scaling?! Next step to debug this (it's going to drive me mad) is modifying the minibenchmark script to better reflect the computations we do, and see if that scales.
Thanks for looking into this. I'm also surprised about my bad scaling results for practical examples.
I thought maybe it had something to do with locking, but I checked and that's not the case
Meet mini-DFTK:
using BenchmarkTools
using FFTW
using LinearAlgebra  # for mul!
using Random
fft_threads = isempty(ARGS) ? 1 : parse(Int, ARGS[1])
FFTW.set_num_threads(fft_threads)
println("$(Threads.nthreads()) Julia threads")
println("$fft_threads FFT threads")
N1 = 120
N2 = 75
N3 = 75
nrep_max = 51
grid_ratio_factor = 27
NG = div(N1*N2*N3, grid_ratio_factor)
const as = [randn(ComplexF64, N1, N2, N3) for n = 1:nrep_max]
const bs = [randn(ComplexF64, N1, N2, N3) for n = 1:nrep_max]
const cs = [randn(ComplexF64, NG) for n = 1:nrep_max]
const ds = [randn(ComplexF64, NG) for n = 1:nrep_max]
const pot = randn(N1, N2, N3)
const kin = randn(NG)
const op = plan_fft(as[1])                     # out-of-place plan (not used below)
const ip = plan_fft!(as[1])                    # in-place plan
const perm = Random.randperm(N1*N2*N3)[1:NG]   # random sphere -> dense grid mapping (the "shuffle")
# apply the kinetic + local-potential part of the Hamiltonian, one band per loop iteration
function bench_apply_threads(nreps)
    Threads.@threads for n = 1:nreps
        ψ = cs[n]
        Hψ = ds[n]
        ψ_real = as[n]
        fill!(ψ_real, 0)             # zero the dense grid
        ψ_real[perm] .= ψ            # scatter the G-sphere coefficients (the "shuffle")
        mul!(ψ_real, ip, ψ_real)     # in-place FFT to real space
        ψ_real .*= pot               # multiply by the local potential
        mul!(ψ_real, ip, ψ_real)     # second in-place FFT (forward plan reused; cost equivalent to the inverse)
        Hψ .= view(ψ_real, perm)     # gather back onto the sphere
        Hψ .+= kin .* ψ              # add the kinetic term
    end
end
# pure in-place FFTs of the same size, for comparison
function bench_ip_threads(nreps)
    Threads.@threads for n = 1:nreps
        mul!(as[n], ip, as[n])
    end
end
nreps = nrep_max
println("$nreps applies (ms) : ", 1000/nreps*@belapsed bench_apply_threads($nreps))
println("$nreps 2xFFTs (ms) : ", 2000/nreps*@belapsed bench_ip_threads($nreps))
I took the numbers from a 3x2x2 silicon supercell, Ecut 30. Btw, in this case the factor between the two grids is pretty big; using optimize_fft_grid reduces it to 22.
Using this I get
antoine@beta ~ $ JULIA_NUM_THREADS=1 julia scratch.jl; JULIA_NUM_THREADS=2 julia scratch.jl
1 Julia threads
1 FFT threads
51 applies (ms) : 17.745557
51 2xFFTs (ms) : 14.747430941176473
2 Julia threads
1 FFT threads
51 applies (ms) : 12.675170470588235
51 2xFFTs (ms) : 8.688853137254902
So the speedup factor going from 1 to 2 threads is ~1.4 for the apply, ~1.7 for the pure FFT, and ~1.25 for DFTK. I'm guessing single-core FFTs are essentially sitting on the fence of being bandwidth limited (as seen from the fact that the applies take a bit more time than the FFTs). Threading increases the FLOPs but not the bandwidth. DFTK is probably worse because there the memory is "cold", since other things have happened between iterations. Not really a lot we can do here... In any case, if you know threading experts, the above is a nice reproducer. With https://github.com/JuliaMolSim/DFTK.jl/pull/354 in, it's exactly representative of the kinetic+local apply.
Crazily enough, FFT normalizations have a non-negligible impact!
With scaled plans, speedup of apply is ~1.33, speedup of FFTs is ~1.5. That's much closer to DFTK results.
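i.e. a scaled plan does a separate normalization pass over the whole array, which is pure memory traffic, while doing the scaling "manually" lets it be fused into a broadcast that touches the data anyway. Minimal sketch (not the DFTK code):

using FFTW

a   = randn(ComplexF64, 120, 75, 75)
pot = randn(120, 75, 75)
p   = plan_fft!(a)

p_scaled = (1 / length(a)) * p   # AbstractFFTs ScaledPlan
a = p_scaled * a                 # transform, then an extra rescaling pass over the array

a = p * a                        # unnormalized transform...
a .*= pot ./ length(a)           # ...normalization fused into the potential multiply, no extra pass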
So we just do the scaling manually?
Done :-)