fflas-ffpack
further improve PLUQ performance
jgdumas@hpac > ./benchmark-fgemm -p 0
Time: 0.908587 Gflops: 17.6098 -q 131071 -m 2000 -k 2000 -n 2000 -w -1 -i 3 -p 0 -t 32 -b 32
jgdumas@hpac > ./benchmark-ftrsm -p 0
Time: 0.549372 Gflops: 14.5621 -q 1009 -m 2000 -n 2000 -i 3 -f "" -g "" -t 32 -b 32 -p 0
jgdumas@hpac > ./benchmark-pluq -p 0
Time: 0.960592 Gflops: 5.55213 -q 131071 -m 2000 -n 2000 -r 2000 -i 3 -v 0 -t 1 -b 1 -p N
I have investigated how to improve the permutations. The code is in the branch regressionPLUQ. I wrote a short note on the alternative algorithms and experiments: http://membres-liglab.imag.fr/pernet/Publications/pluq_permut.pdf
Note that things are already much better in the current release (more than a 2x speed-up for pluq):
jgdumas@hpac > ./benchmark-fgemm -p 0
Time: 0.911721 Gflops: 17.5492 -q 131071 -m 2000 -k 2000 -n 2000 -w -1 -i 3 -p 0 -t 32 -b 32
jgdumas@hpac > ./benchmark-ftrsm -p 0
Time: 0.547819 Gflops: 14.6034 -q 1009 -m 2000 -n 2000 -i 3 -f "" -g "" -t 32 -b 32 -p 0
jgdumas@hpac > ./benchmark-pluq -p 0
Time: 0.467326 Gflops: 11.4124 BC: 140734330533600 -q 131071 -m 2000 -n 2000 -r 2000 -i 3 -v 0 -t 1 -b 1 -p N
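For readers who want to sanity-check the figures above, here is a minimal Python sketch that recomputes the reported rates from the timings. The operation counts (2mkn for fgemm, m²n for ftrsm, 2n³/3 for a square full-rank pluq) are assumptions inferred from the printed numbers, not taken from the benchmark sources, and the helper names are hypothetical.

```python
# Assumed operation counts, inferred from the outputs quoted in this thread.
def fgemm_ops(m, k, n):
    """Classical matrix product of an m x k by a k x n matrix."""
    return 2.0 * m * k * n


def ftrsm_ops(m, n):
    """Triangular solve with an m x m triangular matrix and an m x n right-hand side."""
    return float(m) * m * n


def pluq_ops(n):
    """Elimination of a square n x n matrix of full rank (leading term only)."""
    return 2.0 * n ** 3 / 3.0


def rate(ops, seconds):
    """Rate in units of 1e9 operations per second."""
    return ops / seconds / 1e9


# Timings copied from the two runs quoted above (n = 2000).
old_pluq = rate(pluq_ops(2000), 0.960592)
new_pluq = rate(pluq_ops(2000), 0.467326)
speedup = new_pluq / old_pluq

print(f"pluq: {old_pluq:.4f} -> {new_pluq:.4f} (x{speedup:.2f})")
```

Under these assumptions the recomputed rates agree with the printed ones to the displayed precision, and the ratio of the two pluq runs is slightly above 2.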
Watch out: benchmark-pluq on master currently uses ZRing
Pluq with double looks depressingly slow on AVX512 machines compared to AVX2 (ftrsm may be to blame, and @ClementPernet suggested it might be an issue with the backend BLAS):
- AVX512
benchmark-pluq
Time: 0.714751 Gfops: 7.46181 -s N -q 131071 -m 2000 -n 2000 -r 2000 -g Y -i 7 -v 0 -t 1 -b 1 -p N
Time: 3.41686 Gfops: 12.4871 -s N -q 131071 -m 4000 -n 4000 -r 4000 -g Y -i 7 -v 0 -t 1 -b 1 -p N
benchmark-ftrsm
Time: 0.60976 Gfops: 13.1199 -q 131071 -m 2000 -n 2000 -i 7 -f "" -g "" -t 24 -b 24 -p 0
Time: 2.80451 Gfops: 22.8204 -q 131071 -m 4000 -n 4000 -i 7 -f "" -g "" -t 24 -b 24 -p 0
- AVX2
benchmark-pluq
Time: 0.265393 Gfops: 20.096 -s N -q 131071 -m 2000 -n 2000 -r 2000 -g Y -i 7 -v 0 -t 1 -b 1 -p N
Time: 1.56435 Gfops: 27.2743 -s N -q 131071 -m 4000 -n 4000 -r 4000 -g Y -i 7 -v 0 -t 1 -b 1 -p N
benchmark-ftrsm
Time: 0.187597 Gfops: 42.6445 -q 131071 -m 2000 -n 2000 -i 7 -f "" -g "" -t 1 -b 1 -p 0
Time: 1.31981 Gfops: 48.492 -q 131071 -m 4000 -n 4000 -i 7 -f "" -g "" -t 1 -b 1 -p 0
Things seem fine with int64_t, though:
- AVX512
benchmark-pluq
Time: 0.397803 Gfops: 13.407 -s N -q 268435399 -m 2000 -n 2000 -r 2000 -g Y -i 7 -v 0 -t 1 -b 1 -p N
Time: 1.90285 Gfops: 22.4225 -s N -q 268435399 -m 4000 -n 4000 -r 4000 -g Y -i 7 -v 0 -t 1 -b 1 -p N
- AVX2
benchmark-pluq
Time: 0.34046 Gfops: 15.6651 -s N -q 268435399 -m 2000 -n 2000 -r 2000 -g Y -i 7 -v 0 -t 1 -b 1 -p N
Time: 2.16316 Gfops: 19.7242 -s N -q 268435399 -m 4000 -n 4000 -r 4000 -g Y -i 7 -v 0 -t 1 -b 1 -p N
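To compare such runs without reading the numbers by eye, the result lines can be parsed mechanically. The sketch below is a hypothetical helper, not part of fflas-ffpack; it assumes the line layout seen in this thread (`Time:`, then `Gfops:` or `Gflops:`, then key/value pairs) and uses two of the double pluq lines quoted above as sample input.

```python
import re
import shlex

# Matches both the older "Gflops:" and the newer "Gfops:" label.
LINE_RE = re.compile(r"Time:\s*(\S+)\s+G(?:fl|f)ops:\s*(\S+)\s*(.*)")


def parse_result(line):
    """Split one benchmark result line into time, rate and trailing parameters."""
    match = LINE_RE.search(line)
    if match is None:
        raise ValueError(f"not a benchmark result line: {line!r}")
    # shlex keeps empty quoted values such as -f "" as empty strings.
    tokens = shlex.split(match.group(3))
    # Assumption: the trailing tokens always come as key/value pairs.
    params = dict(zip(tokens[0::2], tokens[1::2]))
    return {"time": float(match.group(1)),
            "rate": float(match.group(2)),
            "params": params}


# Lines copied from the double runs above (n = 2000).
avx512 = parse_result("Time: 0.714751 Gfops: 7.46181 -s N -q 131071 -m 2000 "
                      "-n 2000 -r 2000 -g Y -i 7 -v 0 -t 1 -b 1 -p N")
avx2 = parse_result("Time: 0.265393 Gfops: 20.096 -s N -q 131071 -m 2000 "
                    "-n 2000 -r 2000 -g Y -i 7 -v 0 -t 1 -b 1 -p N")

ratio = avx2["rate"] / avx512["rate"]
print(f"AVX2 / AVX512 rate ratio for n={avx2['params']['-m']}: {ratio:.2f}")
```

Pairing the trailing tokens two by two is a simplification that holds for the lines shown here, including the extra `BC:` field; it would need revisiting if a flag without a value were ever printed.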
A small comment about how things go with PR #250 on hpac:
Before:
karpman@hpac>./benchmark-fgemm -p 0 -i 7
Time: 0.895271 Gfops: 17.8717 -q 131071 -m 2000 -k 2000 -n 2000 -w -1 -i 7 -p 0 -t 32 -b 32
karpman@hpac>./benchmark-ftrsm -p 0 -q 131071 -i 7
Time: 0.550655 Gfops: 14.5281 -q 131071 -m 2000 -n 2000 -i 7 -f "" -g "" -t 32 -b 32 -p 0
karpman@hpac>./benchmark-pluq -p 0 -q 131071 -i 7
Time: 0.767794 Gfops: 6.94631 -s N -q 131071 -m 2000 -n 2000 -r 2000 -g Y -i 7 -v 0 -t 1 -b 1 -p N
After:
karpman@hpac>./benchmark-fgemm -p 0 -i 7
Time: 0.913098 Gfops: 17.5228 -q 131071 -m 2000 -k 2000 -n 2000 -w -1 -i 7 -p 0 -t 32 -b 32
karpman@hpac>./benchmark-ftrsm -p 0 -q 131071 -i 7
Time: 0.555532 Gfops: 14.4006 -q 131071 -m 2000 -n 2000 -i 7 -f "" -g "" -t 32 -b 32 -p 0
karpman@hpac>./benchmark-pluq -p 0 -q 131071 -i 7
Time: 0.570384 Gfops: 9.35043 -s N -q 131071 -m 2000 -n 2000 -r 2000 -g Y -i 7 -v 0 -t 1 -b 1 -p N
(Note that @jgdumas's faster pluq run at 11.4124 Gfops was likely using ZRing and not ModularBalanced, so going back to only 9.35 Gfops shouldn't count as a regression.)
After a quick check, the slowness on AVX-512 indeed seems to be caused by the backend BLAS. Switching to OpenBLAS, I get:
benchmark-pluq
Time: 0.222758 Gfops: 23.9423 -s N -q 131071 -m 2000 -n 2000 -r 2000 -g Y -i 7 -v 0 -t 1 -b 1 -p N
Time: 1.41386 Gfops: 30.1774 -s N -q 131071 -m 4000 -n 4000 -r 4000 -g Y -i 7 -v 0 -t 1 -b 1 -p N
benchmark-ftrsm
Time: 0.210032 Gfops: 38.0894 -q 131071 -m 2000 -n 2000 -i 7 -f "" -g "" -t 24 -b 24 -p 0
Time: 1.45244 Gfops: 44.0638 -q 131071 -m 4000 -n 4000 -i 7 -f "" -g "" -t 24 -b 24 -p 0
This is still not particularly good compared to what I get on my laptop, though 😕