MMA-izing the prolongator and restrictor kernels
Thanks @hummingtree. Noting this PR is dependent on #1489.
When we zero-pad because fewer RHS are in use than the tensor core kernels natively support (e.g., 4 RHS with native 8-RHS tensor core support), can we make sure that the flops are counted correctly, and that we don't count imaginary flops?
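For bookkeeping, a minimal sketch of the idea (hypothetical names, not the actual QUDA counters): the flop count should scale with the actual RHS count, not the padded tile width.

```cpp
#include <cstdint>

// Hypothetical flop counter for a transfer-operator apply: the kernel is
// launched with a padded tile of RHS vectors, but only n_rhs_actual columns
// carry real data, so reported flops scale with n_rhs_actual, never with
// the padded tile width.
inline std::int64_t transfer_flops(std::int64_t fine_sites, int fine_nc,
                                   int coarse_nc, int n_rhs_actual)
{
  // 8 flops per complex fused multiply-add, one per (fine, coarse) color pair
  return 8LL * fine_sites * fine_nc * coarse_nc * n_rhs_actual;
}
```

With this convention, running 4 RHS on an 8-RHS tile reports exactly half the flops of a full tile, as expected.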
As far as I can tell it doesn't look like the MMA prolongator/restrictor is getting picked up with staggered on Hopper, at least using the default cmake parameters:
cmake -DQUDA_DIRAC_DEFAULT_OFF=ON \
-DQUDA_DIRAC_STAGGERED=ON \
-DCMAKE_BUILD_TYPE=DEVEL \
-DQUDA_BACKWARDS=ON \
-DCMAKE_INSTALL_PREFIX=/scratch/local/install \
-DQUDA_MULTIGRID=ON \
-DQUDA_MULTIGRID_NVEC_LIST="24,64,96" \
-DQUDA_MULTIGRID_MRHS_LIST="8,16,32" \
-DQUDA_DOWNLOAD_USQCD=ON \
-DQUDA_QIO=ON -DQUDA_QMP=ON \
-DQUDA_PRECISION=14 \
-DQUDA_RECONSTRUCT=4 \
-DQUDA_GPU_ARCH=sm_90 \
/scratch/local/quda
Commands:
export QUDA_RESOURCE_PATH=`pwd`/tunecache
mpirun -np 1 ./heatbath_test --dim 16 16 16 16 --save-gauge l16t16b7p0 --heatbath-beta 7.0 --heatbath-coldstart true --heatbath-num-steps 10 --heatbath-warmup-steps 1000
mpirun -np 1 ./staggered_invert_test \
--prec double --prec-sloppy single --prec-null half --prec-precondition half \
--mass 0.01 --recon 18 --recon-sloppy 18 --recon-precondition 18 \
--dim 16 16 16 16 --gridsize 1 1 1 1 --load-gauge l16t16b7p0 \
--dslash-type asqtad --compute-fat-long true --tadpole-coeff 0.905160183 --tol 1e-10 \
--verbosity verbose --solve-type direct --solution-type mat --inv-type gcr \
--inv-multigrid true --mg-levels 4 --mg-coarse-solve-type 0 direct --mg-staggered-coarsen-type kd-optimized \
--mg-block-size 0 1 1 1 1 --mg-nvec 0 3 \
--mg-block-size 1 4 4 4 4 --mg-nvec 1 64 --mg-nvec-batch 1 32 \
--mg-block-size 2 2 2 2 2 --mg-nvec 2 96 --mg-nvec-batch 2 32 \
--mg-setup-tol 1 1e-5 --mg-setup-tol 2 1e-5 --mg-setup-inv 1 cgnr --mg-setup-inv 2 cgnr \
--nsrc 32 --nsrc-tile 8 --niter 25 \
--mg-setup-use-mma 0 true --mg-setup-use-mma 1 true --mg-setup-use-mma 2 true --mg-setup-use-mma 3 true \
--mg-dslash-use-mma 0 true --mg-dslash-use-mma 1 true --mg-dslash-use-mma 2 true --mg-dslash-use-mma 3 true \
--mg-transfer-use-mma 0 true --mg-transfer-use-mma 1 true --mg-transfer-use-mma 2 true --mg-transfer-use-mma 3 true \
--mg-smoother 0 ca-gcr --mg-smoother-solve-type 0 direct --mg-nu-pre 0 0 --mg-nu-post 0 4 \
--mg-smoother 1 ca-gcr --mg-smoother-solve-type 1 direct --mg-nu-pre 1 0 --mg-nu-post 1 4 \
--mg-smoother 2 ca-gcr --mg-smoother-solve-type 2 direct-pc --mg-nu-pre 2 0 --mg-nu-post 2 4 \
--mg-coarse-solver 1 gcr --mg-coarse-solve-type 1 direct --mg-coarse-solver-tol 1 0.25 --mg-coarse-solver-maxiter 1 16 \
--mg-coarse-solver 2 gcr --mg-coarse-solve-type 2 direct-pc --mg-coarse-solver-tol 2 0.25 --mg-coarse-solver-maxiter 2 16 \
--mg-coarse-solver 3 ca-gcr --mg-coarse-solve-type 3 direct-pc --mg-coarse-solver-tol 3 0.25 --mg-coarse-solver-maxiter 3 16 \
--mg-verbosity 0 verbose --mg-verbosity 1 verbose --mg-verbosity 2 verbose --mg-verbosity 3 verbose
I can confirm the dslash and setup flags are being picked up correctly (though I was a bit surprised to see only 1xfp16 being used for the half-precision CalculateYhat; it felt a little low...), so I think it's uniquely the prolongator/restrictor that's having plumbing issues, at least for the defaults.
I have fixed an issue where nVec_actual wasn't showing up in the tune string. This was caused by the FieldCache key being ignorant of this variable (and indeed of nVec itself), so we could fetch a field from the cache for the reorder temporary in create_color_spinor_copy that had a different nVec_actual value.
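A sketch of the shape of the fix (illustrative types only, not the actual FieldCache implementation): once nVec and nVec_actual are part of the cache key, fields with different actual-vector counts can no longer alias.

```cpp
#include <cstddef>
#include <map>
#include <tuple>

// Illustrative stand-in for a cached field; only key-relevant properties shown.
struct CachedField {
  int nVec = 0;
  int nVec_actual = 0;
};

// The key now includes nVec and nVec_actual (alongside e.g. volume and
// precision), so a reorder temporary with a different nVec_actual cannot
// be returned from the cache by mistake.
using FieldKey = std::tuple<std::size_t /*volume*/, int /*precision*/,
                            int /*nVec*/, int /*nVec_actual*/>;

inline CachedField &get_field(std::map<FieldKey, CachedField> &cache,
                              std::size_t volume, int prec, int nVec,
                              int nVec_actual)
{
  auto &f = cache[{volume, prec, nVec, nVec_actual}];
  f.nVec = nVec;
  f.nVec_actual = nVec_actual;
  return f;
}
```

Two lookups that differ only in nVec_actual now produce distinct cache entries.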
This has been fixed in https://github.com/lattice/quda/pull/1497/commits/85da7a329b41d2d8dd55330fed972f320a0db7fe.
Good news: this passes a visual review! Bad news: I hit an issue that's only present with --mg-dslash-use-mma enabled, single precision, Nc = 96... and it goes away if I disable auto-tuning, so I've attached the tunecache as well... This is on Hopper SXM 80GB. I'm not sure if I can trigger it with a single GPU build since it's tuning specific...
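For anyone trying to reproduce: QUDA's kernel auto-tuning can be disabled with its QUDA_ENABLE_TUNING environment variable, which is a quick way to check whether a failure is tuning-dependent.

```shell
# Disable QUDA kernel auto-tuning for this run, to separate tuning-dependent
# failures from genuine kernel bugs (the variable is read at startup).
export QUDA_ENABLE_TUNING=0
```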
cmake command:
cmake -DCMAKE_BUILD_TYPE=RELEASE -DQUDA_DIRAC_DEFAULT_OFF=ON -DQUDA_DIRAC_STAGGERED=ON \
-DQUDA_GPU_ARCH=sm_90 -DQUDA_DOWNLOAD_USQCD=ON -DQUDA_QIO=ON -DQUDA_QMP=ON \
-DQUDA_PRECISION=4 -DQUDA_RECONSTRUCT=4 \
-DQUDA_MULTIGRID=ON -DQUDA_MULTIGRID_NVEC_LIST="24,64,96" -DQUDA_MULTIGRID_MRHS_LIST="8,16,32" \
/scratch/local/quda
Command: with the tunecache I have, it only triggers with single precision, --mg-dslash-use-mma 3 true, and a 4-level solve... and the issue only hits on the coarsest level; error printout below.
PREC="single"
mpirun -np 1 ./staggered_invert_test \
--prec single --prec-sloppy single --prec-null $PREC --prec-precondition $PREC \
--mass 0.2 --recon 18 --recon-sloppy 18 --recon-precondition 18 \
--dim 16 16 16 16 --gridsize 1 1 1 1 \
--dslash-type staggered --compute-fat-long false --tadpole-coeff 0.905160183 --tol 1e-10 \
--verbosity verbose --solve-type direct --solution-type mat --inv-type gcr \
--inv-multigrid true --mg-levels 4 --mg-coarse-solve-type 0 direct --mg-staggered-coarsen-type kd-optimized \
--mg-block-size 0 1 1 1 1 --mg-nvec 0 3 \
--mg-block-size 1 4 4 4 4 --mg-nvec 1 64 --mg-nvec-batch 1 32 \
--mg-block-size 2 2 2 2 2 --mg-nvec 2 96 --mg-nvec-batch 2 32 \
--mg-setup-tol 1 1e-5 --mg-setup-tol 2 1e-5 --mg-setup-inv 1 cgnr --mg-setup-inv 2 cgnr \
--nsrc 32 --nsrc-tile 16 --niter 24 \
--mg-setup-use-mma 0 true --mg-setup-use-mma 1 true --mg-setup-use-mma 2 true --mg-setup-use-mma 3 true \
--mg-dslash-use-mma 0 true --mg-dslash-use-mma 1 true --mg-dslash-use-mma 2 true --mg-dslash-use-mma 3 true \
--mg-transfer-use-mma 0 false --mg-transfer-use-mma 1 false --mg-transfer-use-mma 2 false --mg-transfer-use-mma 3 false \
--mg-smoother 0 ca-gcr --mg-smoother-solve-type 0 direct --mg-nu-pre 0 0 --mg-nu-post 0 4 \
--mg-smoother 1 ca-gcr --mg-smoother-solve-type 1 direct --mg-nu-pre 1 0 --mg-nu-post 1 4 \
--mg-smoother 2 ca-gcr --mg-smoother-solve-type 2 direct-pc --mg-nu-pre 2 0 --mg-nu-post 2 4 \
--mg-coarse-solver 1 gcr --mg-coarse-solve-type 1 direct --mg-coarse-solver-tol 1 0.25 --mg-coarse-solver-maxiter 1 16 \
--mg-coarse-solver 2 gcr --mg-coarse-solve-type 2 direct-pc --mg-coarse-solver-tol 2 0.25 --mg-coarse-solver-maxiter 2 16 \
--mg-coarse-solver 3 ca-gcr --mg-coarse-solve-type 3 direct-pc --mg-coarse-solver-tol 3 0.25 --mg-coarse-solver-maxiter 3 16 \
--mg-verbosity 0 verbose --mg-verbosity 1 verbose --mg-verbosity 2 verbose --mg-verbosity 3 verbose
Here's the error: CA-GCR on the coarsest level fails very quickly, right after the first norm check following a dslash. It also breaks with any other solver, so it seems to be the coarsest dslash itself. It's unique to a batched solve and seemingly independent of --nsrc-tile, so maybe there's some weird corner of parameter space. I can only trigger it with Nc = 96 on the coarsest level; if it's on the intermediate level instead (replace --mg-nvec 1 64 with 96), things are fine...
[...]
MG level 2 (GPU): GCR: 0 iterations, n = 8, <r,r> = 4.825448e+04, |r|/|b| = 1.000000e+00
MG level 2 (GPU): GCR: 0 iterations, n = 9, <r,r> = 4.814075e+04, |r|/|b| = 1.000000e+00
MG level 2 (GPU): GCR: 0 iterations, n = 10, <r,r> = 4.881138e+04, |r|/|b| = 1.000000e+00
MG level 2 (GPU): GCR: 0 iterations, n = 11, <r,r> = 4.742637e+04, |r|/|b| = 1.000000e+00
MG level 2 (GPU): GCR: 0 iterations, n = 12, <r,r> = 4.829302e+04, |r|/|b| = 1.000000e+00
MG level 2 (GPU): GCR: 0 iterations, n = 13, <r,r> = 4.812252e+04, |r|/|b| = 1.000000e+00
MG level 2 (GPU): GCR: 0 iterations, n = 14, <r,r> = 4.755250e+04, |r|/|b| = 1.000000e+00
MG level 2 (GPU): GCR: 0 iterations, n = 15, <r,r> = 4.762029e+04, |r|/|b| = 1.000000e+00
MG level 3 (GPU): CA-GCR: 0 iterations, n = 0, <r,r> = nan, |r|/|b| = nan
MG level 3 (GPU): ERROR: Solver appears to have diverged for n = 0 (rank 0, host ipp2-0709.nvidia.com, solver.cpp:479 in void quda::Solver::PrintStats(const char*, int, quda::cvector<double>&, quda::cvector<double>&, quda::cvector<double>&)())
MG level 3 (GPU): last kernel called was (name=cudaMemsetAsync,volume=bytes=12288,aux=zero,color_spinor_field.cpp,409)
MG level 3 (GPU): Saving 1207 sets of cached parameters to /scratch/local/build/tests/tunecache/tunecache_error.tsv
Reference tunecache: tunecache_fail.tar.gz
Commit id: 49c0a583acf99faf106d96041762e9b6d9f7b60a
A much cleaner command to reproduce... thanks @hummingtree:
for PREC in half single
do
mpirun -n 1 ./multigrid_benchmark_test --test 0 --dim 2 2 2 2 --niter 10 --nsrc 8 --prec-sloppy ${PREC} --mg-nvec 0 96 --mg-dslash-use-mma 0 true
done
Thanks Evan for the tests! This should have been fixed in https://github.com/lattice/quda/pull/1497/commits/e8ca86924e6ed917717f4c062e3a2573d6ff11a5.
cscs-ci run