MMA-izing the prolongator and restrictor kernels

hummingtree opened this issue on Sep 27 '24.

MMA-izing the prolongator and restrictor kernels.

hummingtree avatar Sep 27 '24 13:09 hummingtree

Thanks @hummingtree. Noting this PR is dependent on #1489.

maddyscientist avatar Sep 27 '24 16:09 maddyscientist

When we zero-pad because fewer RHS are used than the tensor-core kernels support, e.g., 4 RHS with native 8-RHS tensor-core support, can we make sure that the flops are counted correctly and we don't count imaginary flops?
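The distinction being asked for can be sketched in a few lines (illustrative names only, not QUDA's actual flop-counting API): the hardware executes work for the padded column count, but the performance report should scale with the actual RHS count.

```python
# Hedged sketch: n_actual RHS are zero-padded up to the kernel's native
# n_native RHS. Padded columns execute on the hardware, but their flops are
# "imaginary" and must not be reported. All names here are illustrative.

def reported_flops(flops_per_rhs: int, n_actual: int, n_native: int) -> int:
    """Flops to report: per-RHS cost times the actual RHS count."""
    assert n_actual <= n_native, "cannot use more RHS than the kernel supports"
    return flops_per_rhs * n_actual

def executed_flops(flops_per_rhs: int, n_native: int) -> int:
    """Flops the hardware actually performs, including zero-padded columns."""
    return flops_per_rhs * n_native

# 4 RHS padded into a native 8-RHS tensor-core kernel:
print(reported_flops(1000, 4, 8))  # 4000, not 8000
print(executed_flops(1000, 8))     # 8000
```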

maddyscientist avatar Dec 06 '24 23:12 maddyscientist

As far as I can tell it doesn't look like the MMA prolongator/restrictor is getting picked up with staggered on Hopper, at least using the default cmake parameters:

cmake -DQUDA_DIRAC_DEFAULT_OFF=ON \
      -DQUDA_DIRAC_STAGGERED=ON \
      -DCMAKE_BUILD_TYPE=DEVEL \
      -DQUDA_BACKWARDS=ON \
      -DCMAKE_INSTALL_PREFIX=/scratch/local/install \
      -DQUDA_MULTIGRID=ON \
      -DQUDA_MULTIGRID_NVEC_LIST="24,64,96" \
      -DQUDA_MULTIGRID_MRHS_LIST="8,16,32" \
      -DQUDA_DOWNLOAD_USQCD=ON \
      -DQUDA_QIO=ON -DQUDA_QMP=ON \
      -DQUDA_PRECISION=14 \
      -DQUDA_RECONSTRUCT=4 \
      -DQUDA_GPU_ARCH=sm_90 \
      /scratch/local/quda

Commands:

export QUDA_RESOURCE_PATH=`pwd`/tunecache

mpirun -np 1 ./heatbath_test --dim 16 16 16 16 --save-gauge l16t16b7p0   --heatbath-beta 7.0 --heatbath-coldstart true --heatbath-num-steps 10 --heatbath-warmup-steps 1000

mpirun -np 1 ./staggered_invert_test \
  --prec double --prec-sloppy single --prec-null half --prec-precondition half \
  --mass 0.01 --recon 18 --recon-sloppy 18 --recon-precondition 18 \
  --dim 16 16 16 16 --gridsize 1 1 1 1 --load-gauge l16t16b7p0 \
  --dslash-type asqtad --compute-fat-long true --tadpole-coeff 0.905160183 --tol 1e-10 \
  --verbosity verbose --solve-type direct --solution-type mat --inv-type gcr \
  --inv-multigrid true --mg-levels 4 --mg-coarse-solve-type 0 direct --mg-staggered-coarsen-type kd-optimized \
  --mg-block-size 0 1 1 1 1 --mg-nvec 0 3 \
  --mg-block-size 1 4 4 4 4 --mg-nvec 1 64 --mg-nvec-batch 1 32 \
  --mg-block-size 2 2 2 2 2 --mg-nvec 2 96 --mg-nvec-batch 2 32 \
  --mg-setup-tol 1 1e-5 --mg-setup-tol 2 1e-5 --mg-setup-inv 1 cgnr --mg-setup-inv 2 cgnr \
  --nsrc 32 --nsrc-tile 8 --niter 25 \
  --mg-setup-use-mma 0 true --mg-setup-use-mma 1 true --mg-setup-use-mma 2 true --mg-setup-use-mma 3 true \
  --mg-dslash-use-mma 0 true --mg-dslash-use-mma 1 true --mg-dslash-use-mma 2 true --mg-dslash-use-mma 3 true \
  --mg-transfer-use-mma 0 true --mg-transfer-use-mma 1 true --mg-transfer-use-mma 2 true --mg-transfer-use-mma 3 true \
  --mg-smoother 0 ca-gcr --mg-smoother-solve-type 0 direct  --mg-nu-pre 0 0 --mg-nu-post 0 4 \
  --mg-smoother 1 ca-gcr --mg-smoother-solve-type 1 direct --mg-nu-pre 1 0 --mg-nu-post 1 4 \
  --mg-smoother 2 ca-gcr --mg-smoother-solve-type 2 direct-pc  --mg-nu-pre 2 0 --mg-nu-post 2 4 \
  --mg-coarse-solver 1 gcr --mg-coarse-solve-type 1 direct --mg-coarse-solver-tol 1 0.25 --mg-coarse-solver-maxiter 1 16 \
  --mg-coarse-solver 2 gcr --mg-coarse-solve-type 2 direct-pc --mg-coarse-solver-tol 2 0.25 --mg-coarse-solver-maxiter 2 16 \
  --mg-coarse-solver 3 ca-gcr --mg-coarse-solve-type 3 direct-pc --mg-coarse-solver-tol 3 0.25 --mg-coarse-solver-maxiter 3 16 \
  --mg-verbosity 0 verbose --mg-verbosity 1 verbose --mg-verbosity 2 verbose --mg-verbosity 3 verbose
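For context on the `--nsrc 32 --nsrc-tile 8` pair in the command above: the N sources are solved in batches of at most the tile size. A minimal sketch of that batching (my reading of the flags, not QUDA code):

```python
# Hedged sketch: split an nsrc-RHS solve into tiles of at most `tile` RHS each,
# as --nsrc/--nsrc-tile appear to request. Illustrative only.

def tiles(nsrc: int, tile: int) -> list[int]:
    """Return the RHS count of each batch, in order."""
    return [min(tile, nsrc - i) for i in range(0, nsrc, tile)]

print(tiles(32, 8))  # [8, 8, 8, 8]
print(tiles(10, 4))  # [4, 4, 2]
```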

I can confirm the dslash and setup flags are being picked up correctly (but I was a bit surprised to see only 1xfp16 was getting used for the half-precision CalculateYhat, it felt a little low...), so I think it's uniquely the prolongator/restrictor that's having plumbing issues, at least for the default.

weinbe2 avatar Dec 20 '24 21:12 weinbe2

I have fixed an issue where I wasn't seeing nVec_actual show up in the tune string. This was caused by the FieldCache key being ignorant of this variable (and indeed nVec itself), and so we could fetch a field from the cache for the reorder temporary from create_color_spinor_copy that had a different nVec_actual value in it.
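The bug class described here can be illustrated generically: if a field cache's key omits a distinguishing attribute (here nVec / nVec_actual), two logically different fields collide and a stale temporary is returned. A minimal sketch with hypothetical names (not QUDA's actual FieldCache API):

```python
from dataclasses import dataclass

# Hedged sketch: a cache keyed only on (precision, volume) silently reuses a
# temporary created for a different nvec_actual; adding nvec_actual to the key
# keeps distinct fields distinct. Names are hypothetical.

@dataclass(frozen=True)
class BuggyKey:
    precision: str
    volume: int          # nvec_actual deliberately omitted -> collisions

@dataclass(frozen=True)
class FixedKey:
    precision: str
    volume: int
    nvec_actual: int     # including it prevents stale reuse

cache = {}

def get_field(key):
    # Fetch-or-create: returns the cached field if the key already exists.
    return cache.setdefault(key, {"key": key})

a = get_field(BuggyKey("half", 16**4))
b = get_field(BuggyKey("half", 16**4))   # intended for a different nvec_actual
print(a is b)  # True: stale reuse, the bug described above

c = get_field(FixedKey("half", 16**4, 32))
d = get_field(FixedKey("half", 16**4, 8))
print(c is d)  # False: distinct cache entries
```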

maddyscientist avatar Dec 20 '24 23:12 maddyscientist

(Quoting weinbe2's report above about the MMA prolongator/restrictor not being picked up with staggered on Hopper.)

This has been fixed in https://github.com/lattice/quda/pull/1497/commits/85da7a329b41d2d8dd55330fed972f320a0db7fe.

hummingtree avatar Jan 03 '25 20:01 hummingtree

Good news: this passes a visual review! Bad news: I hit an issue that's only present with --mg-dslash-use-mma enabled, single precision, Nc = 96... and it goes away if I disable auto-tuning, so I've attached the tunecache as well... This is on Hopper SXM 80GB. I'm not sure if I can trigger it with a single GPU build since it's tuning specific...

cmake command:

cmake -DCMAKE_BUILD_TYPE=RELEASE -DQUDA_DIRAC_DEFAULT_OFF=ON -DQUDA_DIRAC_STAGGERED=ON \
  -DQUDA_GPU_ARCH=sm_90 -DQUDA_DOWNLOAD_USQCD=ON -DQUDA_QIO=ON -DQUDA_QMP=ON \
  -DQUDA_PRECISION=4 -DQUDA_RECONSTRUCT=4 \
  -DQUDA_MULTIGRID=ON -DQUDA_MULTIGRID_NVEC_LIST="24,64,96" -DQUDA_MULTIGRID_MRHS_LIST="8,16,32" \
  /scratch/local/quda

Command (with the tunecache I have, it only triggers with single precision, --mg-dslash-use-mma 3 true, and a 4-level solve, and the issue only hits on the coarsest level; error printout below):

PREC="single"

mpirun -np 1 ./staggered_invert_test \
  --prec single --prec-sloppy single --prec-null $PREC --prec-precondition $PREC \
  --mass 0.2 --recon 18 --recon-sloppy 18 --recon-precondition 18 \
  --dim 16 16 16 16 --gridsize 1 1 1 1 \
  --dslash-type staggered --compute-fat-long false --tadpole-coeff 0.905160183 --tol 1e-10 \
  --verbosity verbose --solve-type direct --solution-type mat --inv-type gcr \
  --inv-multigrid true --mg-levels 4 --mg-coarse-solve-type 0 direct --mg-staggered-coarsen-type kd-optimized \
  --mg-block-size 0 1 1 1 1 --mg-nvec 0 3 \
  --mg-block-size 1 4 4 4 4 --mg-nvec 1 64 --mg-nvec-batch 1 32 \
  --mg-block-size 2 2 2 2 2 --mg-nvec 2 96 --mg-nvec-batch 2 32 \
  --mg-setup-tol 1 1e-5 --mg-setup-tol 2 1e-5 --mg-setup-inv 1 cgnr --mg-setup-inv 2 cgnr \
  --nsrc 32 --nsrc-tile 16 --niter 24 \
  --mg-setup-use-mma 0 true --mg-setup-use-mma 1 true --mg-setup-use-mma 2 true --mg-setup-use-mma 3 true \
  --mg-dslash-use-mma 0 true --mg-dslash-use-mma 1 true --mg-dslash-use-mma 2 true --mg-dslash-use-mma 3 true \
  --mg-transfer-use-mma 0 false --mg-transfer-use-mma 1 false --mg-transfer-use-mma 2 false --mg-transfer-use-mma 3 false \
  --mg-smoother 0 ca-gcr --mg-smoother-solve-type 0 direct  --mg-nu-pre 0 0 --mg-nu-post 0 4 \
  --mg-smoother 1 ca-gcr --mg-smoother-solve-type 1 direct --mg-nu-pre 1 0 --mg-nu-post 1 4 \
  --mg-smoother 2 ca-gcr --mg-smoother-solve-type 2 direct-pc  --mg-nu-pre 2 0 --mg-nu-post 2 4 \
  --mg-coarse-solver 1 gcr --mg-coarse-solve-type 1 direct --mg-coarse-solver-tol 1 0.25 --mg-coarse-solver-maxiter 1 16 \
  --mg-coarse-solver 2 gcr --mg-coarse-solve-type 2 direct-pc --mg-coarse-solver-tol 2 0.25 --mg-coarse-solver-maxiter 2 16 \
  --mg-coarse-solver 3 ca-gcr --mg-coarse-solve-type 3 direct-pc --mg-coarse-solver-tol 3 0.25 --mg-coarse-solver-maxiter 3 16 \
  --mg-verbosity 0 verbose --mg-verbosity 1 verbose --mg-verbosity 2 verbose --mg-verbosity 3 verbose

Here's the error: CA-GCR fails on the coarsest level very quickly, at the first norm check after a dslash; it also breaks with any other solver, so it seems to be the coarsest dslash itself. It's unique to a batched solve and seemingly independent of --nsrc-tile, so maybe there's some weird corner of parameter space. I can only trigger it with Nc = 96 on the coarsest level; if it's on the intermediate level instead (replace --mg-nvec 1 64 with 96) things are fine...

[...]
MG level 2 (GPU): GCR:     0 iterations, n = 8, <r,r> = 4.825448e+04, |r|/|b| = 1.000000e+00
MG level 2 (GPU): GCR:     0 iterations, n = 9, <r,r> = 4.814075e+04, |r|/|b| = 1.000000e+00
MG level 2 (GPU): GCR:     0 iterations, n = 10, <r,r> = 4.881138e+04, |r|/|b| = 1.000000e+00
MG level 2 (GPU): GCR:     0 iterations, n = 11, <r,r> = 4.742637e+04, |r|/|b| = 1.000000e+00
MG level 2 (GPU): GCR:     0 iterations, n = 12, <r,r> = 4.829302e+04, |r|/|b| = 1.000000e+00
MG level 2 (GPU): GCR:     0 iterations, n = 13, <r,r> = 4.812252e+04, |r|/|b| = 1.000000e+00
MG level 2 (GPU): GCR:     0 iterations, n = 14, <r,r> = 4.755250e+04, |r|/|b| = 1.000000e+00
MG level 2 (GPU): GCR:     0 iterations, n = 15, <r,r> = 4.762029e+04, |r|/|b| = 1.000000e+00
MG level 3 (GPU): CA-GCR:     0 iterations, n = 0, <r,r> =       nan, |r|/|b| =       nan
MG level 3 (GPU): ERROR: Solver appears to have diverged for n = 0 (rank 0, host ipp2-0709.nvidia.com, solver.cpp:479 in void quda::Solver::PrintStats(const char*, int, quda::cvector<double>&, quda::cvector<double>&, quda::cvector<double>&)())
MG level 3 (GPU):        last kernel called was (name=cudaMemsetAsync,volume=bytes=12288,aux=zero,color_spinor_field.cpp,409)
MG level 3 (GPU): Saving 1207 sets of cached parameters to /scratch/local/build/tests/tunecache/tunecache_error.tsv

Reference tunecache: tunecache_fail.tar.gz

Commit id: 49c0a583acf99faf106d96041762e9b6d9f7b60a
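The "Solver appears to have diverged" message in the log above fires when a residual norm comes back NaN for some RHS; that check amounts to per-RHS finiteness testing, sketched here with illustrative names (not QUDA's actual PrintStats code):

```python
import math

# Hedged sketch: flag the RHS indices whose squared residual <r,r> is NaN or
# infinite, as the per-n divergence check in the log above appears to do.

def diverged_rhs(r2: list[float]) -> list[int]:
    """Return indices n whose squared residual is not finite."""
    return [n for n, v in enumerate(r2) if not math.isfinite(v)]

print(diverged_rhs([4.8e4, float("nan"), 4.7e4]))  # [1]
print(diverged_rhs([1.0, 2.0]))                    # []
```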

weinbe2 avatar Jan 15 '25 22:01 weinbe2

Infinitely cleaner command... thanks @hummingtree

for PREC in half single
do
    mpirun -n 1 ./multigrid_benchmark_test --test 0 --dim 2 2 2 2 --niter 10 --nsrc 8 --prec-sloppy ${PREC} --mg-nvec 0 96 --mg-dslash-use-mma 0 true
done

weinbe2 avatar Jan 15 '25 22:01 weinbe2

(Quoting weinbe2's report above about the --mg-dslash-use-mma failure with single precision, Nc = 96.)

Thanks Evan for the tests! This should have been fixed in https://github.com/lattice/quda/pull/1497/commits/e8ca86924e6ed917717f4c062e3a2573d6ff11a5.

hummingtree avatar Jan 16 '25 17:01 hummingtree

cscs-ci run

weinbe2 avatar Jan 21 '25 14:01 weinbe2