Staggered heavy-quark residual fails to regulate HISQ CG convergence properly with an odd checkerboard source
When the inversion source has support on only even sites, convergence seems to be OK with a tiny heavy-quark residual. But when a source has support on only odd sites, the residual decreases at a glacial rate while the inverter runs out of restarts, leaving a heavy-quark residual of order one or two hundred. The propagator solution at moderate distance from the source seems to be very sensitive to the input stopping conditions, so I suspect it is not properly converged.
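For readers unfamiliar with the quantity being tracked here, the sketch below illustrates one common notion of a per-site ("heavy-quark") relative residual. It is a hedged illustration only; the function name and the max-over-sites aggregation are assumptions of mine and are not taken from QUDA's or MILC's actual implementation.

// Hypothetical sketch of a per-site relative residual of the kind discussed in
// this issue. Not QUDA's or MILC's code; the names and the max-over-sites
// aggregation are illustrative assumptions.
#include <algorithm>
#include <cstddef>
#include <vector>

// r_site_norm[i] and x_site_norm[i] hold the norm of the residual and of the
// solution on lattice site i.
double heavy_quark_residual(const std::vector<double> &r_site_norm,
                            const std::vector<double> &x_site_norm)
{
  double hq = 0.0;
  for (std::size_t i = 0; i < r_site_norm.size(); ++i) {
    // Sites where the solution norm is tiny inflate this ratio, which is how a
    // heavy-quark residual of order one or two hundred can coexist with a
    // small global L2 residual.
    if (x_site_norm[i] > 0.0) hq = std::max(hq, r_site_norm[i] / x_site_norm[i]);
  }
  return hq;
}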
Thanks for the info, Carleton. Do you have a reference MILC input file I can use to reproduce this? Also, what ensemble(s) have you been seeing this on?
Thank you for helping with this, Evan. I should say first that I am building QUDA with Jim Simone's branch, "not_a_feature_rather_a_hope/staggered_correlator_gk", but he has been merging develop into it. In retrospect, I found the same problem in output logs from Perlmutter with QUDA/develop, so the problem should also be reproducible with the develop branch. It will take a little time to create the reproducer. In the meantime, it would be worth looking at the code to see if there is some obvious difference in the heavy-quark residual treatment between an exclusively even-site and an exclusively odd-site source.
Thanks Carleton, the reproducer may be necessary so I can understand the full workflow. The CG code "doesn't know" about even/odd; it's just handed an operator. The stencil code knows the bare minimum necessary, most relevantly in the prepare/reconstruct bits. You can double-check this yourself, though from a quick skim everything looks fine (i.e., "symmetric" between even and odd); the code to check is the DiracImprovedStaggeredPC operator: https://github.com/lattice/quda/blob/develop/lib/dirac_improved_staggered.cpp
Remind me, does MILC use the odd Schur op directly, or does it prepare an even source, use the even op, and reconstruct?
Thanks, Evan. Do you have access to Frontier? Or should I port the reproducer to Summit?
MILC calls QUDA specifying the "parity" of the solve. For a strictly odd parity source (rhs), its even-odd block decomposition has the form
[ 0 ]
[ b_o ]
and, when QUDA is called, specifying odd parity, the solution should have the form
[ 0 ]
[ B b_o ]
where B = 1/(D^2 + 4m^2).
The MILC code "reconstructs" by multiplying by M^\dagger:
[ -D B b_o ]
[ 2m B b_o ]
For a strictly even-parity source with a call to QUDA specifying even parity, we just interchange even and odd here. There is nothing asymmetric in the MILC treatment outside QUDA.
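To make the reconstruction above explicit, here it is in matrix form. This is only a restatement of the description just given (the block labels D_{eo}, D_{oe} are mine, and the signs assume the usual anti-Hermitian staggered D), not the literal MILC code:

\[
  M = \begin{pmatrix} 2m & D_{eo} \\ D_{oe} & 2m \end{pmatrix}, \qquad
  b = \begin{pmatrix} 0 \\ b_o \end{pmatrix}, \qquad
  x_{\mathrm{pc}} = \begin{pmatrix} 0 \\ B\, b_o \end{pmatrix}, \qquad
  B = \frac{1}{D^2 + 4m^2},
\]
\[
  x = M^\dagger x_{\mathrm{pc}}
    = \begin{pmatrix} 2m & -D_{eo} \\ -D_{oe} & 2m \end{pmatrix}
      \begin{pmatrix} 0 \\ B\, b_o \end{pmatrix}
    = \begin{pmatrix} -D_{eo}\, B\, b_o \\ 2m\, B\, b_o \end{pmatrix},
\]

which matches the components quoted above.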
Thanks Carleton. I'm in the middle of moving and I'm not quite sure where my keyfob is right now; can you send me your submit script and input file via Slack or e-mail? I have configurations of various sizes (64^3, 96^3, 144^3, 192^3) on our internal cluster; hopefully one of those will suffice to reproduce the issue.
Also, thank you for describing the measurement. Between your description and my code investigations, I can't spot any inherent issue/asymmetry, but extra investigations will clearly be in order.
One question: is the host source in MILC single parity, or the length of the full volume? It looks like qudaInvert is assuming it is a full-volume source (contiguous even and odd), and as such there is an offset into just the odd part. This has been the assumption in the code for a very, very long time; I checked, and it's been this way for 10 years!!
If in reality the source in MILC is single parity (just odd), this could give garbage, and if by some off chance the memory it points to is zero, it's reasonable that the solve would be very slow (as well as meaningless).
For reference:
- The offset calculation routine is called here: https://github.com/lattice/quda/blob/develop/lib/milc_interface.cpp#L1203 (note the hard-coded false)
- And the offset routine itself is called here: https://github.com/lattice/quda/blob/develop/lib/milc_interface.cpp#L1031 ; note that, since the second argument is false, it'll give an offset halfway into the lattice (volume * 6 / 2, or full volume times Nc == 3 times 2 for complex divided by 2 for half of the lattice).
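To make the arithmetic in the second reference concrete, here is a small standalone sketch (hypothetical code of mine, not taken from milc_interface.cpp) of the offset that a full-volume layout implies, and why reading at that offset from a single-parity buffer would land out of bounds.

// Hypothetical illustration of the offset described above; numbers follow the
// 144^3 x 288 lattice discussed in this thread.
#include <cstddef>
#include <cstdio>

int main()
{
  const std::size_t volume = 144ULL * 144 * 144 * 288;  // full lattice sites
  const int Nc = 3;                                     // colors
  const int reals_per_site = 2 * Nc;                    // complex -> 2 reals, so 6 per site

  // A full-volume field laid out as [even sites | odd sites] has its odd part
  // starting at volume * 6 / 2 reals, exactly the expression quoted above.
  const std::size_t odd_offset = volume * reals_per_site / 2;

  // A single-parity (odd-only) host buffer holds only volume/2 sites, i.e.
  // volume * 6 / 2 reals in total, so reading at odd_offset runs past its end:
  // that gives garbage, or zeros if that memory happens to be zeroed.
  const std::size_t single_parity_length = volume / 2 * reals_per_site;

  std::printf("odd-parity offset     = %zu reals\n", odd_offset);
  std::printf("single-parity length  = %zu reals\n", single_parity_length);
  return 0;
}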
All of the MILC color vector fields are full.
Hi Evan,
I sent a bunch of files in my last message, but I just discovered that our MIMEdefang probably deleted the shell script attachments. So here they are with modified extensions.
Best,
Carleton
#! /bin/bash
source env.sh
QUDA_INSTALL=${INSTALLROOT}/quda
LIBQUDA="-Wl,-rpath ${QUDA_INSTALL}/lib -L${QUDA_INSTALL}/lib -lquda -D__gfx90a --amdgpu-target=gfx90a -Wl,-rpath=${ROCM_PATH}/hiprand/lib -L${ROCM_PATH}/hiprand/lib -Wl,-rpath=${ROCM_PATH}/rocfft/lib -L${ROCM_PATH}/rocfft/lib -lhiprand -lrocfft -Wl,-rpath=${ROCM_PATH}/hipblas/lib -L${ROCM_PATH}/hipblas/lib -lhipblas -Wl,-rpath=${ROCM_PATH}/rocblas/lib -L${ROCM_PATH}/rocblas/lib -lrocblas -Wl,-rpath=${ROCM_PATH}/hip/lib"
############ Make ks_spectrum_hisq ##################
cd milc_qcd/ks_spectrum
cp ../Makefile .
make clean
MY_CC=hipcc
MY_CXX=hipcc
ARCH=""
COMPILER="gnu"
OFFLOAD="HIP"
OPT="-O3 -Ofast -g"
PATH_TO_NVHPCSDK=""
CUDA_HOME=""
QUDA_HOME=${QUDA_INSTALL}
QUDA_VERBOSITY=VERBOSE
WANTQUDA=true
WANT_FN_CG_GPU=true
WANT_FL_GPU=true
WANT_GF_GPU=true
WANT_FF_GPU=true
WANT_KS_CONT_GPU=true
WANT_SHIFT_GPU=true
WANT_SPIN_TASTE_GPU=true
WANT_GAUGEFIX_OVR_GPU=true
WANT_MIXED_PRECISION_GPU=1
PRECISION=2
MPP=true
OMP=true
WANTQIO=true
WANTQMP=true
QIOPAR=/ccs/home/detar/frontier/quda/install/quda
QMPPAR=/ccs/home/detar/frontier/quda/install/quda
LIBQUDA=${LIBQUDA}
CGEOM="-DFIX_NODE_GEOM -DFIX_IONODE_GEOM"
KSCGMULTI="-DKS_MULTICG=HYBRID -DMULTISOURCE -DMULTIGRID"
CTIME="-DNERSC_TIME -DCGTIME -DFFTIME -DFLTIME -DGFTIME -DREMAP -DPRTIME -DIOTIME"
make -j 1 ks_spectrum_hisq
cd ..
############ Make su3_rhmd_hisq ##################
cd ks_imp_rhmc
cp ../Makefile .
make clean
MY_CC=hipcc
MY_CXX=hipcc
ARCH=""
COMPILER="gnu"
OFFLOAD="HIP"
OPT="-O3 -Ofast"
PATH_TO_NVHPCSDK=""
CUDA_HOME=""
QUDA_HOME=${QUDA_INSTALL}
WANTQUDA=true
WANT_FN_CG_GPU=true
WANT_FL_GPU=true
WANT_GF_GPU=true
WANT_FF_GPU=true
WANT_GAUGEFIX_OVR_GPU=true
WANT_MIXED_PRECISION_GPU=2
PRECISION=1
MPP=true
OMP=true
WANTQIO=true
WANTQMP=true
QIOPAR=/ccs/home/detar/frontier/quda/install/quda
QMPPAR=/ccs/home/detar/frontier/quda/install/quda
LIBQUDA=${LIBQUDA}
CGEOM="-DFIX_NODE_GEOM -DFIX_IONODE_GEOM"
KSCGMULTI="-DKS_MULTICG=HYBRID -DMULTISOURCE -DMULTIGRID"
CTIME="-DNERSC_TIME -DCGTIME -DFFTIME -DFLTIME -DGFTIME -DREMAP -DPRTIME -DIOTIME"
make -j 1 su3_rhmd_hisq
cd ../..
#! /bin/bash
BRANCH=not_a_feature_rather_a_hope/staggered_correlator_gk #BRANCH=develop
source env.sh
pushd quda
QUDA_HOME=$(pwd)
if [ -d quda ]
then
  cd quda
  git pull
  git checkout develop
  git checkout ${BRANCH}
else
  git clone --branch ${BRANCH} https://github.com/lattice/quda
  cd quda
  git checkout ${BRANCH}
fi
cd ..
mkdir -p build && cd build
cmake \
  -DCMAKE_BUILD_TYPE=RELEASE \
  -DCMAKE_CXX_COMPILER=CC \
  -DCMAKE_CXX_FLAGS="--offload-arch=gfx90a" \
  -DCMAKE_C_COMPILER=cc \
  -DCMAKE_C_FLAGS="--offload-arch=gfx90a" \
  -DCMAKE_C_STANDARD=99 \
  -DCMAKE_EXE_LINKER_FLAGS="--offload-arch=gfx90a" \
  -DCMAKE_SHARED_LINKER_FLAGS="--offload-arch=gfx90a" \
  -DCMAKE_HIP_FLAGS="--offload-arch=gfx90a" \
  -DCMAKE_INSTALL_PREFIX=${INSTALLROOT}/quda \
  -DQUDA_BUILD_SHAREDLIB=ON \
  -DQUDA_CONTRACT=ON \
  -DQUDA_COVDEV=ON \
  -DQUDA_DIRAC_DEFAULT_OFF=ON \
  -DQUDA_DIRAC_STAGGERED=ON \
  -DQUDA_DOWNLOAD_USQCD=ON \
  -DQUDA_GPU_ARCH=gfx90a \
  -DQUDA_QIO=ON \
  -DQUDA_QMP=ON \
  -DQUDA_TARGET_TYPE=HIP \
  -DROCM_PATH=${ROCM_PATH} \
  ${QUDA_HOME}/quda
make -j16 install
cd ..
#! /bin/bash
#SBATCH -t 10:00
#SBATCH -N 96
#SBATCH -n 768
#SBATCH --cpus-per-task=6
#SBATCH --ntasks-per-node=8
####### -J (command line in spawnjob.py)
#SBATCH -A phy157-ecphisq
#SBATCH -V
####### -C nvme
#SBATCH -S 2
# Submission command must define the environment variable RUNCMDFILE:
#   sbatch -N ${NODES} -t ${walltime} -J ${jobname} ${slurm_script}
nodes=96
umask 0022
source env.sh
# Run production jobs out of $SCRATCH
SCRATCH_HOME=/gpfs/alpine/proj-shared/phy157/phy157hisq/detar/allHISQ/frontier
MYSCRATCH=${SCRATCH_HOME}/l144288f211b700m000569m01555m1827
mkdir -p ${MYSCRATCH}
cd ${MYSCRATCH}
# QUDA flags
export QUDA_ENABLE_GDR=1
export QUDA_ENABLE_P2P=1
export QUDA_MILC_HISQ_RECONSTRUCT=13
export QUDA_MILC_HISQ_RECONSTRUCT_SLOPPY=9
export QUDA_RESOURCE_PATH=`pwd`
# location of QUDA tunecache file
# MPICH flags
export MPICH_GPU_SUPPORT_ENABLED=1   # Allow GDR
export MPICH_COLL_SYNC=MPI_Bcast     # Sometimes broadcast is not synchronizing....
export MPICH_RDMA_ENABLED_CUDA=1
export MPICH_OFI_NIC_POLICY=BLOCK
export MPICH_SMP_SINGLE_COPY_MODE=CMA
# Balint 6-CPU masks for Frontier
export OMP_NUM_THREADS=6
export OMP_PROC_BIND=spread
MASK_0="0x003f000000000000"
MASK_1="0x3f00000000000000"
MASK_2="0x00000000003f0000"
MASK_3="0x000000003f000000"
MASK_4="0x000000000000007e"
MASK_5="0x0000000000007e00"
MASK_6="0x0000003f00000000"
MASK_7="0x00003f0000000000"
MEMBIND="--mem-bind=map_mem:3,3,1,1,0,0,2,2"
CPU_MASK="--cpu-bind=mask_cpu:${MASK_0},${MASK_1},${MASK_2},${MASK_3},${MASK_4},${MASK_5},${MASK_6},${MASK_7}"
srun -n 768 -N 96 ks_spectrum_hisq --distribution=*:block ${CPU_MASK} -qmp-geom 2 4 4 24 -qmp-alloc-map 3 2 1 0 -qmp-logic-map 3 2 1 0 intest outtest
#== Geometry ==
prompt 0
nx 144
ny 144
nz 144
nt 288
node_geometry 2 4 4 24
ionode_geometry 2 4 4 24
iseed 402129
job_id 1311221
#== Gauge ==
reload_parallel /lustre/orion/proj-shared/phy157/phy157_hisq/detar/allHISQ/l144288f211b700m000569m01555m1827/lat/v5/l144288f211b700m000569m01555m1827a.402
u0 1
no_gauge_fix
forget
staple_weight 0.05
ape_iter 20
coordinate_origin 0 0 0 0
time_bc antiperiodic
#== Eigen ==
max_number_of_eigenpairs 0
#== PBP Masses ==
number_of_pbp_masses 0
#== Base Sources ==
number_of_base_sources 1
#== source 0: RandomColorWallSource ==
random_color_wall
field_type KS
subset corner
t0 129
ncolor 1
momentum 0 0 0
source_label RW
forget_source
#== Modified Sources ==
number_of_modified_sources 0
#== KSsolveSets ==
number_of_sets 1
#== KSsolveSet ==
set_type single
inv_type CGZ
max_cg_iterations 4000
max_cg_restarts 10
check yes
momentum_twist 0 0 0
precision 2
source 0
number_of_propagators 2
#== propagator 0: KSsolveElement ==
mass 0.000569
naik_term_epsilon 0.
error_for_propagator 1e-7
rel_error_for_propagator 0.0
fresh_ksprop
forget_ksprop
#== propagator 1: KSsolveElement ==
mass 0.843
naik_term_epsilon -0.3578
error_for_propagator 0
rel_error_for_propagator 2e-4
fresh_ksprop
forget_ksprop
#== Quarks ==
number_of_quarks 2
#== quark 0: QuarkIdentitySink ==
propagator 0
identity
op_label d
forget_ksprop
#== quark 1: QuarkIdentitySink ==
propagator 1
identity
op_label d
forget_ksprop
number_of_mesons 1
#== MesonSpectrum ==
pair 1 0
spectrum_request meson
forget_corr
r_offset 0 0 0 129
number_of_correlators 1
correlator P5-P5 p000-fine 1 / 124416.0 G5-G5 0 0 0 EO EO EO
#== Baryons ==
number_of_baryons 0
# From Peter for Grid
module swap PrgEnv-cray PrgEnv-amd
module load craype-accel-amd-gfx90a
#module load cray-mpich/8.1.23
module load cmake
#module load amd/5.3.0
module load cray-hdf5
module load cray-fftw
module load gmp
module load emacs
module unload cray-libsci
module list
# These must be set before running
export TOPDIR_HIP=~/frontier/quda
export SRCROOT=${TOPDIR_HIP}
export BUILDROOT=${TOPDIR_HIP}
export INSTALLROOT=${TOPDIR_HIP}/install
export TARGET_GPU=gfx90a
#GTL_ROOT=$PE_MPICH_GTL_DIR_amd_gfx90a
GTL_ROOT=/opt/cray/pe/mpich/8.1.25/gtl/lib
MPI_CFLAGS="-I${MPICH_DIR}/include -g"
MPI_LDFLAGS="-g -Wl,-rpath=${MPICH_DIR}/lib -L${MPICH_DIR}/lib -lmpi -L${GTL_ROOT} -Wl,-rpath=${GTL_ROOT} -lmpi_gtl_hsa"
export PK_BUILD_TYPE="Release"
export PATH=${ROCM_PATH}/bin:${ROCM_PATH}/llvm/bin:${PATH}
QIOLIB=${INSTALLROOT}/qio/lib
QMPLIB=${INSTALLROOT}/qmp/lib
export LD_LIBRARY_PATH=${INSTALLROOT}/quda/lib:${QMPLIB}:${QIOLIB}:${ROCM_PATH}/llvm/lib64:${ROCM_PATH}/llvm/lib:${MPICH_DIR}/lib:${GTL_ROOT}:${LD_LIBRARY_PATH}
export LD_LIBRARY_PATH=/opt/cray/pe/gcc/mpfr/3.1.4/lib:${LD_LIBRARY_PATH}
end
Thank you, Carleton. I'm sorry that I haven't had a chance to test this yet, but I'll be able to on Monday; the requisite scripts are essentially ready to go.
Just an update: as a quick test I checked whether I could reproduce the behavior on a smaller lattice (64^3x96), and I was unsuccessful. Now I'm going to try a configuration from an ensemble with the same global volume, beta, quark masses, etc. I'll keep you updated.
I may have found the issue, will post back soon.
I've reproduced the behavior on a 144^3 configuration, for both an odd and even source. It seems like the logic for heavy quark residual reliable updates is breaking down in ways that it wasn't on smaller configurations (i.e., 64^3, 96^3). We have a call on Wednesday and we'll figure out a solution.
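For context, "reliable updates" refers to periodically recomputing the true residual r = b - A x instead of trusting the recursively accumulated one, and possibly restarting when the convergence target (here, the heavy-quark residual) has stalled. The skeleton below is a generic illustration with hypothetical names, not QUDA's actual solver code.

// Generic reliable-update/restart skeleton for a CG-like solver; illustrative
// only, with made-up names, and not taken from QUDA's sources.
struct SolverState {
  double r2;      // recursively accumulated residual norm squared
  double r2_max;  // largest r2 seen since the last reliable update
  double hq_res;  // heavy-quark residual, recomputed at reliable updates
};

// Trigger a reliable update (recompute r = b - A x) once the accumulated
// residual has dropped by a factor delta relative to its recent maximum.
bool need_reliable_update(const SolverState &s, double delta)
{
  return s.r2 < delta * delta * s.r2_max;
}

// One possible restart policy: restart when the freshly computed heavy-quark
// residual has not improved on the best value so far and restarts remain.
// The report above is that logic of this general kind misbehaves on the
// larger (144^3) configurations.
bool should_restart(double hq_res_now, double hq_res_best, int restarts, int max_restarts)
{
  return hq_res_now >= hq_res_best && restarts < max_restarts;
}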
Hi Evan,
Any progress?
Thanks,
Carleton
I've put together a fix for the heavy-quark convergence issue that at least works in the cases where I've been able to trigger the issue. Can you please test it for your case, @detar? The code is in the branch hotfix/heavy-quark-restart. Once you've confirmed it works, I'll get the ball rolling on a formal PR into develop.