So how many GPUs are on a Perlmutter GPU node again?
A Perlmutter GPU node features 1x AMD EPYC 7763 CPU and 4x NVIDIA A100 GPUs (link). It would therefore be reasonable to assume that, when running scripts which utilize CUDA or KOKKOS, the environment variable CCTBX_GPUS_PER_NODE
should be set to 4. To my surprise, I discovered today that setting it to anything other than 1 causes a CUDA assertion error:
```
GPUassert: invalid device ordinal /global/cfs/cdirs/m3562/users/dtchon/p20231/alcc-recipes2/cctbx/modules/cctbx_project/simtbx/diffBragg/src/diffBraggCUDA.cu 70
```
This might be intended behavior, but I found it confusing. Following @JBlaschke's suggestion, I made this issue to discuss it.
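My working guess (an assumption on my part, not something I have confirmed in the diffBragg source) is that the ordinal passed to `cudaSetDevice` is derived from the task's rank and `CCTBX_GPUS_PER_NODE`, while the Slurm GPU binding leaves each task with only one visible device, so any ordinal other than 0 is out of range. A minimal standalone sketch of that failure mode, compilable with nvcc; the `SLURM_LOCALID`-based mapping below is hypothetical and only for illustration:

```cuda
// Minimal sketch, NOT the diffBragg code: it only illustrates how an
// "invalid device ordinal" can be produced when a task asks for a device
// index that has not been made visible to it.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    int visible = 0;
    cudaGetDeviceCount(&visible);  // with --ntasks-per-gpu this is often 1 per task

    // Assumed rank -> ordinal mapping, for illustration only:
    // node-local task id (SLURM_LOCALID) modulo CCTBX_GPUS_PER_NODE.
    const char *gpn = getenv("CCTBX_GPUS_PER_NODE");
    const char *lid = getenv("SLURM_LOCALID");
    int gpus_per_node = gpn ? atoi(gpn) : 1;
    int local_rank    = lid ? atoi(lid) : 0;
    if (gpus_per_node < 1) gpus_per_node = 1;
    int ordinal = local_rank % gpus_per_node;

    cudaError_t err = cudaSetDevice(ordinal);
    printf("local rank %d: %d visible device(s), requested ordinal %d -> %s\n",
           local_rank, visible, ordinal, cudaGetErrorString(err));
    // If each task can only see one GPU, every ordinal other than 0 comes
    // back as cudaErrorInvalidDevice ("invalid device ordinal").
    return 0;
}
```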
The issue can be recreated by running the following file: `/global/cfs/cdirs/m3562/users/dtchon/p20231/common/ensemble1/SPREAD/8mosaic/debug/mosaic_lastfiles.sh`
on a Perlmutter GPU interactive node: `salloc -N 1 -J mosaic_int -A m3562_g -C gpu --qos interactive -t 15`.
If you don't have access or a cctbx installation that includes the psii_spread workers, the relevant portion of the file is essentially:
```bash
export DIFFBRAGG_USE_CUDA=1
export CUDA_LAUNCH_BLOCKING=1
export NUMEXPR_MAX_THREADS=128
export SLURM_CPU_BIND=cores
export OMP_PROC_BIND=spread
export OMP_PLACES=threads
export CCTBX_GPUS_PER_NODE=2
srun -n 32 -c 4 --ntasks-per-gpu=8 cctbx.xfel.merge trial8.phil
```
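If it helps the discussion: compiling the sketch above with nvcc and launching it with the same `srun -n 32 -c 4 --ntasks-per-gpu=8` line should print, per task, how many devices are visible versus which ordinal was requested, which would confirm (or rule out) a visibility mismatch as the cause of the assert.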