So how many GPUs are on a Perlmutter GPU node again?

Open Baharis opened this issue 1 year ago • 9 comments

A Perlmutter GPU node features 1x AMD EPYC 7763 CPU and 4x NVIDIA A100 GPUs (link). It would therefore be reasonable to assume that, when running scripts that use CUDA or KOKKOS, the environment variable `CCTBX_GPUS_PER_NODE` should be set to 4. To my surprise, I discovered today that setting it to anything but 1 causes a CUDA assertion error:

`GPUassert: invalid device ordinal /global/cfs/cdirs/m3562/users/dtchon/p20231/alcc-recipes2/cctbx/modules/cctbx_project/simtbx/diffBragg/src/diffBraggCUDA.cu 70`

This might be intended behavior, but I found it confusing. Following @JBlaschke's suggestion, I opened this issue to discuss it.

The issue can be reproduced by running the file `/global/cfs/cdirs/m3562/users/dtchon/p20231/common/ensemble1/SPREAD/8mosaic/debug/mosaic_lastfiles.sh` on a Perlmutter GPU interactive node allocated with `salloc -N 1 -J mosaic_int -A m3562_g -C gpu --qos interactive -t 15`. If you don't have access, or don't have a cctbx installation that includes the psii_spread workers, the relevant portion of the file is essentially:

```shell
export DIFFBRAGG_USE_CUDA=1
export CUDA_LAUNCH_BLOCKING=1
export NUMEXPR_MAX_THREADS=128
export SLURM_CPU_BIND=cores
export OMP_PROC_BIND=spread
export OMP_PLACES=threads
export CCTBX_GPUS_PER_NODE=2
srun -n 32 -c 4 --ntasks-per-gpu=8 cctbx.xfel.merge trial8.phil
```
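
To help narrow this down, here is a minimal diagnostic sketch. It rests on two assumptions that are not confirmed anywhere in this issue: that each task picks its device ordinal as its rank modulo `CCTBX_GPUS_PER_NODE`, and that each task may see only a subset of the node's GPUs through `CUDA_VISIBLE_DEVICES` (for example because of GPU binding). All function names are hypothetical helpers and not part of cctbx. If a task sees a single GPU, any ordinal other than 0 would be out of range, which would be consistent with the `invalid device ordinal` message.

```shell
# Count the entries in a CUDA_VISIBLE_DEVICES-style comma-separated list.
# An empty list counts as zero devices in this sketch.
count_visible_gpus() (
  devices="$1"
  if [ -z "$devices" ]; then
    echo 0
    exit 0
  fi
  commas=$(printf '%s' "$devices" | tr -cd ',' | wc -c)
  echo $(( commas + 1 ))
)

# ASSUMPTION: the ordinal is chosen as rank modulo GPUs-per-node.
device_ordinal() (
  echo $(( $1 % $2 ))
)

# Succeeds only if the ordinal indexes one of the visible devices.
ordinal_is_valid() (
  [ "$1" -lt "$(count_visible_gpus "$2")" ]
)

# Print a one-line report for the current task.
# SLURM_PROCID is the task rank inside an srun step; it defaults to 0 here.
report_task() (
  rank="${SLURM_PROCID:-0}"
  per_node="${CCTBX_GPUS_PER_NODE:-1}"
  if [ -z "${CUDA_VISIBLE_DEVICES+set}" ]; then
    echo "rank $rank: CUDA_VISIBLE_DEVICES is unset, nothing to check"
    exit 0
  fi
  ordinal=$(device_ordinal "$rank" "$per_node")
  if ordinal_is_valid "$ordinal" "$CUDA_VISIBLE_DEVICES"; then
    status=ok
  else
    status=INVALID
  fi
  echo "rank $rank: ordinal $ordinal is $status (visible: $CUDA_VISIBLE_DEVICES)"
)
```

To try it, save the functions to a file (say `check_gpu.sh`, a hypothetical name), add a final line that calls `report_task`, and launch it with the same `srun` options in place of `cctbx.xfel.merge trial8.phil`. Each task then reports which devices it sees and whether the assumed ordinal would be in range.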

Baharis, Mar 03 '23 03:03