E3SM
Update Chicoma-CPU and add Chicoma-GPU
This merge updates support for Chicoma at LANL. It makes a few updates to Chicoma-CPU and adds support for Chicoma's GPU partition.
Further discussion can be seen at: https://github.com/E3SM-Ocean-Discussion/E3SM/pull/73
@xylar with my latest commit, gnugpu also works on chicoma-gpu for:
./create_test SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.chicoma-gpu_gnugpu --walltime 1:00:00 --wait -p g23_nonhydro_g
@vanroekel, thanks so much! With your latest changes, are you ready to approve this PR?
@mark-petersen and @jonbob, could you review when you have time?
CPU passes:
./create_test SMS_D.T62_oQU120_ais20.MPAS_LISIO_TEST.chicoma-cpu_gnu -p t24_coastal_ocean --walltime 00:30:00
But chicoma-gpu_nvidiagpu fails:
./create_test SMS_D.T62_oQU120_ais20.MPAS_LISIO_TEST.chicoma-gpu_nvidiagpu -p t24_coastal_ocean --walltime 00:30:00
with compiler library error:
/lustre/scratch4/turquoise/.mdt3/mpeterse/E3SM/scratch/chicoma-gpu/SMS_D.T62_oQU120_ais20.MPAS_LISIO_TEST.chicoma-gpu_nvidiagpu.20240213_093527_4f4wcu/bld/cmake-bld/mpas-framework/src/tools/parse:
error while loading shared libraries: libcuda.so.1: cannot open shared object file: No such file or directory
gmake[2]: *** [mpas-framework/src/CMakeFiles/ocn.dir/build.make:678: core_ocean/inc/core_variables.inc] Error 127
Is that expected with nvidiagpu? I'll try chicoma-gpu_gnugpu next.
I tried the same command as Luke above and got the same error:
error while loading shared libraries: libcuda.so.1: cannot open shared object file: No such file or directory
I am just compiling on a chicoma front-end node, with no extra modules loaded. Is that what you did?
@vanroekel, was the final solution to run ./create_test on GPU compute nodes? I know that was something you tried, but I wasn't sure if it was the final answer.
@mark-petersen, can you try that?
Yes, exactly right: GPU tests won't build on the login nodes.
Thanks. I can log into a gpu node, then it builds correctly:
salloc -N 1 -t 2:0:0 --qos=debug --reservation=debug --account=g23_nonhydro_g
cd /usr/projects/climate/mpeterse/repos/E3SM/pr/cime/scripts
./create_test SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.chicoma-gpu_gnugpu --walltime 1:00:00 --wait -p g23_nonhydro_g
So far so good, but the run step dies.
cd /lustre/scratch4/turquoise/mpeterse/E3SM/scratch/chicoma-gpu/SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.chicoma-gpu_gnugpu.20240214_141147_88vxab
tail test.SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.chicoma-gpu_gnugpu.20240214_141147_88vxab.10341119
...
2024-02-14 14:16:19 MODEL EXECUTION HAS FINISHED
ERROR: RUN FAIL: Command 'srun --label -n 64 -N 16 -c 64 --cpu_bind=cores -m plane=4 /lustre/scratch4/turquoise/mpeterse/E3SM/scratch/chicoma-gpu/SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.chicoma-gpu_gnugpu.20240214_141147_88vxab/bld/e3sm.exe >> e3sm.log.$LID 2>&1 ' failed
the error message is:
cat run/e3sm.log.10341119.240214-141614
srun: error: Unable to create step for job 10341119: More processors requested than permitted
It looks like something is wrong with the srun flags -n 64 -N 16 -c 64. Note this srun flag from the man page:
-c, --cpus-per-task=<ncpus>
    Request that ncpus be allocated per process.
So I'm not even sure we are running on GPUs here.
Luke, did your command above work with no alterations? I don't understand why mine would be different.
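For reference, the per-node arithmetic behind the "More processors requested than permitted" error can be sketched as follows (a sketch only; the 128-core node size is an assumption about Chicoma's node geometry, not stated in this thread):

```shell
# Why `srun -n 64 -N 16 -c 64` overcommits a node: with -c (cpus-per-task)
# at 64, each of the 4 tasks per node claims 64 CPUs.
ntasks=64; nnodes=16; cpus_per_task=64
tasks_per_node=$((ntasks / nnodes))               # 4 tasks per node
cpus_per_node=$((tasks_per_node * cpus_per_task)) # 256 CPUs requested per node
echo "CPUs requested per node: $cpus_per_node"    # 256 exceeds an (assumed) 128-core node
```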
@vanroekel, I believe this is waiting on a response from you.
@mark-petersen
salloc -N 1 -t 2:0:0 --qos=debug --reservation=debug --account=g23_nonhydro_g
I could be wrong but I don't think this command is sufficient to get you onto the GPU nodes. Perhaps that's the problem. @vanroekel, can you more explicitly list the steps that you took both to get an interactive node and to run the test?
Sorry for the delay; I missed this in my emails. I had to log into the GPU node to build, but then had to drop out of the GPU node to submit the test for running. Let me double-check that today; I'll give it a try after my next meeting.
Yes, on Chicoma there is a separate gpu-debug partition, and gpu-debug reservation arguments are needed as part of the salloc or srun for debugging on the GPU partition.
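For example, an interactive allocation on the GPU debug partition might look like this (a sketch only; the gpu_debug partition and reservation names follow the comments in this thread, and the account is the one used in the earlier commands):

```shell
# Hypothetical interactive allocation on Chicoma's GPU debug partition
salloc -N 1 -t 2:0:0 --partition=gpu_debug --reservation=gpu_debug --account=g23_nonhydro_g
```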
Well shoot, I thought I had this working, but am seeing the same error as @mark-petersen. I switched to the salloc command with gpu_debug for partition and reservation and still get the error. @philipwjones do you have any other suggestions?
The -c is the number of cores per MPI task, so it should be the number of threads if this is a threaded run; otherwise it should be -c 1. Sorry I didn't catch that earlier.
Also, for GPU runs you might have better luck using --accel-bind=g --cpu-bind=rank_ldom, and picking the number of CPUs equal to the GPU count. The rank_ldom makes sure the cores are divided evenly across NUMA domains. If you oversubscribe CPUs to GPUs, you may need to launch the multi-process server (MPS) daemon; that's what we have to do on the similar pm-gpu partitions.
To launch the MPS daemon, add nvidia-cuda-mps-control -d to the batch script. I haven't tried that on Chicoma, so I don't know if they have it enabled, but that's how to start it on pm.
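Put together, a minimal batch-script sketch of these suggestions might look like the following (flag values are taken from the comments above; the node/task counts and executable path are placeholders, and whether MPS is enabled on Chicoma is unverified):

```shell
#!/bin/bash
# Sketch only: combines the srun/MPS suggestions above; counts are placeholders.
nvidia-cuda-mps-control -d   # start the MPS daemon (if CPUs oversubscribe GPUs)

# Bind each rank to a GPU and spread ranks evenly across NUMA domains;
# number of MPI tasks chosen equal to the GPU count (4 GPUs/node assumed).
srun -N 16 -n 64 -c 1 --accel-bind=g --cpu-bind=rank_ldom ./e3sm.exe
```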
chicoma scratch4 is now read-only. Should we change this at the same time?
<machine MACH="chicoma-cpu">
...
<CIME_OUTPUT_ROOT>/lustre/scratch4/turquoise/$ENV{USER}/E3SM/scratch/chicoma-cpu</CIME_OUTPUT_ROOT>
<DOUT_S_ROOT>/lustre/scratch4/turquoise/$ENV{USER}/E3SM/archive/$CASE</DOUT_S_ROOT>
<BASELINE_ROOT>/lustre/scratch4/turquoise/$ENV{USER}/E3SM/input_data/ccsm_baselines/$COMPILER</BASELINE_ROOT>
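The change itself is a simple path substitution; here is a runnable sketch of the mapping (that the enclosing file is CIME's config_machines.xml is an assumption):

```shell
# Demonstrate the scratch4/turquoise -> scratch5 path change on one entry;
# in the PR this substitution would be applied to the machine config entries above.
old='/lustre/scratch4/turquoise/$ENV{USER}/E3SM/scratch/chicoma-cpu'
echo "$old" | sed 's|/lustre/scratch4/turquoise|/lustre/scratch5|'
# -> /lustre/scratch5/$ENV{USER}/E3SM/scratch/chicoma-cpu
```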
@mark-petersen , yes please push changes to my fork to move to scratch5.
Updated scratch4/turquoise to scratch5. Tested with:
./create_test SMS_D.T62_oQU120_ais20.MPAS_LISIO_TEST.chicoma-cpu_gnu -p t24_coastal_ocean --walltime 00:30:00
and it creates a case directory here: /lustre/scratch5/mpeterse/E3SM/scratch/chicoma-cpu/
and passes.
I also tried on gpu:
salloc -N 1 -t 2:0:0 --qos=debug --reservation=debug --account=g23_nonhydro_g
./create_test SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.chicoma-gpu_gnugpu --walltime 1:00:00 --wait -p g23_nonhydro_g
and it makes a case directory in /lustre/scratch5/mpeterse/E3SM/scratch/chicoma-gpu/
and compiles but does not get through the run step.
It appears to hang with:
./create_test SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.chicoma-gpu_gnugpu --walltime 1:00:00 --wait -p g23_nonhydro_g
create_test will do up to 1 tasks simultaneously
create_test will use up to 160 cores simultaneously
Creating test directory /lustre/scratch5/mpeterse/E3SM/scratch/chicoma-gpu/SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.chicoma-gpu_gnugpu.20240303_141934_qn8vgw
RUNNING TESTS:
SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.chicoma-gpu_gnugpu
Starting CREATE_NEWCASE for test SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.chicoma-gpu_gnugpu with 1 procs
Finished CREATE_NEWCASE for test SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.chicoma-gpu_gnugpu in 1.485376 seconds (PASS)
Starting XML for test SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.chicoma-gpu_gnugpu with 1 procs
Finished XML for test SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.chicoma-gpu_gnugpu in 0.334573 seconds (PASS)
Starting SETUP for test SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.chicoma-gpu_gnugpu with 1 procs
Finished SETUP for test SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.chicoma-gpu_gnugpu in 7.473550 seconds (PASS)
Starting SHAREDLIB_BUILD for test SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.chicoma-gpu_gnugpu with 1 procs
Finished SHAREDLIB_BUILD for test SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.chicoma-gpu_gnugpu in 104.659378 seconds (PASS)
Starting MODEL_BUILD for test SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.chicoma-gpu_gnugpu with 7 procs
Finished MODEL_BUILD for test SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.chicoma-gpu_gnugpu in 192.617112 seconds (PASS)
Starting RUN for test SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.chicoma-gpu_gnugpu with 1 proc on interactive node and 64 procs on compute nodes
Finished RUN for test SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.chicoma-gpu_gnugpu in 4.229908 seconds (PEND). [COMPLETED 1 of 1]
Waiting for tests to finish
and does not actually launch.
I also pushed a deletion of grizzly and badger (farewell!).
@mark-petersen -- I would have preferred that cleaning up old machines happen separately, since it is a separate issue. But we can change the scope and title of this PR if necessary
Removed the last two commits, which deleted badger and grizzly; I will make that change in another PR for clarity. Thanks @jonbob for the suggestion.
passes sanity testing on lcrc:
- ERS_Ld5.T62_oQU120.CMPASO-NYF.chrysalis_intel
- ERP_Ld3.ne30pg2_r05_IcoswISC30E3r5.WCYCL1850.chrysalis_intel.allactive-pioroot1
merged to next
merged to master
Thanks @jonbob! And @vanroekel and @mark-petersen for helping me so much on this branch!