
Update Chicoma-CPU and add Chicoma-GPU

Open xylar opened this issue 5 months ago • 15 comments

This PR updates support for Chicoma at LANL. It makes a few changes to Chicoma-CPU and adds support for Chicoma's GPU partition.

Further discussion can be seen at: https://github.com/E3SM-Ocean-Discussion/E3SM/pull/73

xylar avatar Feb 09 '24 11:02 xylar

@xylar with my latest commit, gnugpu also works on chicoma-gpu for

./create_test SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.chicoma-gpu_gnugpu --walltime 1:00:00 --wait -p g23_nonhydro_g

vanroekel avatar Feb 09 '24 17:02 vanroekel

@vanroekel, thanks so much! With your latest changes, are you ready to approve this PR?

@mark-petersen and @jonbob, could you review when you have time?

xylar avatar Feb 10 '24 13:02 xylar

CPU passes:

./create_test SMS_D.T62_oQU120_ais20.MPAS_LISIO_TEST.chicoma-cpu_gnu -p t24_coastal_ocean --walltime 00:30:00

But chicoma-gpu_nvidiagpu fails:

./create_test SMS_D.T62_oQU120_ais20.MPAS_LISIO_TEST.chicoma-gpu_nvidiagpu -p t24_coastal_ocean --walltime 00:30:00

with a shared-library error:

/lustre/scratch4/turquoise/.mdt3/mpeterse/E3SM/scratch/chicoma-gpu/SMS_D.T62_oQU120_ais20.MPAS_LISIO_TEST.chicoma-gpu_nvidiagpu.20240213_093527_4f4wcu/bld/cmake-bld/mpas-framework/src/tools/parse: error while loading shared libraries: libcuda.so.1: cannot open shared object file: No such file or directory

gmake[2]: *** [mpas-framework/src/CMakeFiles/ocn.dir/build.make:678: core_ocean/inc/core_variables.inc] Error 127

Is that expected with nvidiagpu? I'll try chicoma-gpu_gnugpu next.

mark-petersen avatar Feb 13 '24 18:02 mark-petersen

I tried the same command as Luke above and I get the same error,

error while loading shared libraries: libcuda.so.1: cannot open shared object file: No such file or directory

I am just compiling on a chicoma front-end node, with no extra modules loaded. Is that what you did?

mark-petersen avatar Feb 13 '24 19:02 mark-petersen

@vanroekel, was the final solution to run ./create_test on GPU compute nodes? I know that was something you tried but I wasn't sure if that was the final answer.

@mark-petersen, can you try that?

xylar avatar Feb 13 '24 19:02 xylar

Yes, exactly right: GPU tests won't build on the login nodes.

vanroekel avatar Feb 13 '24 21:02 vanroekel

Thanks. I can log into a gpu node, then it builds correctly:

salloc -N 1 -t 2:0:0 --qos=debug --reservation=debug --account=g23_nonhydro_g
cd /usr/projects/climate/mpeterse/repos/E3SM/pr/cime/scripts
./create_test SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.chicoma-gpu_gnugpu --walltime 1:00:00 --wait -p g23_nonhydro_g

so far so good. But the run step dies.

cd /lustre/scratch4/turquoise/mpeterse/E3SM/scratch/chicoma-gpu/SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.chicoma-gpu_gnugpu.20240214_141147_88vxab
tail test.SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.chicoma-gpu_gnugpu.20240214_141147_88vxab.10341119
...
2024-02-14 14:16:19 MODEL EXECUTION HAS FINISHED
ERROR: RUN FAIL: Command 'srun  --label  -n 64 -N 16 -c 64  --cpu_bind=cores  -m plane=4 /lustre/scratch4/turquoise/mpeterse/E3SM/scratch/chicoma-gpu/SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.chicoma-gpu_gnugpu.20240214_141147_88vxab/bld/e3sm.exe   >> e3sm.log.$LID 2>&1 ' failed

the error message is:

cat run/e3sm.log.10341119.240214-141614
srun: error: Unable to create step for job 10341119: More processors requested than permitted

It looks like something is wrong with the srun flags -n 64 -N 16 -c 64

Note the srun flag:

-c, --cpus-per-task=<ncpus>
              Request that ncpus be allocated per process.

so I'm not even sure we are running on GPUs here.

Luke, did your command above work with no alterations? I don't understand why mine would be different.
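
One quick way to check whether the job actually sees any GPUs (just a suggestion, not something tried in this thread) is to list them from inside the allocation:

# list the GPUs visible to the job from inside the GPU allocation
srun -N 1 -n 1 nvidia-smi -L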

mark-petersen avatar Feb 14 '24 22:02 mark-petersen

@vanroekel, I believe this is waiting on a response from you.

xylar avatar Feb 27 '24 10:02 xylar

@mark-petersen

salloc -N 1 -t 2:0:0 --qos=debug --reservation=debug --account=g23_nonhydro_g

I could be wrong but I don't think this command is sufficient to get you onto the GPU nodes. Perhaps that's the problem. @vanroekel, can you more explicitly list the steps that you took both to get an interactive node and to run the test?

xylar avatar Feb 27 '24 10:02 xylar

Sorry for the delay. I missed this in my emails. I had to log into the GPU node to build, but then had to drop out of the GPU node to submit the test for running. Let me double-check that today; I'll give it a try after my next meeting.

vanroekel avatar Feb 27 '24 15:02 vanroekel

Yes, on chicoma there are separate gpu-debug partition and reservation arguments that are needed as part of the salloc or srun for debugging on the gpu partition.
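
For example, the allocation might look something like this (a sketch only; the exact partition and reservation names, written gpu_debug here, are an assumption and not confirmed in this thread):

# request one GPU node via the debug partition/reservation (names assumed)
salloc -N 1 -t 2:0:0 --partition=gpu_debug --reservation=gpu_debug --account=g23_nonhydro_g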

philipwjones avatar Feb 27 '24 16:02 philipwjones

Well shoot, I thought I had this working, but am seeing the same error as @mark-petersen. I switched to the salloc command with gpu_debug for partition and reservation and still get the error. @philipwjones do you have any other suggestions?

vanroekel avatar Feb 27 '24 20:02 vanroekel

The -c is the number of cores per MPI task, so it should be the number of threads if this is a threaded run; otherwise it should be -c 1. Sorry I didn't catch that earlier.
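
As a sketch of what that would mean for the failing launch line above (assuming an unthreaded run; the executable path is abbreviated here):

# same layout as the failing command, but with one core per MPI task
srun --label -n 64 -N 16 -c 1 --cpu_bind=cores -m plane=4 ./bld/e3sm.exe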

philipwjones avatar Feb 27 '24 21:02 philipwjones

Also, for GPU runs, you might have better luck using --accel-bind=g --cpu-bind=rank_ldom and picking the number of CPUs equal to the GPU count. The rank_ldom option makes sure the cores are divided evenly across NUMA domains. If you oversubscribe CPUs to GPUs, you may need to launch the Multi-Process Service (MPS) daemon; that's what we have to do on the similar pm-gpu partition.
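
A hedged sketch of that layout, assuming 4 GPUs per node (the GPU count and exact flag combination are assumptions, not something tested in this thread):

# one task per GPU, tasks bound to the nearest GPU and spread across NUMA domains
srun --label -N 1 -n 4 --accel-bind=g --cpu-bind=rank_ldom ./bld/e3sm.exe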

philipwjones avatar Feb 27 '24 21:02 philipwjones

To launch the MPS daemon, add nvidia-cuda-mps-control -d to the batch script. I haven't tried that on chicoma, so I don't know if they have it enabled, but that's how to start it on pm.
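
For reference, a minimal sketch of how that might look in a batch script (untested on chicoma, as noted above):

#!/bin/bash
# start the CUDA Multi-Process Service daemon before launching the model
nvidia-cuda-mps-control -d

# ... existing srun launch of e3sm.exe goes here ...

# shut the daemon down at the end of the job
echo quit | nvidia-cuda-mps-control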

philipwjones avatar Feb 27 '24 21:02 philipwjones

chicoma scratch4 is now read-only. Should we change this at the same time?

4277   <machine MACH="chicoma-cpu">
...
4283     <CIME_OUTPUT_ROOT>/lustre/scratch4/turquoise/$ENV{USER}/E3SM/scratch/chicoma-cpu</CIME_OUTPUT_ROOT>
4286     <DOUT_S_ROOT>/lustre/scratch4/turquoise/$ENV{USER}/E3SM/archive/$CASE</DOUT_S_ROOT>
4287     <BASELINE_ROOT>/lustre/scratch4/turquoise/$ENV{USER}/E3SM/input_data/ccsm_baselines/$COMPILER</BASELINE_ROOT>
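
For reference, a sketch of what those entries might look like after moving to scratch5 (dropping the turquoise subdirectory is an assumption based on the scratch5 paths reported below):

    <CIME_OUTPUT_ROOT>/lustre/scratch5/$ENV{USER}/E3SM/scratch/chicoma-cpu</CIME_OUTPUT_ROOT>
    <DOUT_S_ROOT>/lustre/scratch5/$ENV{USER}/E3SM/archive/$CASE</DOUT_S_ROOT>
    <BASELINE_ROOT>/lustre/scratch5/$ENV{USER}/E3SM/input_data/ccsm_baselines/$COMPILER</BASELINE_ROOT>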

mark-petersen avatar Mar 03 '24 17:03 mark-petersen

@mark-petersen, yes, please push changes to my fork to move to scratch5.

xylar avatar Mar 03 '24 19:03 xylar

Updated scratch4/turquoise to scratch5. Tested with:

./create_test SMS_D.T62_oQU120_ais20.MPAS_LISIO_TEST.chicoma-cpu_gnu -p t24_coastal_ocean --walltime 00:30:00

and it creates a case directory here: /lustre/scratch5/mpeterse/E3SM/scratch/chicoma-cpu/ and passes.

I also tried on gpu:

salloc -N 1 -t 2:0:0 --qos=debug --reservation=debug --account=g23_nonhydro_g
./create_test SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.chicoma-gpu_gnugpu --walltime 1:00:00 --wait -p g23_nonhydro_g

and it makes a case directory in /lustre/scratch5/mpeterse/E3SM/scratch/chicoma-gpu/ and compiles but does not get through the run step.

It appears to hang with:

./create_test SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.chicoma-gpu_gnugpu --walltime 1:00:00 --wait -p g23_nonhydro_g
create_test will do up to 1 tasks simultaneously
create_test will use up to 160 cores simultaneously
Creating test directory /lustre/scratch5/mpeterse/E3SM/scratch/chicoma-gpu/SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.chicoma-gpu_gnugpu.20240303_141934_qn8vgw
RUNNING TESTS:
  SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.chicoma-gpu_gnugpu
Starting CREATE_NEWCASE for test SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.chicoma-gpu_gnugpu with 1 procs
Finished CREATE_NEWCASE for test SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.chicoma-gpu_gnugpu in 1.485376 seconds (PASS)
Starting XML for test SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.chicoma-gpu_gnugpu with 1 procs
Finished XML for test SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.chicoma-gpu_gnugpu in 0.334573 seconds (PASS)
Starting SETUP for test SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.chicoma-gpu_gnugpu with 1 procs
Finished SETUP for test SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.chicoma-gpu_gnugpu in 7.473550 seconds (PASS)
Starting SHAREDLIB_BUILD for test SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.chicoma-gpu_gnugpu with 1 procs
Finished SHAREDLIB_BUILD for test SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.chicoma-gpu_gnugpu in 104.659378 seconds (PASS)
Starting MODEL_BUILD for test SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.chicoma-gpu_gnugpu with 7 procs
Finished MODEL_BUILD for test SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.chicoma-gpu_gnugpu in 192.617112 seconds (PASS)
Starting RUN for test SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.chicoma-gpu_gnugpu with 1 proc on interactive node and 64 procs on compute nodes
Finished RUN for test SMS.T62_oQU120_ais20.MPAS_LISIO_TEST.chicoma-gpu_gnugpu in 4.229908 seconds (PEND). [COMPLETED 1 of 1]
Waiting for tests to finish

and the run never actually launches.

I also pushed a deletion of grizzly and badger (farewell!).

mark-petersen avatar Mar 03 '24 21:03 mark-petersen

@mark-petersen -- I would have preferred that cleaning up old machines happen separately, since it is a separate issue. But we can change the scope and title of this PR if necessary

jonbob avatar Mar 04 '24 20:03 jonbob

Removed the last two commits, which deleted badger and grizzly. Will put that in a separate PR for clarity. Thanks @jonbob for the suggestion.

mark-petersen avatar Mar 05 '24 16:03 mark-petersen

passes sanity testing on lcrc:

  • ERS_Ld5.T62_oQU120.CMPASO-NYF.chrysalis_intel
  • ERP_Ld3.ne30pg2_r05_IcoswISC30E3r5.WCYCL1850.chrysalis_intel.allactive-pioroot1

merged to next

jonbob avatar Mar 12 '24 19:03 jonbob

merged to master

jonbob avatar Mar 13 '24 18:03 jonbob

Thanks @jonbob! And @vanroekel and @mark-petersen for helping me so much on this branch!

xylar avatar Mar 13 '24 20:03 xylar