Non-serial versions of tests using `5x5_amazon` failing `RUN`

Open glemieux opened this issue 1 year ago • 6 comments

Brief summary of bug

Since ctsm5.1.dev173, mpibind appears to fail for tests at the 5x5_amazon resolution when they are run with full MPI (i.e., not MPI-serial). Originally posted at https://github.com/NCAR/mpibind/issues/5.

General bug information

CTSM version you are using: ctsm5.1.dev173

Does this bug cause significantly incorrect results in the model's science? [Yes / No] Run fails so no assessment possible

Details of bug

This was discovered when running the FatesColdSeedDispersal test while generating new fates baselines for the dev173 update. I was also able to replicate this failure using a non-serial MPI version of the hillslope clm-only test. The run fails immediately, producing cesm.log entries reporting that one of the core selections is invalid (see below). It also produced an mpibind.log, which I hadn't noticed before.

This prompted me to compare dev172 and dev173 runs for non-serial MPI versions of the hillslope test that use 5x5_amazon. The dev172 version passes, but I noticed that the preview_run output is different:

dev172:

    MPIRUN (job=case.test):
      mpiexec  --label  --line-buffer  -n 5 /glade/derecho/scratch/glemieux/ctsm-tests/tests_mpi-nonserial-check-clm_hillslope/SMS_D_Ld5.5x5_amazon.I1850Clm51Bgc.derecho_gnu.clm-HillslopeC.mpi-nonserial-check-clm_hillslope/bld/cesm.exe   >> cesm.log.$LID 2>&1 

dev173:

    MPIRUN (job=case.test):
      mpibind  --label  --line-buffer  --  /glade/derecho/scratch/glemieux/ctsm-tests/tests_mpi-nonserial-check-clm_hillslope-dev173/SMS_D_Ld5.5x5_amazon.I1850Clm51Bgc.derecho_gnu.clm-HillslopeC.mpi-nonserial-check-clm_hillslope-dev173/bld/cesm.exe   >> cesm.log.$LID 2>&1 
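
A minimal sketch for spotting this difference automatically, assuming the preview_run output has the shape shown above (an `MPIRUN` header followed by the command line). The function name `mpirun_launcher` and the shortened `/path/to/bld/cesm.exe` path are hypothetical stand-ins, not part of CIME:

```python
def mpirun_launcher(preview_text):
    """Return the first token of the command that follows an 'MPIRUN' header.

    Returns None if no MPIRUN header (or no command after it) is found.
    """
    lines = preview_text.splitlines()
    for i, line in enumerate(lines):
        if line.strip().startswith("MPIRUN"):
            # The launcher is the first word of the next non-blank line.
            for cmd in lines[i + 1:]:
                if cmd.strip():
                    return cmd.split()[0]
    return None


# Shortened copies of the two outputs above (paths are placeholders).
dev172 = """MPIRUN (job=case.test):
  mpiexec  --label  --line-buffer  -n 5 /path/to/bld/cesm.exe   >> cesm.log.$LID 2>&1
"""
dev173 = """MPIRUN (job=case.test):
  mpibind  --label  --line-buffer  --  /path/to/bld/cesm.exe   >> cesm.log.$LID 2>&1
"""

print(mpirun_launcher(dev172), mpirun_launcher(dev173))  # prints: mpiexec mpibind
```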

What is odd to me is that mpibind was brought in with dev172 via ccs_config_cesm0.0.92, so why is the call not activated for that tag? Why is it only invoked starting with dev173?

Important details of your setup / configuration so we can reproduce the bug

You can view the SRCROOT_GIT_STATUS files for the dev173 and dev172 hillslope runs, respectively, here:

    /glade/u/home/glemieux/scratch/ctsm-tests/tests_mpi-nonserial-check-clm_hillslope-dev173
    /glade/u/home/glemieux/scratch/ctsm-tests/tests_mpi-nonserial-check-clm_hillslope

Important output or errors that show the problem

cesm.log

dec0417.hsn.de.hpc.ucar.edu 4: <65-65> is invalid
dec0417.hsn.de.hpc.ucar.edu 4: libnuma: Warning: cpu argument 65-65 is out of range
dec0417.hsn.de.hpc.ucar.edu 4:
dec0417.hsn.de.hpc.ucar.edu 4: usage: numactl [--all | -a] [--balancing | -b] [--interleave= | -i <nodes>]
dec0417.hsn.de.hpc.ucar.edu 4:                [--preferred= | -p <node>] [--physcpubind= | -C <cpus>]
dec0417.hsn.de.hpc.ucar.edu 4:                [--cpunodebind= | -N <nodes>] [--membind= | -m <nodes>]
dec0417.hsn.de.hpc.ucar.edu 4:                [--localalloc | -l] command args ...
dec0417.hsn.de.hpc.ucar.edu 4:        numactl [--show | -s]
dec0417.hsn.de.hpc.ucar.edu 4:        numactl [--hardware | -H]
dec0417.hsn.de.hpc.ucar.edu 4:        numactl [--length | -L <length>] [--offset | -o <offset>] [--shmmode | -M <shmmode>]
dec0417.hsn.de.hpc.ucar.edu 4:                [--strict | -t]
dec0417.hsn.de.hpc.ucar.edu 4:                [--shmid | -I <id>] --shm | -S <shmkeyfile>
dec0417.hsn.de.hpc.ucar.edu 4:                [--shmid | -I <id>] --file | -f <tmpfsfile>
dec0417.hsn.de.hpc.ucar.edu 4:                [--huge | -u] [--touch | -T]
dec0417.hsn.de.hpc.ucar.edu 4:                memory policy [--dump | -d] [--dump-nodes | -D]
dec0417.hsn.de.hpc.ucar.edu 4:
dec0417.hsn.de.hpc.ucar.edu 4: memory policy is --interleave | -i, --preferred | -p, --membind | -m, --localalloc | -l
dec0417.hsn.de.hpc.ucar.edu 4: <nodes> is a comma delimited list of node numbers or A-B ranges or all.
dec0417.hsn.de.hpc.ucar.edu 4: Instead of a number a node can also be:
dec0417.hsn.de.hpc.ucar.edu 4:   netdev:DEV the node connected to network device DEV
dec0417.hsn.de.hpc.ucar.edu 4:   file:PATH  the node the block device of path is connected to
dec0417.hsn.de.hpc.ucar.edu 4:   ip:HOST    the node of the network device host routes through
dec0417.hsn.de.hpc.ucar.edu 4:   block:PATH the node of block device path
dec0417.hsn.de.hpc.ucar.edu 4:   pci:[seg:]bus:dev[:func] The node of a PCI device
dec0417.hsn.de.hpc.ucar.edu 4: <cpus> is a comma delimited list of cpu numbers or A-B ranges or all
dec0417.hsn.de.hpc.ucar.edu 4: all ranges can be inverted with !
dec0417.hsn.de.hpc.ucar.edu 4: all numbers and ranges can be made cpuset-relative with +
dec0417.hsn.de.hpc.ucar.edu 4: the old --cpubind argument is deprecated.
dec0417.hsn.de.hpc.ucar.edu 4: use --cpunodebind or --physcpubind instead
dec0417.hsn.de.hpc.ucar.edu 4: use --balancing | -b to enable Linux kernel NUMA balancing
dec0417.hsn.de.hpc.ucar.edu 4: for the process if it is supported by kernel
dec0417.hsn.de.hpc.ucar.edu 4: <length> can have g (GB), m (MB) or k (KB) suffixes
dec0417.hsn.de.hpc.ucar.edu 3: <64-64> is invalid
dec0417.hsn.de.hpc.ucar.edu 3: libnuma: Warning: cpu argument 64-64 is out of range
dec0417.hsn.de.hpc.ucar.edu 3:
dec0417.hsn.de.hpc.ucar.edu 3: usage: numactl [--all | -a] [--balancing | -b] [--interleave= | -i <nodes>]
dec0417.hsn.de.hpc.ucar.edu 3:                [--preferred= | -p <node>] [--physcpubind= | -C <cpus>]
dec0417.hsn.de.hpc.ucar.edu 3:                [--cpunodebind= | -N <nodes>] [--membind= | -m <nodes>]
dec0417.hsn.de.hpc.ucar.edu 3:                [--localalloc | -l] command args ...
dec0417.hsn.de.hpc.ucar.edu 3:        numactl [--show | -s]
dec0417.hsn.de.hpc.ucar.edu 3:        numactl [--hardware | -H]
dec0417.hsn.de.hpc.ucar.edu 3:        numactl [--length | -L <length>] [--offset | -o <offset>] [--shmmode | -M <shmmode>]
dec0417.hsn.de.hpc.ucar.edu 3:                [--strict | -t]
dec0417.hsn.de.hpc.ucar.edu 3:                [--shmid | -I <id>] --shm | -S <shmkeyfile>
dec0417.hsn.de.hpc.ucar.edu 3:                [--shmid | -I <id>] --file | -f <tmpfsfile>
dec0417.hsn.de.hpc.ucar.edu 3:                [--huge | -u] [--touch | -T]
dec0417.hsn.de.hpc.ucar.edu 3:                memory policy [--dump | -d] [--dump-nodes | -D]
dec0417.hsn.de.hpc.ucar.edu 3:
dec0417.hsn.de.hpc.ucar.edu 3: memory policy is --interleave | -i, --preferred | -p, --membind | -m, --localalloc | -l
dec0417.hsn.de.hpc.ucar.edu 3: <nodes> is a comma delimited list of node numbers or A-B ranges or all.
dec0417.hsn.de.hpc.ucar.edu 3: Instead of a number a node can also be:
dec0417.hsn.de.hpc.ucar.edu 3:   netdev:DEV the node connected to network device DEV
dec0417.hsn.de.hpc.ucar.edu 3:   file:PATH  the node the block device of path is connected to
dec0417.hsn.de.hpc.ucar.edu 3:   ip:HOST    the node of the network device host routes through
dec0417.hsn.de.hpc.ucar.edu 3:   block:PATH the node of block device path
dec0417.hsn.de.hpc.ucar.edu 3:   pci:[seg:]bus:dev[:func] The node of a PCI device
dec0417.hsn.de.hpc.ucar.edu 3: <cpus> is a comma delimited list of cpu numbers or A-B ranges or all
dec0417.hsn.de.hpc.ucar.edu 3: all ranges can be inverted with !
dec0417.hsn.de.hpc.ucar.edu 3: all numbers and ranges can be made cpuset-relative with +
dec0417.hsn.de.hpc.ucar.edu 3: the old --cpubind argument is deprecated.
dec0417.hsn.de.hpc.ucar.edu 3: use --cpunodebind or --physcpubind instead
dec0417.hsn.de.hpc.ucar.edu 3: use --balancing | -b to enable Linux kernel NUMA balancing
dec0417.hsn.de.hpc.ucar.edu 3: for the process if it is supported by kernel
dec0417.hsn.de.hpc.ucar.edu 3: <length> can have g (GB), m (MB) or k (KB) suffixes
dec0417.hsn.de.hpc.ucar.edu: rank 3 exited with code 1
dec0417.hsn.de.hpc.ucar.edu: rank 0 died from signal 15
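
Most of this log is the numactl usage text repeated for each failing rank. As a rough sketch, the relevant libnuma warnings can be pulled out per rank like this, assuming the `host rank: message` layout produced by `--label` as seen above. The function name `out_of_range_by_rank` is hypothetical, and the regex tolerates an optional leading line number:

```python
import re

# Matches e.g. "dec0417.hsn.de.hpc.ucar.edu 4: libnuma: Warning: cpu argument 65-65 is out of range",
# with or without a leading line number copied from an editor.
WARNING = re.compile(
    r"^\s*(?:\d+\s+)?(?P<host>\S+)\s+(?P<rank>\d+):\s+"
    r"libnuma: Warning: cpu argument (?P<first>\d+)-(?P<last>\d+) is out of range"
)


def out_of_range_by_rank(log_text):
    """Map MPI rank -> (first_cpu, last_cpu) for each libnuma out-of-range warning."""
    found = {}
    for line in log_text.splitlines():
        match = WARNING.match(line)
        if match:
            found[int(match["rank"])] = (int(match["first"]), int(match["last"]))
    return found


# A few lines taken from the cesm.log excerpt above.
SAMPLE = """\
  1 dec0417.hsn.de.hpc.ucar.edu 4: <65-65> is invalid
  2 dec0417.hsn.de.hpc.ucar.edu 4: libnuma: Warning: cpu argument 65-65 is out of range
dec0417.hsn.de.hpc.ucar.edu 3: libnuma: Warning: cpu argument 64-64 is out of range
dec0417.hsn.de.hpc.ucar.edu: rank 3 exited with code 1
"""

print(out_of_range_by_rank(SAMPLE))  # prints: {4: (65, 65), 3: (64, 64)}
```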

mpibind.log

Chunk info
  1:ncpus=5:mpiprocs=5:ompthreads=1:mem=230GB:Qlist=cpu:ngpus=0
-- -- -- --
MPI exec line:
  mpiexec --label --line-buffer -n 5 -ppn 5 --cpu-bind none -env OMP_NUM_THREADS=1 /glade/u/apps/opt/mpitools/mpibind/cpu_bind /glade/derecho/scratch/glemieux/ctsm-tests/tests_mpi-nonserial-check-clm_hillslope-dev173/SMS_D_Ld5.5x5_amazon.I1850Clm51Bgc.derecho_gnu.clm-HillslopeC.mpi-nonserial-check-clm_hillslope-dev173/bld/cesm.exe 
-- -- -- --
Binding Report:
rank: 0, cores: 0-0
rank: 1, cores: 1-1
rank: 3, cores: 64-64
rank: 4, cores: 65-65
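
A small sketch of how the binding report could be checked against the chunk request (`mpiprocs=5`), assuming the `rank: N, cores: A-B` format shown above. The set of usable CPU IDs is an assumption here: the issue does not say which CPUs were actually available to the job, so `allowed_cpus = {0, 1, 2, 3, 4}` is only a hypothetical illustration, and the function names are made up for this example:

```python
import re

BINDING = re.compile(r"rank:\s*(\d+),\s*cores:\s*(\d+)-(\d+)")


def parse_binding_report(text):
    """Map rank -> range of core IDs from lines like 'rank: 3, cores: 64-64'."""
    return {
        int(rank): range(int(first), int(last) + 1)
        for rank, first, last in BINDING.findall(text)
    }


def check_bindings(bindings, nranks, allowed_cpus):
    """Return (ranks with no binding, {rank: cores outside allowed_cpus})."""
    missing = [rank for rank in range(nranks) if rank not in bindings]
    outside = {}
    for rank, cores in bindings.items():
        bad = [core for core in cores if core not in allowed_cpus]
        if bad:
            outside[rank] = bad
    return missing, outside


REPORT = """\
rank: 0, cores: 0-0
rank: 1, cores: 1-1
rank: 3, cores: 64-64
rank: 4, cores: 65-65
"""

# Hypothetical: pretend only CPU IDs 0-4 are usable by this 5-process job.
allowed_cpus = set(range(5))
missing, outside = check_bindings(parse_binding_report(REPORT), 5, allowed_cpus)
print(missing, outside)  # prints: [2] {3: [64], 4: [65]}
```

Under that assumption, the check flags the same two ranks (3 and 4) that numactl rejects in cesm.log, and it also shows that the report as pasted here has no line for rank 2.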

glemieux • Mar 14 '24 23:03

@ekluzek given the feedback from https://github.com/NCAR/mpibind/issues/5#issuecomment-1998714383, should I make an issue in the ccs_config_cesm repo?

glemieux • Mar 15 '24 19:03

@glemieux yes go ahead and do that.

ekluzek • Mar 15 '24 20:03

During the ctsm stand-up meeting today we came up with the following actions for the time being:

  • [x] Add a non-serial 5x5_amazon test to aux_clm on derecho and to the expected failure list referencing this issue.
  • [x] Temporarily convert the FatesColdSeedDisp testmod to run on f10

It was also noted that this doesn't seem to be an issue for izumi.

glemieux • Mar 18 '24 19:03

@glemieux note this also relates to another problem I ran into:

https://github.com/ESCOMP/CTSM/pull/2427#issuecomment-2016048650

where the new use of mpibind needed me to do something different for mksurfdata_esmf.

ekluzek • Mar 22 '24 22:03

The ccs_config issue is here:

https://github.com/ESMCI/ccs_config_cesm/issues/142

ekluzek • Mar 22 '24 23:03

> During the ctsm stand-up meeting today we came up with the following actions for the time being:
>
>   • [x] Add a non-serial 5x5_amazon test to aux_clm on derecho and to the expected failure list referencing this issue.
>   • [x] Temporarily convert the FatesColdSeedDisp testmod to run on f10
>
> It was also noted that this doesn't seem to be an issue for izumi

Completed these action items per #2436.

glemieux • Mar 25 '24 17:03

It seems like the non-serial 5x5_amazon test (SMS_D_Ld5.5x5_amazon.I1850Clm60Bgc.derecho_gnu.clm-HillslopeC) is now passing as of ctsm5.2.027. Should this issue be closed and that test removed from the expected failure list?

samsrabin • Sep 19 '24 22:09