Non-serial versions of tests using `5x5_amazon` failing `RUN`
Brief summary of bug
mpibind seems to have an issue with `5x5_amazon` resolutions when run with full MPI (i.e. not mpi-serial) since ctsm5.1.dev173. Originally posted at https://github.com/NCAR/mpibind/issues/5.
General bug information
CTSM version you are using: ctsm5.1.dev173
Does this bug cause significantly incorrect results in the model's science? [Yes / No] Run fails so no assessment possible
Details of bug
This was discovered when running the FatesColdSeedDispersal test while generating new fates baselines for the dev173 update. I was also able to replicate this failure using a non-serial MPI version of the hillslope clm-only test. The run fails immediately, producing a cesm.log entry with a note about one of the core selections being invalid (see below). It also produced an mpibind.log, which I hadn't noticed before.
This prompted me to compare dev172 and dev173 runs for non-serial MPI versions of the hillslope test that use 5x5_amazon. The dev172 version passes, but I noticed that the preview_run output is different:
dev172:
MPIRUN (job=case.test):
mpiexec --label --line-buffer -n 5 /glade/derecho/scratch/glemieux/ctsm-tests/tests_mpi-nonserial-check-clm_hillslope/SMS_D_Ld5.5x5_amazon.I1850Clm51Bgc.derecho_gnu.clm-HillslopeC.mpi-nonserial-check-clm_hillslope/bld/cesm.exe >> cesm.log.$LID 2>&1
dev173:
MPIRUN (job=case.test):
mpibind --label --line-buffer -- /glade/derecho/scratch/glemieux/ctsm-tests/tests_mpi-nonserial-check-clm_hillslope-dev173/SMS_D_Ld5.5x5_amazon.I1850Clm51Bgc.derecho_gnu.clm-HillslopeC.mpi-nonserial-check-clm_hillslope-dev173/bld/cesm.exe >> cesm.log.$LID 2>&1
What is odd to me is that mpibind was brought in with dev172 via ccs_config_cesm0.0.92, so why is the mpibind call not used for that tag? Why is it only being invoked with dev173?
Important details of your setup / configuration so we can reproduce the bug
You can view the SRCROOT_GIT_STATUS files for both dev173 and dev172 hillslope runs here, respectively:
/glade/u/home/glemieux/scratch/ctsm-tests/tests_mpi-nonserial-check-clm_hillslope-dev173
/glade/u/home/glemieux/scratch/ctsm-tests/tests_mpi-nonserial-check-clm_hillslope
Important output or errors that show the problem
cesm.log
dec0417.hsn.de.hpc.ucar.edu 4: <65-65> is invalid
dec0417.hsn.de.hpc.ucar.edu 4: libnuma: Warning: cpu argument 65-65 is out of range
dec0417.hsn.de.hpc.ucar.edu 4:
dec0417.hsn.de.hpc.ucar.edu 4: usage: numactl [--all | -a] [--balancing | -b] [--interleave= | -i <nodes>]
dec0417.hsn.de.hpc.ucar.edu 4: [--preferred= | -p <node>] [--physcpubind= | -C <cpus>]
dec0417.hsn.de.hpc.ucar.edu 4: [--cpunodebind= | -N <nodes>] [--membind= | -m <nodes>]
dec0417.hsn.de.hpc.ucar.edu 4: [--localalloc | -l] command args ...
dec0417.hsn.de.hpc.ucar.edu 4: numactl [--show | -s]
dec0417.hsn.de.hpc.ucar.edu 4: numactl [--hardware | -H]
dec0417.hsn.de.hpc.ucar.edu 4: numactl [--length | -L <length>] [--offset | -o <offset>] [--shmmode | -M <shmmode>]
dec0417.hsn.de.hpc.ucar.edu 4: [--strict | -t]
dec0417.hsn.de.hpc.ucar.edu 4: [--shmid | -I <id>] --shm | -S <shmkeyfile>
dec0417.hsn.de.hpc.ucar.edu 4: [--shmid | -I <id>] --file | -f <tmpfsfile>
dec0417.hsn.de.hpc.ucar.edu 4: [--huge | -u] [--touch | -T]
dec0417.hsn.de.hpc.ucar.edu 4: memory policy [--dump | -d] [--dump-nodes | -D]
dec0417.hsn.de.hpc.ucar.edu 4:
dec0417.hsn.de.hpc.ucar.edu 4: memory policy is --interleave | -i, --preferred | -p, --membind | -m, --localalloc | -l
dec0417.hsn.de.hpc.ucar.edu 4: <nodes> is a comma delimited list of node numbers or A-B ranges or all.
dec0417.hsn.de.hpc.ucar.edu 4: Instead of a number a node can also be:
dec0417.hsn.de.hpc.ucar.edu 4: netdev:DEV the node connected to network device DEV
dec0417.hsn.de.hpc.ucar.edu 4: file:PATH the node the block device of path is connected to
dec0417.hsn.de.hpc.ucar.edu 4: ip:HOST the node of the network device host routes through
dec0417.hsn.de.hpc.ucar.edu 4: block:PATH the node of block device path
dec0417.hsn.de.hpc.ucar.edu 4: pci:[seg:]bus:dev[:func] The node of a PCI device
dec0417.hsn.de.hpc.ucar.edu 4: <cpus> is a comma delimited list of cpu numbers or A-B ranges or all
dec0417.hsn.de.hpc.ucar.edu 4: all ranges can be inverted with !
dec0417.hsn.de.hpc.ucar.edu 4: all numbers and ranges can be made cpuset-relative with +
dec0417.hsn.de.hpc.ucar.edu 4: the old --cpubind argument is deprecated.
dec0417.hsn.de.hpc.ucar.edu 4: use --cpunodebind or --physcpubind instead
dec0417.hsn.de.hpc.ucar.edu 4: use --balancing | -b to enable Linux kernel NUMA balancing
dec0417.hsn.de.hpc.ucar.edu 4: for the process if it is supported by kernel
dec0417.hsn.de.hpc.ucar.edu 4: <length> can have g (GB), m (MB) or k (KB) suffixes
dec0417.hsn.de.hpc.ucar.edu 3: <64-64> is invalid
dec0417.hsn.de.hpc.ucar.edu 3: libnuma: Warning: cpu argument 64-64 is out of range
dec0417.hsn.de.hpc.ucar.edu 3:
dec0417.hsn.de.hpc.ucar.edu 3: usage: numactl [--all | -a] [--balancing | -b] [--interleave= | -i <nodes>]
dec0417.hsn.de.hpc.ucar.edu 3: [--preferred= | -p <node>] [--physcpubind= | -C <cpus>]
dec0417.hsn.de.hpc.ucar.edu 3: [--cpunodebind= | -N <nodes>] [--membind= | -m <nodes>]
dec0417.hsn.de.hpc.ucar.edu 3: [--localalloc | -l] command args ...
dec0417.hsn.de.hpc.ucar.edu 3: numactl [--show | -s]
dec0417.hsn.de.hpc.ucar.edu 3: numactl [--hardware | -H]
dec0417.hsn.de.hpc.ucar.edu 3: numactl [--length | -L <length>] [--offset | -o <offset>] [--shmmode | -M <shmmode>]
dec0417.hsn.de.hpc.ucar.edu 3: [--strict | -t]
dec0417.hsn.de.hpc.ucar.edu 3: [--shmid | -I <id>] --shm | -S <shmkeyfile>
dec0417.hsn.de.hpc.ucar.edu 3: [--shmid | -I <id>] --file | -f <tmpfsfile>
dec0417.hsn.de.hpc.ucar.edu 3: [--huge | -u] [--touch | -T]
dec0417.hsn.de.hpc.ucar.edu 3: memory policy [--dump | -d] [--dump-nodes | -D]
dec0417.hsn.de.hpc.ucar.edu 3:
dec0417.hsn.de.hpc.ucar.edu 3: memory policy is --interleave | -i, --preferred | -p, --membind | -m, --localalloc | -l
dec0417.hsn.de.hpc.ucar.edu 3: <nodes> is a comma delimited list of node numbers or A-B ranges or all.
dec0417.hsn.de.hpc.ucar.edu 3: Instead of a number a node can also be:
dec0417.hsn.de.hpc.ucar.edu 3: netdev:DEV the node connected to network device DEV
dec0417.hsn.de.hpc.ucar.edu 3: file:PATH the node the block device of path is connected to
dec0417.hsn.de.hpc.ucar.edu 3: ip:HOST the node of the network device host routes through
dec0417.hsn.de.hpc.ucar.edu 3: block:PATH the node of block device path
dec0417.hsn.de.hpc.ucar.edu 3: pci:[seg:]bus:dev[:func] The node of a PCI device
dec0417.hsn.de.hpc.ucar.edu 3: <cpus> is a comma delimited list of cpu numbers or A-B ranges or all
dec0417.hsn.de.hpc.ucar.edu 3: all ranges can be inverted with !
dec0417.hsn.de.hpc.ucar.edu 3: all numbers and ranges can be made cpuset-relative with +
dec0417.hsn.de.hpc.ucar.edu 3: the old --cpubind argument is deprecated.
dec0417.hsn.de.hpc.ucar.edu 3: use --cpunodebind or --physcpubind instead
dec0417.hsn.de.hpc.ucar.edu 3: use --balancing | -b to enable Linux kernel NUMA balancing
dec0417.hsn.de.hpc.ucar.edu 3: for the process if it is supported by kernel
dec0417.hsn.de.hpc.ucar.edu 3: <length> can have g (GB), m (MB) or k (KB) suffixes
dec0417.hsn.de.hpc.ucar.edu: rank 3 exited with code 1
dec0417.hsn.de.hpc.ucar.edu: rank 0 died from signal 15
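Most of the log above is repeated numactl usage text, so the ranks and CPU arguments that were actually rejected are easy to miss. The following is a minimal sketch for pulling them out of a cesm.log. The function name `out_of_range_bindings` and the regular expression are illustrative assumptions based only on the line format quoted above. They are not part of CIME, mpibind, or numactl.

```python
import re

# Matches lines such as
#   "dec0417.hsn.de.hpc.ucar.edu 4: libnuma: Warning: cpu argument 65-65 is out of range"
# An optional leading listing number (e.g. "2 dec0417...") is tolerated.
# ASSUMPTION: the "<host> <rank>: <message>" layout seen in this issue's log.
WARNING = re.compile(
    r"^(?:\d+\s+)?(?P<host>\S+)\s+(?P<rank>\d+):\s+"
    r"libnuma: Warning: cpu argument (?P<cpus>\S+) is out of range"
)


def out_of_range_bindings(log_text):
    """Return {rank: cpu_argument} for every libnuma out-of-range warning."""
    found = {}
    for line in log_text.splitlines():
        match = WARNING.match(line.strip())
        if match:
            found[int(match.group("rank"))] = match.group("cpus")
    return found


# A few lines copied from the cesm.log excerpt in this issue.
sample = """\
dec0417.hsn.de.hpc.ucar.edu 4: <65-65> is invalid
dec0417.hsn.de.hpc.ucar.edu 4: libnuma: Warning: cpu argument 65-65 is out of range
dec0417.hsn.de.hpc.ucar.edu 3: <64-64> is invalid
dec0417.hsn.de.hpc.ucar.edu 3: libnuma: Warning: cpu argument 64-64 is out of range
dec0417.hsn.de.hpc.ucar.edu: rank 3 exited with code 1
"""

print(out_of_range_bindings(sample))
```

For the excerpt above this reports ranks 3 and 4 with CPU arguments 64-64 and 65-65.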
mpibind.log
Chunk info
1:ncpus=5:mpiprocs=5:ompthreads=1:mem=230GB:Qlist=cpu:ngpus=0
-- -- -- --
MPI exec line:
mpiexec --label --line-buffer -n 5 -ppn 5 --cpu-bind none -env OMP_NUM_THREADS=1 /glade/u/apps/opt/mpitools/mpibind/cpu_bind /glade/derecho/scratch/glemieux/ctsm-tests/tests_mpi-nonserial-check-clm_hillslope-dev173/SMS_D_Ld5.5x5_amazon.I1850Clm51Bgc.derecho_gnu.clm-HillslopeC.mpi-nonserial-check-clm_hillslope-dev173/bld/cesm.exe
-- -- -- --
Binding Report:
rank: 0, cores: 0-0
rank: 1, cores: 1-1
rank: 3, cores: 64-64
rank: 4, cores: 65-65
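To make the mismatch in this report easier to see, here is a rough sketch that compares the "Chunk info" request with the "Binding Report" entries. The helper name `check_binding_report` is hypothetical. The check that core ids should be smaller than `ncpus` is only an assumption for illustration, since the CPU ids actually available to the job depend on how the scheduler sets up the cpuset.

```python
import re


def check_binding_report(log_text):
    """Compare the requested chunk with the reported per-rank core bindings.

    Returns (missing_ranks, beyond_ncpus), where beyond_ncpus maps a rank to
    its (first, last) core range when the last core id is >= ncpus.
    """
    mpiprocs = int(re.search(r"mpiprocs=(\d+)", log_text).group(1))
    ncpus = int(re.search(r"ncpus=(\d+)", log_text).group(1))

    bindings = {
        int(rank): (int(first), int(last))
        for rank, first, last in re.findall(
            r"rank:\s*(\d+),\s*cores:\s*(\d+)-(\d+)", log_text
        )
    }

    # Ranks requested via mpiprocs that have no line in the report.
    missing = [r for r in range(mpiprocs) if r not in bindings]

    # ASSUMPTION (illustrative only): usable CPU ids are 0..ncpus-1.
    beyond = {r: c for r, c in bindings.items() if c[1] >= ncpus}
    return missing, beyond


# Text copied from the mpibind.log excerpt in this issue.
sample = """\
Chunk info
1:ncpus=5:mpiprocs=5:ompthreads=1:mem=230GB:Qlist=cpu:ngpus=0
Binding Report:
rank: 0, cores: 0-0
rank: 1, cores: 1-1
rank: 3, cores: 64-64
rank: 4, cores: 65-65
"""

missing, beyond = check_binding_report(sample)
print("ranks without a binding line:", missing)
print("ranks bound past ncpus:", beyond)
```

On the report as quoted here, rank 2 has no binding line, and the ranks bound to cores 64 and 65 are ranks 3 and 4, the same two ranks that hit the numactl errors in cesm.log.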
@ekluzek given the feedback from https://github.com/NCAR/mpibind/issues/5#issuecomment-1998714383, should I make an issue in the ccs_config_cesm repo?
@glemieux yes go ahead and do that.
During the ctsm stand-up meeting today we came up with the following actions for the time being:
- [x] Add a non-serial `5x5_amazon` test to `aux_clm` on `derecho` and to the expected failure list referencing this issue.
- [x] Temporarily convert the `FatesColdSeedDisp` testmod to run on `f10`
It was also noted that this doesn't seem to be an issue for izumi
@glemieux note this also relates to another problem I ran into:
https://github.com/ESCOMP/CTSM/pull/2427#issuecomment-2016048650
where the new use of mpibind needed me to do something different for mksurfdata_esmf.
The ccs_config issue is here:
https://github.com/ESMCI/ccs_config_cesm/issues/142
> During the ctsm stand-up meeting today we came up with the following actions for the time being:
> - [x] Add a non-serial `5x5_amazon` test to `aux_clm` on `derecho` and to the expected failure list referencing this issue.
> - [x] Temporarily convert the `FatesColdSeedDisp` testmod to run on `f10`
>
> It was also noted that this doesn't seem to be an issue for `izumi`
Completed these action items per #2436.
It seems like the non-serial 5x5_amazon test (SMS_D_Ld5.5x5_amazon.I1850Clm60Bgc.derecho_gnu.clm-HillslopeC) is now passing as of ctsm5.2.027. Should this issue be closed and that test removed from the expected failure list?