
mvapich2-tce: Error parsing CPU mapping string

Open grondo opened this issue 2 years ago • 3 comments

A user was seeing this error from mvapich2-2.3.6:

$ flux mini run -N2 ./hello 
0.627s: job.exception type=exec severity=0 MPI_Abort: Fatal error in MPI_Init: Other MPI error, error stack: MPIR_Init_thread(493)........:  MPID_Init(400)...............:  MPIDI_CH3I_set_affinity(3474):  smpi_setaffinity(2674).......: Error parsing CPU mapping string 
[corona251:mpi_rank_0][smpi_load_hwloc_topology] WARNING! Invalid my_local_id: -1, Disabling hwloc topology broadcast
Error parsing CPU mapping string
INTERNAL ERROR: invalid error code ffffffff (Ring Index out of range) in smpi_setaffinity:2674

It only seems to occur when at least 2 nodes are involved:

$ flux mini run -n2 ./hello 
0: completed MPI_Init in 1.300s.  There are 2 tasks
0: completed first barrier in 0.000s
0: completed MPI_Finalize in 0.002s

I cannot reproduce this issue in my own environment, and I have not been able to determine why.

A workaround is to set MV2_ENABLE_AFFINITY=0 in the job environment:

$ flux mini run --env=MV2_ENABLE_AFFINITY=0 -N2 ./hello 
[corona251:mpi_rank_0][smpi_load_hwloc_topology] WARNING! Invalid my_local_id: -1, Disabling hwloc topology broadcast
0: completed MPI_Init in 2.553s.  There are 2 tasks
0: completed first barrier in 0.000s
0: completed MPI_Finalize in 0.004s
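
For repeated test runs, the workaround could be wrapped in a small shell function. This is only a sketch: run_no_affinity is a hypothetical name, not a Flux command, and it does nothing beyond passing the same --env option shown above. It disables MVAPICH2 affinity and does not address any other initialization failure.

```shell
# Hypothetical helper (not part of flux-core): run a job with
# MVAPICH2 CPU affinity disabled, forwarding all other arguments
# to "flux mini run" unchanged.
run_no_affinity() {
    flux mini run --env=MV2_ENABLE_AFFINITY=0 "$@"
}
```

Usage would then be: run_no_affinity -N2 ./hello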

Just noting this in an issue so it is searchable for the next person who hits it.

grondo avatar Jun 14 '22 22:06 grondo

I've run into this on our Flux system instance of RZAlastor. This is something I could see some of our Flux/ATS users wanting to use for their testing workflows.

As a note to any user who runs into this: another possible workaround is to use the OpenMPI modules instead. Make them available with module use /opt/toss/modules/modulefiles, then load them with module load intel openmpi-intel. Thanks to @ryanday36 for this suggestion on Slack. flux-framework/flux-pmix#40 still shows up, although that is a completely separate issue.
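
As a rough sketch, the OpenMPI workaround could be scripted as below. The module path and module names are copied from this comment and are site-specific; use_openmpi_and_run is a hypothetical helper name, and the program would presumably need to be rebuilt against OpenMPI before running.

```shell
# Hypothetical helper: switch to the OpenMPI modules and launch a job.
# The modulefiles path and module names are site-specific (taken from
# the comment above) and may not exist on other systems.
use_openmpi_and_run() {
    # "module" is normally a shell function provided by Lmod or
    # Environment Modules; bail out if this shell does not have it.
    if ! command -v module >/dev/null 2>&1; then
        echo "module command not available in this shell" >&2
        return 1
    fi
    module use /opt/toss/modules/modulefiles || return 1
    module load intel openmpi-intel || return 1
    # Forward the remaining arguments to flux mini run unchanged.
    flux mini run "$@"
}
```

Usage would then be: use_openmpi_and_run -N2 ./hello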

wihobbs avatar Jul 07 '22 19:07 wihobbs

Confirmed this issue on corona (TOSS 4)

ƒ(s=2,d=1) [garlick@corona282:mpi-test]$ module list

Currently Loaded Modules:
  1) intel-tce/19.0.4   2) StdEnv (S)   3) mvapich2-tce/2.3.6

ƒ(s=2,d=1) [garlick@corona282:mpi-test]$ flux mini run -N2 ./hello
[corona282:mpi_rank_0][smpi_load_hwloc_topology] WARNING! Invalid my_local_id: -1, Disabling hwloc topology broadcast
Error parsing CPU mapping string
INTERNAL ERROR: invalid error code ffffffff (Ring Index out of range) in smpi_setaffinity:2675
Error parsing CPU mapping string
0.182s: job.exception type=exec severity=0 MPI_Abort: Fatal error in MPI_Init: Other MPI error, error stack: MPIR_Init_thread(493)........:  MPID_Init(400)...............:  MPIDI_CH3I_set_affinity(3474):  smpi_setaffinity(2675).......: Error parsing CPU mapping string 
flux-job: task(s) exited with exit code 143

The workaround above gets past the initial problem, but the job then fails here:

ƒ(s=2,d=1) [garlick@corona282:mpi-test]$ flux mini run --env=MV2_ENABLE_AFFINITY=0 -N2 ./hello
[corona282:mpi_rank_0][smpi_load_hwloc_topology] WARNING! Invalid my_local_id: -1, Disabling hwloc topology broadcast
0.260s: job.exception type=exec severity=0 MPI_Abort: Fatal error in MPI_Init: Other MPI error, error stack: MPIR_Init_thread(493)....:  MPID_Init(419)...........: channel initialization failed MPIDI_CH3_Init(550)......:  MPIDI_CH3I_RDMA_init(446):  rdma_iba_hca_init(1636)..: cannot create cq 
0.261s: job.exception type=exec severity=0 MPI_Abort: Fatal error in MPI_Init: Other MPI error, error stack: MPIR_Init_thread(493)....:  MPID_Init(419)...........: channel initialization failed MPIDI_CH3_Init(550)......:  MPIDI_CH3I_RDMA_init(446):  rdma_iba_hca_init(1636)..: cannot create cq 
flux-job: task(s) exited with exit code 1

garlick avatar Jul 29 '22 21:07 garlick

One note: this issue reproduces on corona but not on fluke.

garlick avatar Aug 04 '22 21:08 garlick

This was resolved by configuring mvapich with the option --enable-llnl-site-specific-options (if you can believe it), which, like setting MV2_ENABLE_AFFINITY=0, disables affinity.

garlick avatar Aug 26 '22 13:08 garlick