
mvapich2-tce: cannot create cq

Open garlick opened this issue 3 years ago • 8 comments

As noted in #4371, this error was encountered on corona (TOSS4):

ƒ(s=2,d=1) [garlick@corona282:mpi-test]$ module list

Currently Loaded Modules:
  1) intel-tce/19.0.4   2) StdEnv (S)   3) mvapich2-tce/2.3.6

ƒ(s=2,d=1) [garlick@corona282:mpi-test]$ flux mini run --env=MV2_ENABLE_AFFINITY=0 -N2 ./hello
[corona282:mpi_rank_0][smpi_load_hwloc_topology] WARNING! Invalid my_local_id: -1, Disabling hwloc topology broadcast
0.260s: job.exception type=exec severity=0 MPI_Abort: Fatal error in MPI_Init: Other MPI error, error stack: MPIR_Init_thread(493)....:  MPID_Init(419)...........: channel initialization failed MPIDI_CH3_Init(550)......:  MPIDI_CH3I_RDMA_init(446):  rdma_iba_hca_init(1636)..: cannot create cq 
0.261s: job.exception type=exec severity=0 MPI_Abort: Fatal error in MPI_Init: Other MPI error, error stack: MPIR_Init_thread(493)....:  MPID_Init(419)...........: channel initialization failed MPIDI_CH3_Init(550)......:  MPIDI_CH3I_RDMA_init(446):  rdma_iba_hca_init(1636)..: cannot create cq 
flux-job: task(s) exited with exit code 1

garlick avatar Jul 29 '22 21:07 garlick

Since this is a failure to initialize the hca, I wonder if it might also be a manifestation of flux-framework/flux-pmix#40, where the ulimit on locked memory prevents MPI from pinning its buffers?

garlick avatar Jul 29 '22 21:07 garlick

Oh, good guess. I'm wondering if you could test this theory by adding

LimitMEMLOCK=unlimited

to /etc/systemd/system/flux.service.d/override.conf and restarting the brokers?

Note: this is untested. The unlimited syntax may not be correct, and I didn't verify the path for overrides.

Edit: for this to work, the memlock limit would have to be inherited from flux-broker to IMP to shell to tasks. I think that would be the case, but again, not tested.

grondo avatar Jul 29 '22 21:07 grondo

That stack trace:

MPIR_Init_thread(493)....:
MPID_Init(419)...........: channel initialization failed
MPIDI_CH3_Init(550)......:
MPIDI_CH3I_RDMA_init(446):
rdma_iba_hca_init(1636)..: cannot create cq

I believe boils down to a failed call to ibv_create_cq(3), although the code seems to be cut and pasted in a few different places in mvapich2, so it's hard to tell. Digging into libibverbs, it's not immediately obvious whether completion queues use locked memory, but it would seem a likely candidate.

I'm hesitant to try an experiment on corona, especially on a Friday, so maybe we can get this set up on fluke next week and try it there.

garlick avatar Jul 29 '22 22:07 garlick

@grondo came up with prlimit(1), which lets you change the limits of another process. We did this on the two brokers in a flux mini alloc -N2 allocation, then confirmed that the limits were inherited by jobs and that this fixes the above problem.

To raise the limit on a broker:

sudo prlimit --memlock=unlimited -p PID

Then verify that the limit is inherited by jobs and that the problem is resolved:

ƒ(s=2,d=1) [garlick@corona282:mpi-test]$ flux mini run -N2 bash -c 'prlimit --memlock -p $(flux getattr broker.pid)'
RESOURCE DESCRIPTION                             SOFT      HARD UNITS
RESOURCE DESCRIPTION                             SOFT      HARD UNITS
MEMLOCK  max locked-in-memory address space unlimited unlimited bytes
MEMLOCK  max locked-in-memory address space unlimited unlimited bytes

ƒ(s=2,d=1) [garlick@corona282:mpi-test]$ flux mini run --env=MV2_ENABLE_AFFINITY=0 -N2 ./hello
[corona282:mpi_rank_0][smpi_load_hwloc_topology] WARNING! Invalid my_local_id: -1, Disabling hwloc topology broadcast
ƒ7Act2xyV: completed MPI_Init in 0.214s.  There are 2 tasks
ƒ7Act2xyV: completed first barrier in 0.000s
ƒ7Act2xyV: completed MPI_Finalize in 0.005s
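
For scripting the same check, here is a minimal sketch that pulls just the soft locked-memory limit out of /proc/PID/limits-style text. It assumes the usual Linux layout of that file (a "Max locked memory" row with soft, hard and units columns); memlock_soft_limit is a hypothetical helper name, not something flux or prlimit provides.

```shell
#!/bin/sh
# Hypothetical helper: read /proc/PID/limits-style text on stdin and print
# the soft "Max locked memory" value (a byte count or "unlimited").
# Assumes the Linux row layout: "Max locked memory  <soft>  <hard>  bytes".
memlock_soft_limit() {
    awk '/^Max locked memory/ { print $4 }'
}

# Example (commented out, needs a running broker):
#   memlock_soft_limit < "/proc/$(flux getattr broker.pid)/limits"
```

After the prlimit change above, this should presumably print unlimited for each broker.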

garlick avatar Jul 29 '22 22:07 garlick

This indicates we could probably work around the issue for now by using the systemd override.conf proposed above.

grondo avatar Jul 30 '22 02:07 grondo

Confirmed, the workaround works, except we discovered the correct setting is

LimitMEMLOCK=unlimited

garlick avatar Aug 02 '22 18:08 garlick

LimitMEMLOCK=unlimited

Do you mean LimitMEMLOCK=infinity?

grondo avatar Aug 02 '22 18:08 grondo

Doh! Yes.

garlick avatar Aug 02 '22 18:08 garlick

This problem is resolved for now with the override.conf workaround in place.

The long-term fix is tracked separately in flux-framework/flux-security#148.
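
For anyone landing here later, a rough sketch of what the drop-in might look like once the corrected value is used. The directory is the one suggested earlier in this thread (it was not verified there), the [Service] section header is standard systemd drop-in syntax, and write_memlock_override is a hypothetical helper name used only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: write a systemd drop-in that lifts the locked-memory
# limit for the flux unit. The default directory is an assumption taken from
# the path suggested in this thread; pass another directory to override it.
write_memlock_override() {
    dir=${1:-/etc/systemd/system/flux.service.d}
    mkdir -p "$dir" || return 1
    # systemd spells "no limit" as "infinity", not "unlimited".
    printf '[Service]\nLimitMEMLOCK=infinity\n' > "$dir/override.conf"
}

# Typical use as root, followed by a reload and a broker restart:
#   write_memlock_override
#   systemctl daemon-reload
#   systemctl restart flux
```

Passing a scratch directory as the first argument lets you inspect the generated file before touching /etc.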

garlick avatar Aug 26 '22 13:08 garlick