flux-core
mvapich2-tce: cannot create cq
As noted in #4371, this error was encountered on corona (TOSS4):
ƒ(s=2,d=1) [garlick@corona282:mpi-test]$ module list
Currently Loaded Modules:
  1) intel-tce/19.0.4   2) StdEnv (S)   3) mvapich2-tce/2.3.6
ƒ(s=2,d=1) [garlick@corona282:mpi-test]$ flux mini run --env=MV2_ENABLE_AFFINITY=0 -N2 ./hello
[corona282:mpi_rank_0][smpi_load_hwloc_topology] WARNING! Invalid my_local_id: -1, Disabling hwloc topology broadcast
0.260s: job.exception type=exec severity=0 MPI_Abort: Fatal error in MPI_Init: Other MPI error, error stack: MPIR_Init_thread(493)....:  MPID_Init(419)...........: channel initialization failed MPIDI_CH3_Init(550)......:  MPIDI_CH3I_RDMA_init(446):  rdma_iba_hca_init(1636)..: cannot create cq 
0.261s: job.exception type=exec severity=0 MPI_Abort: Fatal error in MPI_Init: Other MPI error, error stack: MPIR_Init_thread(493)....:  MPID_Init(419)...........: channel initialization failed MPIDI_CH3_Init(550)......:  MPIDI_CH3I_RDMA_init(446):  rdma_iba_hca_init(1636)..: cannot create cq 
flux-job: task(s) exited with exit code 1
Since this is a failure to initialize the HCA, I wonder if it might also be a manifestation of flux-framework/flux-pmix#40, where the ulimit on locked memory prevents MPI from pinning its buffers.
Oh, good guess. I'm wondering if you could test this theory by adding
LimitMEMLOCK=unlimited
to /etc/systemd/system/flux.service.d/override.conf and restarting the brokers?
Note: untested, the unlimited syntax may not be correct and I didn't verify the path for overrides.
Edit: for this to work, the memlock limit would have to be inherited from flux-broker to IMP to shell to tasks. I think that would be the case, but again, not tested.
That stack trace:
MPIR_Init_thread(493)....:
MPID_Init(419)...........: channel initialization failed
MPIDI_CH3_Init(550)......:
MPIDI_CH3I_RDMA_init(446):
rdma_iba_hca_init(1636)..: cannot create cq
I believe boils down to a failed call to ibv_create_cq(3), although the code seems to be cut and pasted in a few different places in mvapich2, so it's hard to tell. Digging into libibverbs, it's not immediately obvious whether completion queues use locked memory, but it would seem a likely candidate.
I'm hesitant to try an experiment on corona, especially on a Friday, so maybe we can get this set up on fluke next week and try it there.
@grondo came up with prlimit(1), which lets you change limits on another process. We did this on the two brokers in a flux mini alloc -N2 allocation, then confirmed that the limits were inherited by jobs and that this fixes the above problem.
To raise the limit on a broker:
sudo prlimit --memlock=unlimited -p PID
Then verify that the limit is inherited by jobs and that the problem is resolved:
ƒ(s=2,d=1) [garlick@corona282:mpi-test]$ flux mini run -N2 bash -c 'prlimit --memlock -p $(flux getattr broker.pid)'
RESOURCE DESCRIPTION                             SOFT      HARD UNITS
RESOURCE DESCRIPTION                             SOFT      HARD UNITS
MEMLOCK  max locked-in-memory address space unlimited unlimited bytes
MEMLOCK  max locked-in-memory address space unlimited unlimited bytes
ƒ(s=2,d=1) [garlick@corona282:mpi-test]$ flux mini run --env=MV2_ENABLE_AFFINITY=0 -N2 ./hello
[corona282:mpi_rank_0][smpi_load_hwloc_topology] WARNING! Invalid my_local_id: -1, Disabling hwloc topology broadcast
ƒ7Act2xyV: completed MPI_Init in 0.214s.  There are 2 tasks
ƒ7Act2xyV: completed first barrier in 0.000s
ƒ7Act2xyV: completed MPI_Finalize in 0.005s
This indicates we could probably work around the issue for now by using the systemd override.conf proposed above.
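As a lighter-weight check than prlimit, here is a minimal sketch that reports whether the shell it runs in sees an unlimited memlock soft limit. It assumes a shell whose `ulimit` builtin supports `-l` (bash and dash do); the helper name `memlock_ok` is made up for illustration.

```shell
#!/bin/sh
# Report whether the locked-memory (memlock) soft limit of this shell
# is unlimited. `ulimit -l` prints either a number (KiB) or "unlimited".

# memlock_ok VALUE: succeed only when VALUE is the literal "unlimited".
# (Hypothetical helper, not part of flux or mvapich2.)
memlock_ok() {
    [ "$1" = "unlimited" ]
}

limit=$(ulimit -l)
if memlock_ok "$limit"; then
    echo "memlock: unlimited"
else
    echo "memlock: limited to ${limit} KiB, MPI may fail to pin buffers"
fi
```

Saved as a script (say, a hypothetical `check-memlock.sh`), it could presumably be launched the same way as the prlimit check above, with `flux mini run -N2 sh ./check-memlock.sh`, to see what limit the job tasks inherit on each node.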
Confirmed, the workaround works, except we discovered the correct setting is:
LimitMEMLOCK=unlimited
Do you mean LimitMEMLOCK=infinity?
Doh! Yes.
This problem is resolved for now with the override.conf workaround in place.
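For reference, a sketch of what the drop-in would look like with the corrected value. This assumes the override path suggested earlier in the thread (which was flagged there as unverified) and a unit named flux.service; systemd spells an unbounded resource limit as `infinity`, not `unlimited`.

```ini
# /etc/systemd/system/flux.service.d/override.conf
# (path as proposed earlier in this thread; not verified here)
[Service]
LimitMEMLOCK=infinity
```

After editing the file, a `systemctl daemon-reload` followed by a restart of the brokers should be needed before the new limit applies, and jobs then inherit it from flux-broker as confirmed above.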
The long term fix is separately tracked in flux-framework/flux-security#148