Jim Garlick
Oof, glad you ran that down!
So sounds like this is still an issue?

> one potential path is in cache_store() and cache_store_continuation(). An entry from the flush list is sent to the content backing module...
Since this is a failure to initialize the hca, I wonder if it might also be a manifestation of flux-framework/flux-pmix#40, where the ulimit on locked memory prevents MPI from pinning...
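A quick way to test the locked-memory theory is to print the limit from inside the job, in the same environment where MPI would run. This is only a minimal sketch using Python's standard `resource` module on Linux; the helper names `memlock_limits` and `format_limit` are made up for illustration and are not part of Flux or any MPI.

```python
import resource


def format_limit(value):
    """Render an rlimit value, mapping RLIM_INFINITY to 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else f"{value} bytes"


def memlock_limits():
    """Return the (soft, hard) locked-memory limits of the calling process."""
    return resource.getrlimit(resource.RLIMIT_MEMLOCK)


if __name__ == "__main__":
    soft, hard = memlock_limits()
    print(f"RLIMIT_MEMLOCK soft={format_limit(soft)} hard={format_limit(hard)}")
```

If the soft limit reported inside the job is small rather than unlimited, that would be consistent with the pinning failure described above, though it does not prove it.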
That stack trace:

```
MPIR_Init_thread(493)....:
MPID_Init(419)...........: channel initialization failed
MPIDI_CH3_Init(550)......:
MPIDI_CH3I_RDMA_init(446):
rdma_iba_hca_init(1636)..: cannot create cq
```

I _believe_ boils down to a failed call to [ibv_create_cq(3)](https://man7.org/linux/man-pages/man3/ibv_create_cq.3.html) although the code seems...
@grondo came up with [prlimit(1)](https://man7.org/linux/man-pages/man1/prlimit.1.html) which lets you change limits on another process. We did this on the two brokers in a `flux mini alloc -N2` allocation, then confirmed that...
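For reference, the same operation that prlimit(1) performs is available from Python as `resource.prlimit()` on Linux. The sketch below is an illustration under assumptions, not what was actually run: the helper names are hypothetical, and raising the hard limit on another process needs `CAP_SYS_RESOURCE` (typically root).

```python
import os
import resource


def get_memlock(pid):
    """Query the (soft, hard) RLIMIT_MEMLOCK of a process without changing it."""
    return resource.prlimit(pid, resource.RLIMIT_MEMLOCK)


def set_memlock_unlimited(pid):
    """Raise RLIMIT_MEMLOCK to unlimited on pid and return the previous limits.

    Raising the hard limit requires CAP_SYS_RESOURCE (typically root).
    """
    unlimited = (resource.RLIM_INFINITY, resource.RLIM_INFINITY)
    return resource.prlimit(pid, resource.RLIMIT_MEMLOCK, unlimited)


if __name__ == "__main__":
    # Query only: show this process's own limits.
    print(get_memlock(os.getpid()))
```

A changed limit is inherited only by processes forked after the change, so it has to be applied to the broker before the MPI job is launched.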
Confirmed, the workaround works, except we discovered that the correct setting is

```
LimitMEMLOCK=unlimited
```
Doh! Yes.
This problem is resolved for now with the `override.conf` workaround in place. The long-term fix is tracked separately in flux-framework/flux-security#148.
There is not really any documentation. What we have is the minimum needed to bootstrap mvapich in a PMI-2 only configuration, and a couple of functions needed by Cray MPI...
This issue is broader than the topic implies, but now that we have nailed down
- rfc33 static limits configuration (global and per-queue) and enforcement at job ingest
-...