Jim Garlick

263 comments by Jim Garlick

So sounds like this is still an issue?

> one potential path is in `cache_store()` and `cache_store_continuation()`. An entry from the flush list is sent to the content backing module...

Since this is a failure to initialize the HCA, I wonder if it might also be a manifestation of flux-framework/flux-pmix#40, where the ulimit on locked memory prevents MPI from pinning...
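A quick way to see which locked-memory limit a process actually inherited is to query `RLIMIT_MEMLOCK` from inside it. The following is only a minimal sketch using Python's standard `resource` module; the helper names `memlock_limit` and `describe` are made up for illustration and are not part of Flux or any MPI library.

```python
import resource


def memlock_limit():
    """Return the (soft, hard) RLIMIT_MEMLOCK values of the current process."""
    return resource.getrlimit(resource.RLIMIT_MEMLOCK)


def describe(value):
    """Render a limit value, mapping RLIM_INFINITY to 'unlimited'."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value} bytes"


if __name__ == "__main__":
    soft, hard = memlock_limit()
    print(f"RLIMIT_MEMLOCK soft={describe(soft)} hard={describe(hard)}")
```

Running this as a job task (rather than in a login shell) shows the limit that the job, and therefore MPI, would see.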

That stack trace:

```
MPIR_Init_thread(493)....:
MPID_Init(419)...........: channel initialization failed
MPIDI_CH3_Init(550)......:
MPIDI_CH3I_RDMA_init(446):
rdma_iba_hca_init(1636)..: cannot create cq
```

I _believe_ boils down to a failed call to [ibv_create_cq(3)](https://man7.org/linux/man-pages/man3/ibv_create_cq.3.html) although the code seems...

@grondo came up with [prlimit(1)](https://man7.org/linux/man-pages/man1/prlimit.1.html) which lets you change limits on another process. We did this on the two brokers in a `flux mini alloc -N2` allocation, then confirmed that...
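For reference, the same operation that prlimit(1) performs is exposed in Python as `resource.prlimit()` (Linux only). The sketch below is an illustration, not what was run on the brokers; `get_memlock` and `set_memlock` are hypothetical helper names. Raising the hard limit of another process generally requires `CAP_SYS_RESOURCE`, so the unprivileged demo at the bottom only re-applies the values the current process already has.

```python
import os
import resource


def get_memlock(pid):
    """Return the (soft, hard) RLIMIT_MEMLOCK values of process `pid`."""
    return resource.prlimit(pid, resource.RLIMIT_MEMLOCK)


def set_memlock(pid, soft, hard):
    """Set RLIMIT_MEMLOCK on process `pid` and return the resulting limits.

    Pass resource.RLIM_INFINITY for "unlimited". Raising the hard limit
    needs CAP_SYS_RESOURCE; otherwise the call raises an exception.
    """
    resource.prlimit(pid, resource.RLIMIT_MEMLOCK, (soft, hard))
    return get_memlock(pid)


if __name__ == "__main__":
    # Demo on our own pid: re-apply the existing limits (no privilege needed).
    pid = os.getpid()
    soft, hard = get_memlock(pid)
    print(set_memlock(pid, soft, hard))
```

To target a broker instead, one would pass its pid and `resource.RLIM_INFINITY` for both values, with sufficient privilege.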

Confirmed, the workaround works, except we discovered the correct setting is

```
LimitMEMLOCK=unlimited
```
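One way to confirm that the setting took effect is to read the "Max locked memory" row of `/proc/<pid>/limits` for the running process. This is a rough sketch under that assumption; the function name `locked_memory_limits` and the `SAMPLE` text are illustrative only, and the sample is not output captured from a real system.

```python
import re

# Illustrative excerpt in the layout of /proc/<pid>/limits (not real output).
SAMPLE = """\
Limit                     Soft Limit           Hard Limit           Units
Max locked memory         unlimited            unlimited            bytes
Max open files            1024                 4096                 files
"""


def locked_memory_limits(limits_text):
    """Return (soft, hard) strings from the 'Max locked memory' row."""
    for line in limits_text.splitlines():
        if line.startswith("Max locked memory"):
            # Columns are separated by runs of two or more spaces.
            fields = re.split(r"\s{2,}", line.strip())
            return fields[1], fields[2]
    raise ValueError("no 'Max locked memory' line found")


if __name__ == "__main__":
    print(locked_memory_limits(SAMPLE))
```

On a live system the text would come from something like `open(f"/proc/{pid}/limits").read()`, where `pid` is the process of interest.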

This problem is resolved for now with the `override.conf` workaround in place. The long-term fix is tracked separately in flux-framework/flux-security#148.

There is not really any documentation. What we have is the minimum needed to bootstrap mvapich in a PMI-2 only configuration, and a couple of functions needed by Cray MPI...

This issue is broader than the topic implies, but now that we have nailed down

- rfc33 static limits configuration (global and per-queue) and enforcement at job ingest
- ...