Jim Garlick

Results 344 comments of Jim Garlick

I think if we create a `flux_get_process_scope()` API call, we should be sure it returns something sensible no matter where it is used. Looking over the current PR it would...

Two data points to add:
- in my test, flux-coral2 was an older version (and it was not an el cap system)
- the same thing happened on two different...

That certainly sounds like a good lead! I'm not sure it's relevant but on the systems in question (rzvernal and tioga) I believe we currently are not configuring the systems...

> It's possible the JGF reader bug contributed, too, if Flux was using JGF in the scheduling R key.

For the record, the scheduling key was not populated.

The STAT graph is as follows (courtesy @lee218llnl)

![Screenshot 2023-09-19 130905](https://github.com/flux-framework/flux-core/assets/169947/674ee3d6-71d4-4c1c-9964-4e896d3c2e81)

[ompi source ref](https://github.com/open-mpi/ompi/blob/34123c3b15b03209891d5e55bee9ee07baecbdca/opal/mca/common/ucx/common_ucx.c#L468) (courtesy Tom)

Hmm I seem to be able to reproduce this pretty consistently with a 3 node run of a simple MPI hello program. ``` [garlick@corona206:mpi]$ module list Currently Loaded Modules: 1)...

pmix trace of the same 3 node hello run but with `-o pmi=pmix` which works for some reason. Maybe there's some clue to be found here. ``` [garlick@corona206:mpi]$ flux run...

And the failing one with simple pmi ``` [garlick@corona206:mpi]$ flux run --label-io -o verbose=2 -N3 ./hello 0.066s: flux-shell[2]: DEBUG: Loading /etc/flux/shell/initrc.lua 0.066s: flux-shell[2]: TRACE: Successfully loaded flux.shell module 0.066s: flux-shell[2]:...

Ah one red flag. The values being retrieved from the PMI KVS are exactly 1024 bytes in length. In simple PMI we have
```c
#define SIMPLE_KVS_VAL_MAX 1024
```
~~Looking back...

Eh except I must've shoved too much into `wc` when I got 1053. The puts are actually logged as 1024 also. There might still be truncation elsewhere though as that's...
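To illustrate why a value that lands exactly on the limit is worth a second look, here is a minimal sketch of how a fixed-size value buffer can silently cut a string short. It is not taken from flux-core: `copy_value()` and `looks_truncated()` are hypothetical helpers, and the sketch assumes the `SIMPLE_KVS_VAL_MAX` bytes include the NUL terminator, which may not match how simple PMI actually counts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Same limit as the simple PMI define quoted above. */
#define SIMPLE_KVS_VAL_MAX 1024

/* Hypothetical helper, not part of flux-core: copy src into a fixed-size
 * buffer of SIMPLE_KVS_VAL_MAX bytes (terminator included, by assumption).
 * Returns true if src fit, false if it was cut short.
 */
static bool copy_value (char dst[SIMPLE_KVS_VAL_MAX], const char *src)
{
    assert (dst != NULL && src != NULL);
    /* snprintf returns the length src would have needed, so a result
     * at or above the buffer size means the copy was truncated.
     */
    int n = snprintf (dst, SIMPLE_KVS_VAL_MAX, "%s", src);
    return n >= 0 && n < SIMPLE_KVS_VAL_MAX;
}

/* Hypothetical heuristic: a value that completely fills the buffer may
 * have been longer at its source, so flag it for inspection.
 */
static bool looks_truncated (const char *val)
{
    assert (val != NULL);
    return strlen (val) >= (size_t)(SIMPLE_KVS_VAL_MAX - 1);
}
```

Checking the return value of `copy_value()` at the put side, and `looks_truncated()` at the get side, would distinguish a value that merely happens to be the maximum length from one that lost data along the way.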