Hui Zhou

695 comments by Hui Zhou

> how does it interact with MPI thread semantics? Could another thread MPI_Send_enqueue to the GPU context at the same time or does that violate the semantics of the underlying...

I don't recall for v3.4.1, but the embedded libfabric build may hide the libfabric symbols from external users. I believe later versions added Makefile patches to skip building libfabric examples...

The `srun --mpi=pmi2` is working. But it looks like the exchanged address string gets too long to fit within the PMI message limit. I'm not sure where the inconsistency comes from.

What Jiakun pointed out is that the `PMI2_MAX_VALLEN` in `pmi2.h` is likely too big. Historically it has been `1024`. When we exchange addresses and the address length is too big, we segment...
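As a rough illustration of what segmenting an over-long value could look like, here is a minimal sketch. It is not MPICH's actual code: `num_segments` and `copy_segment` are hypothetical helper names, and the per-segment limit is passed in by the caller rather than read from `PMI2_MAX_VALLEN`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: number of segments needed to carry total_len bytes
 * when a single put can hold at most seg_max bytes. An empty value still
 * occupies one (empty) segment. */
static size_t num_segments(size_t total_len, size_t seg_max)
{
    assert(seg_max > 0);
    return total_len == 0 ? 1 : (total_len + seg_max - 1) / seg_max;
}

/* Hypothetical helper: copy segment idx (0-based) of value into out as a
 * NUL-terminated string. out must have room for seg_max + 1 bytes.
 * Returns the length of the copied segment. */
static size_t copy_segment(const char *value, size_t seg_max, size_t idx,
                           char *out)
{
    size_t total = strlen(value);
    size_t start = idx * seg_max;
    assert(start <= total);

    size_t len = total - start;
    if (len > seg_max)
        len = seg_max;          /* every segment but the last is full */

    memcpy(out, value + start, len);
    out[len] = '\0';
    return len;
}
```

Under these assumptions, a 2000-byte address string with 959 bytes available per segment needs three segments. The point is that the per-segment limit has to match what the PMI library can really transmit, otherwise each segment is still too long.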

So I think the right solution is to fix `PMI2_MAX_VALLEN` in `pmi2.h`. The header should be consistent with the library `libpmi2.so`. If we want to add an environment override, it...

@JiakunYan Does the example in https://github.com/pmodels/mpich/issues/6924#issuecomment-2127357799 work on SDSC Expanse?

> FWIW, a simple Slurm+PMI2 example putting a max size value does not hang on the Bebop cluster here at Argonne. There may still be a bug in the segmented...

```
[cli_0]: write_line: message string doesn't end in newline: :cmd=put kvsname=28943582.7 key=-allgather-shm-1-0-seg-1/2 value=20D0539CE2...
```

I suspect that `PMI2_MAX_VALLEN` didn't account for the size of the overhead, i.e. `cmd=put kvsname=28943582.7 key=-allgather-shm-1-0-seg-1/2 value=`....
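To make the overhead concrete, here is a small sketch that measures how much room is left for the value once the command prefix is counted. The 1024-byte line limit (`WIRE_LINE_MAX`) and the helper name `max_value_len` are assumptions for illustration only. The real limit depends on the PMI library in use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Assumed limit for one command line on the wire, including the trailing
 * newline. 1024 is an illustrative guess, not a documented constant. */
#define WIRE_LINE_MAX 1024

/* Hypothetical helper: largest value length such that
 *   "cmd=put kvsname=<kvsname> key=<key> value=<value>\n"
 * still fits in WIRE_LINE_MAX bytes. Returns 0 if the prefix alone
 * already fills the line. */
static size_t max_value_len(const char *kvsname, const char *key)
{
    assert(kvsname != NULL && key != NULL);

    /* snprintf with a zero-size buffer only measures the prefix length */
    int prefix = snprintf(NULL, 0, "cmd=put kvsname=%s key=%s value=",
                          kvsname, key);
    assert(prefix >= 0);

    size_t overhead = (size_t) prefix + 1;   /* +1 for the newline */
    return overhead < WIRE_LINE_MAX ? WIRE_LINE_MAX - overhead : 0;
}
```

For the kvsname and key in the log above, the prefix is 64 bytes, so under the assumed 1024-byte line limit only 959 bytes remain for the value. A value sized to a full 1024 would overflow the line and lose its trailing newline, which would be consistent with the `write_line` error.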

> Here's output from a modified example that puts and then gets the key.

Yeah, this seems to support that Slurm is able to accommodate a 1024 value size -- its internal...

> libpmi.so.0 => /cm/shared/apps/slurm/current/lib64/libpmi.so.0 (0x000015554f378000)

Oh, I just realized @JiakunYan was linking with PMI-1 rather than PMI-2. @raffenet We need to test Slurm's PMI-1.