Hui Zhou
Not really. `mpl` is MPICH's utility lib. It abstracts away some of the platform and compiler dependencies. But yeah, it is quite involved. I estimate it's about a week's effort....
The code that aggregates the exit code is here: https://github.com/pmodels/mpich/blob/4f8e4a968012d6e59e22b606a76f613d61a7a91c/src/pm/util/process.c#L207-L209 If you can reproduce it locally, I would try adding a debug printf there to see if anything is unusual. PS:...
I see. Thanks for the simple reproducer! I can make gforker return the same as `hydra`. Essentially, in https://github.com/pmodels/mpich/blob/4f8e4a968012d6e59e22b606a76f613d61a7a91c/src/pm/util/process.c#L407-L427 at line 414, we should just add `rc = prog_stat;`.
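For readers following the exit-code discussion, here is a minimal sketch of how a launcher might collect per-process wait statuses and fold them into one return code. It is not the MPICH code: `status_to_rc` and `run_and_aggregate` are hypothetical names, and both the "largest code wins" policy and the `128 + signal` mapping are assumptions borrowed from common shell conventions.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Map a raw wait status to a single exit code.
 * Assumed convention: normal exit -> its exit code,
 * killed by a signal -> 128 + signal number. */
static int status_to_rc(int prog_stat)
{
    if (WIFEXITED(prog_stat))
        return WEXITSTATUS(prog_stat);
    if (WIFSIGNALED(prog_stat))
        return 128 + WTERMSIG(prog_stat);
    return 1; /* stopped/continued or unknown: report a generic failure */
}

/* Hypothetical helper: fork n children that exit with codes[i],
 * reap each of them, and return the largest per-child code. */
static int run_and_aggregate(const int *codes, int n)
{
    pid_t *pids = malloc((size_t) n * sizeof(*pids));
    int rc = 0;

    if (pids == NULL)
        return 1;

    for (int i = 0; i < n; i++) {
        pids[i] = fork();
        if (pids[i] == 0)
            _exit(codes[i]);   /* child: exit immediately with its code */
        if (pids[i] < 0)
            rc = 1;            /* fork failed: make sure we report failure */
    }

    for (int i = 0; i < n; i++) {
        int prog_stat;
        if (pids[i] <= 0)
            continue;          /* nothing to reap for a failed fork */
        if (waitpid(pids[i], &prog_stat, 0) == pids[i]) {
            int child_rc = status_to_rc(prog_stat);
            if (child_rc > rc)
                rc = child_rc; /* aggregate: keep the largest code seen */
        }
    }

    free(pids);
    return rc;
}
```

Calling `run_and_aggregate` with the codes `{0, 3, 0}` should yield 3 under this policy, so a single failing rank is enough to make the launcher report failure.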
Double-check whether the bug still persists after turning on `FI_HMEM` (`MPICH_CVAR_CH4_OFI_HMEM_ENABLE=1`).
> It is still hanging with `mpich/opt/develop-git.6037a7a` (aurora_test branch) and `MPICH_CVAR_CH4_OFI_HMEM_ENABLE=1`. Is it worth trying main?

Then this may not even be related to the pipelining path. Does the app...
Thanks @jcosborn for the confirmation. Is the app using non-contiguous datatypes? The non-contig path was not enabled for pipelining in the early versions. We enabled the path since we didn't...
> Uncommenting the buffer size line, or swapping the 'geom' with the commented out one will make it run to completion.

That makes the pipelining chunks bigger, resulting in less...
The original pipeline algorithm will be replaced in https://github.com/pmodels/mpich/pull/7529.
@colleeneb `MPICH_CVAR_CH4_OFI_HMEM_ENABLE` is off, right?
Can we get the long form of the error message, e.g., the MPICH error stack?