mpich
bug/jenkins: coll/nonblocking3 timeout
---
Directory: ./coll
File: nonblocking3
Num-procs: 10
Timeout: 180
Date: "Wed Sep 16 01:20:29 2020"
...
## Test output (expected 'No Errors'):
## [[email protected]] APPLICATION TIMED OUT, TIMEOUT = 180s
##
## uptime:
## 01:20:29 up 370 days, 9:34, 1 user, load average: 22.14, 16.17, 12.55
It has shown up consistently in the nightly tests since 9/16/2020. PRs that may have introduced the bug: #4786, #4788.
I wasn't able to manually reproduce it.
Nightly tests have been clean for a while. It must have been fixed at some point.
@sagarth was able to reproduce the failure independently of xpmem: https://github.com/pmodels/mpich/pull/5375#issuecomment-904988673. The relevant information:
The behavior was the same during the timeouts: a few ranks were stuck inside Init_shm_barrier while one rank was spinning in the progress loop.
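Since the failure is intermittent, a stress loop may help anyone trying to catch it locally. Below is a minimal sketch, not the way the nightly harness runs the test. `run_until_failure` is a hypothetical helper name, and the `mpiexec -n 10 ./nonblocking3` invocation is an assumption based on the Num-procs and Timeout values in the report. It relies on `timeout` from GNU coreutils and only looks at the exit status, whereas the real harness also compares the output against 'No Errors'.

```shell
#!/bin/sh
# Hypothetical helper: rerun a command until it fails or times out.
# usage: run_until_failure MAX_RUNS TIMEOUT_SECS CMD [ARGS...]
run_until_failure() {
    max_runs=$1
    limit=$2
    shift 2
    i=1
    while [ "$i" -le "$max_runs" ]; do
        # timeout(1) from GNU coreutils exits with status 124 when the limit is hit
        timeout "$limit" "$@" >/dev/null 2>&1
        status=$?
        if [ "$status" -ne 0 ]; then
            echo "run $i failed with status $status"
            return 1
        fi
        i=$((i + 1))
    done
    echo "all $max_runs runs passed"
    return 0
}

# Assumed invocation, mirroring the report (10 processes, 180 s limit):
# run_until_failure 100 180 mpiexec -n 10 ./nonblocking3
```

A status of 124 means the run hit the time limit, which is the point at which attaching a debugger to the stuck ranks would show whether they are sitting in Init_shm_barrier.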