bug/jenkins: coll/nonblocking3 timeout

Open hzhou opened this issue 5 years ago • 2 comments

  ---
  Directory: ./coll
  File: nonblocking3
  Num-procs: 10
  Timeout: 180
  Date: "Wed Sep 16 01:20:29 2020"
  ...
## Test output (expected 'No Errors'):
## [[email protected]] APPLICATION TIMED OUT, TIMEOUT = 180s
## 
##   uptime:
##  01:20:29 up 370 days,  9:34,  1 user,  load average: 22.14, 16.17, 12.55

This failure has shown up consistently in the nightly tests since 9/16/2020. PRs that may have introduced the bug: #4786, #4788.

I wasn't able to manually reproduce it.

hzhou, Sep 19 '20 16:09

Nightly tests have been clean for a while. It must have been fixed at some point.

hzhou, May 15 '21 14:05

@sagarth was able to reproduce the failure independently of xpmem: https://github.com/pmodels/mpich/pull/5375#issuecomment-904988673. The relevant information:

The behavior was the same during the timeouts: a few ranks were stuck inside Init_shm_barrier while one rank was running the progress loop.
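To illustrate the general shape of such a hang, here is a minimal sketch using Python threads as stand-ins for ranks. It is not MPICH's `Init_shm_barrier` code. It assumes, for illustration only, that the rank in the progress loop never arrives at the barrier, which the report above does not confirm. The names `simulate_hang`, `waiting_rank`, and `progress_rank` are hypothetical.

```python
import threading


def simulate_hang(num_ranks=4, timeout=0.2):
    """Return the ranks that gave up waiting at the barrier.

    Assumption for illustration: rank 0 stays busy elsewhere (a stand-in
    for a progress loop) and never reaches the barrier.
    """
    barrier = threading.Barrier(num_ranks)
    stuck = []
    lock = threading.Lock()
    stop = threading.Event()

    def waiting_rank(rank):
        try:
            # Blocks until all num_ranks participants arrive or the timeout expires.
            barrier.wait(timeout=timeout)
        except threading.BrokenBarrierError:
            # The barrier was never completed, so record this rank as stuck.
            with lock:
                stuck.append(rank)

    def progress_rank():
        # Stand-in for a rank that keeps polling and never enters the barrier.
        stop.wait()

    waiters = [threading.Thread(target=waiting_rank, args=(r,))
               for r in range(1, num_ranks)]
    poller = threading.Thread(target=progress_rank)

    for t in waiters + [poller]:
        t.start()
    for t in waiters:
        t.join()
    stop.set()  # release the polling thread so the simulation can exit
    poller.join()
    return sorted(stuck)


if __name__ == "__main__":
    print("ranks stuck at barrier:", simulate_hang())
```

In the real test there is no per-barrier timeout, so the waiting ranks would block until the harness limit (180s here) kills the job.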

hzhou, Aug 25 '21 16:08