coll: replace MPIR_ERR_COLL_CHECKANDCONT

Open hzhou opened this issue 1 year ago • 1 comments

Pull Request Description

Replace MPIR_ERR_COLL_CHECKANDCONT with MPIR_ERR_CHECK. Propagating errors in collectives does not work because of the complexity of collective algorithms. For example, the error condition is not guaranteed to reach all processes. In addition, when a random hardware issue prevents a message from being delivered, trying to propagate the error only hides it and results in a hang anyway.
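To illustrate the difference in control flow, here is a minimal sketch in plain C. The macros `SKETCH_CHECKANDCONT` and `SKETCH_CHECK`, the two functions, and the `step_rc` array are hypothetical stand-ins invented for this example; they are not the MPICH definitions of MPIR_ERR_COLL_CHECKANDCONT or MPIR_ERR_CHECK, and no real communication takes place. The sketch only contrasts "record the error and keep going" with "bail out on the first error".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration only; NOT the MPICH macros. */

/* Check-and-continue style: remember the first error, keep executing. */
#define SKETCH_CHECKANDCONT(rc_, first_err_)      \
    do {                                          \
        if ((rc_) != 0 && (first_err_) == 0)      \
            (first_err_) = (rc_);                 \
    } while (0)

/* Plain check style: jump to the failure label on the first error. */
#define SKETCH_CHECK(rc_)                         \
    do {                                          \
        if ((rc_) != 0)                           \
            goto fn_fail;                         \
    } while (0)

/* step_rc[i] simulates the return code of the i-th communication step. */
int coll_check_and_continue(const int *step_rc, size_t nsteps, size_t *steps_run)
{
    int first_err = 0;

    assert(step_rc != NULL && steps_run != NULL);
    *steps_run = 0;
    for (size_t i = 0; i < nsteps; i++) {
        int rc = step_rc[i];
        (*steps_run)++;
        /* A failure is recorded, but the remaining steps still run. */
        SKETCH_CHECKANDCONT(rc, first_err);
    }
    return first_err;
}

int coll_fail_fast(const int *step_rc, size_t nsteps, size_t *steps_run)
{
    int rc = 0;

    assert(step_rc != NULL && steps_run != NULL);
    *steps_run = 0;
    for (size_t i = 0; i < nsteps; i++) {
        rc = step_rc[i];
        (*steps_run)++;
        /* The first failure ends the algorithm immediately. */
        SKETCH_CHECK(rc);
    }

  fn_exit:
    return rc;
  fn_fail:
    goto fn_exit;
}
```

With simulated return codes `{0, 7, 0, 9}`, both functions return 7, but the check-and-continue variant still executes all four steps while the fail-fast variant stops after the second. In a real collective, the steps after a failure are the ones that may wait on messages that never arrive.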

[skip warnings]

Author Checklist

  • [x] Provide Description: Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • [x] Commits Follow Good Practice: Commits are self-contained and do not do two things at once. Commit message is of the form "module: short description". Commit message explains what's in the commit.
  • [ ] Passes All Tests: Whitespace checker. Warnings test. Additional tests via comments.
  • [x] Contribution Agreement: For non-Argonne authors, check contribution agreement. If necessary, request an explicit comment from your company's PR approval manager.

hzhou, Aug 27 '24 04:08

test:mpich/ch3/most test:mpich/ch4/most

hzhou, Aug 27 '24 19:08

Is error propagation required by the MPI standard for collectives?

abrooks98, Aug 29 '24 14:08

> Is error propagation required by the MPI standard for collectives?

No.

hzhou, Aug 29 '24 15:08