mpich icon indicating copy to clipboard operation
mpich copied to clipboard

comm: Check for unmatched messages before releasing context_id

Open raffenet opened this issue 2 years ago • 1 comments

Pull Request Description

If there are still messages using a communicator context_id in the unexpected queue at destruction time, they could disturb message matching if the context_id is reused. To prevent this, do not allow a communicator with unmatched messages to be freed, and therefore retain the context_id. See pmodels/mpich#6184.

Author Checklist

  • [x] Provide Description Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • [x] Commits Follow Good Practice Commits are self-contained and do not do two things at once. Commit message is of the form: module: short description Commit message explains what's in the commit.
  • [ ] Passes All Tests Whitespace checker. Warnings test. Additional tests via comments.
  • [x] Contribution Agreement For non-Argonne authors, check contribution agreement. If necessary, request an explicit comment from your companies PR approval manager.

raffenet avatar Sep 21 '22 19:09 raffenet

test:mpich/ch4/most test:mpich/ch3/most

raffenet avatar Sep 21 '22 21:09 raffenet

test:mpich/ch4/most test:mpich/ch3/most

raffenet avatar Sep 22 '22 14:09 raffenet

test:mpich/ch4/most test:mpich/ch3/most

raffenet avatar Sep 22 '22 16:09 raffenet

test:mpich/ch4/most test:mpich/ch3/most test:mpich/warnings

raffenet avatar Sep 22 '22 16:09 raffenet

test:mpich/ch4/most test:mpich/ch3/most

raffenet avatar Sep 22 '22 16:09 raffenet

test:mpich/ch4/most test:mpich/ch3/most

raffenet avatar Sep 22 '22 16:09 raffenet

Something is up with Jenkins and/or Github. Jobs are failing when checking out the repo.

raffenet avatar Sep 22 '22 17:09 raffenet

test:mpich/ch4/most test:mpich/ch3/most

raffenet avatar Sep 22 '22 21:09 raffenet

test:mpich/ch4/most test:mpich/ch3/most test:mpich/warnings

raffenet avatar Sep 23 '22 15:09 raffenet

Dropped the commit changing the default error handler to MPI_COMM_SELF. It appears many of our tests rely on the default being MPI_COMM_WORLD. Will split that change to a separate PR that also fixes the tests.

test:mpich/ch4/most test:mpich/ch3/most

raffenet avatar Sep 23 '22 15:09 raffenet

test:mpich/ch4/most test:mpich/ch3/most

raffenet avatar Sep 27 '22 19:09 raffenet

This change is detecting unreceived messages from unaccepted intercommunicator connection requests. Should that be possible with an MPI_ANY_TAG, MPI_ANY_SOURCE probe/recv? I guess I assumed they would be isolated from regular pt2pt messages. In the Jenkins link, the messages are found on MPI_COMM_WORLD: https://jenkins-pmrs.cels.anl.gov/job/mpich-review-ch4-ofi/3943/jenkins_configure=ubsan,label=centos64_review/testReport/junit/(root)/errors_spawn/02755_____errors_spawn_connect_timeout_no_accept_2__MPIR_CVAR_CH3_COMM_CONNECT_TIMEOUT_20_MPIR_CVAR_CH4_COMM_CONNECT_TIMEOUT_20/

raffenet avatar Oct 07 '22 20:10 raffenet

test:mpich/ch4/most test:mpich/ch3/most

raffenet avatar Oct 07 '22 21:10 raffenet

This change is detecting unreceived messages from unaccepted intercommunicator connection requests. Should that be possible with an MPI_ANY_TAG, MPI_ANY_SOURCE probe/recv? I guess I assumed they would be isolated from regular pt2pt messages. In the Jenkins link, the messages are found on MPI_COMM_WORLD: https://jenkins-pmrs.cels.anl.gov/job/mpich-review-ch4-ofi/3943/jenkins_configure=ubsan,label=centos64_review/testReport/junit/(root)/errors_spawn/02755_____errors_spawn_connect_timeout_no_accept_2__MPIR_CVAR_CH3_COMM_CONNECT_TIMEOUT_20_MPIR_CVAR_CH4_COMM_CONNECT_TIMEOUT_20/

The issue is that the MPIDI_OFI_DYNPROC_SEND bit is ignored when matching user messages. It needs to be included so preserve the isolation. Will add another patch to fix...

raffenet avatar Oct 07 '22 23:10 raffenet

test:mpich/ch4/most test:mpich/ch3/most

raffenet avatar Oct 10 '22 15:10 raffenet

test:mpich/ch4/most test:mpich/ch3/most

raffenet avatar Oct 10 '22 15:10 raffenet

Tests look good. Warning test is unrelated and fixed in another PR, I believe. The lone tcp failure is a th_taskmaster timeout on Solaris, which happens from time to time.

raffenet avatar Oct 10 '22 18:10 raffenet

Pushed a minor whitespace fixup. This is ready for review.

raffenet avatar Oct 10 '22 18:10 raffenet

test:mpich/ch4/most test:mpich/ch3/most

raffenet avatar Oct 11 '22 17:10 raffenet

Sorry, forgot to squash my fixup before merging 🤦.

raffenet avatar Oct 12 '22 15:10 raffenet