ompi icon indicating copy to clipboard operation
ompi copied to clipboard

v5.0.x: Propagate the error up to the user.

Open jsquyres opened this issue 1 year ago • 8 comments
trafficstars

Make sure to reset the generalized request to guarantee not to call the free callback a second time.

Signed-off-by: George Bosilca [email protected] (cherry picked from commit ac3647e16d2b9765fec2e60a61c92e9b8661efc4)

This is the v5.0.x PR corresponding to main PR #11683 Refs #11681.

FYI @dalcinl

jsquyres avatar Jan 19 '24 19:01 jsquyres

@jsquyres I run this PR: https://github.com/mpi4py/mpi4py-testing/actions/runs/7588222142/job/20670268020 My test is failing, but that's not necessarily Open MPI's fault, but my assumptions. This is how things go:

  1. I call Wait, then the error in the free_fn callback triggers, and MPI reports back the error. All good so far.
  2. The call to Wait does set the request to NULL, so it is not yet deallocated.
  3. I call Free to fully delete the request. But this call fails with MPI_ERR_INTERN. Are you sure that's the proper behavior? Technically, there is no error this time, as the free_fn routine is not invoked again, and I believe there is no error this time to report, as the actual error was already reported at Wait.

PS: For reference, this test is passing with MPICH, but that's simply because of their implementation choices.

dalcinl avatar Jan 19 '24 19:01 dalcinl

@dalcinl @jsquyres what is the plan for this change? do we want to carry it in the release next week?

wenduwan avatar Jan 22 '24 19:01 wenduwan

I'll perform a new test today with ompi@main, I'm not convinced the current behavior is the right one (not that my opinion matters, of course).

dalcinl avatar Jan 23 '24 07:01 dalcinl

@dalcinl Did you get a chance to do a test, our goal was to get an rc out today.

janjust avatar Jan 24 '24 14:01 janjust

@janjust Yes, I added my comments in #11681 with a request for the issue to be reopened. IMHO, things are still broken. If it were my call, I would not merge this PR.

dalcinl avatar Jan 24 '24 14:01 dalcinl

Thanks - let me discuss with RMs, thank you very much for the reproducers!

janjust avatar Jan 24 '24 14:01 janjust

Now that @hppritcha has added mpi4py for v5.0.x, I rebased this PR and will check CI result. I assume it's safe to merge if CI passes.

wenduwan avatar Mar 21 '24 21:03 wenduwan

@wenduwan Please take into account #12392, that's IMHO the definitive fix, the changes there should be added here.

dalcinl avatar Mar 22 '24 08:03 dalcinl