
Sessions: error checking for invalid argument

Open dalcinl opened this issue 3 years ago • 7 comments

While extending mpi4py test suite, I've found several issues related to error checking and invalid arguments.

Issues related to using the MPI_SESSION_NULL handle

The following routines do not fail when invoked with the MPI_SESSION_NULL handle:

  • MPI_Session_get_num_psets (deadlocks)
  • MPI_Session_get_nth_pset
  • MPI_Group_from_session_pset
  • MPI_Session_get_errhandler
  • MPI_Session_set_errhandler
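A minimal sketch of how the list above could be checked from mpi4py, assuming a build (mpi4py 4.x on top of an MPI library with Sessions support) that exposes `MPI.SESSION_NULL`. The helper names `error_class_of` and `null_session_report` are hypothetical. MPI_Session_get_num_psets is left out on purpose, since it is reported to deadlock.

```python
def error_class_of(call, exc_type):
    """Run call(); return the MPI error class it raised, or None if it did not raise."""
    try:
        call()
    except exc_type as exc:
        return exc.Get_error_class()
    return None


def null_session_report():
    """Call each routine on the null session handle and record the outcome."""
    # Imported lazily so the helper above stays usable without MPI installed.
    from mpi4py import MPI

    null = MPI.SESSION_NULL
    calls = {
        "MPI_Session_get_nth_pset": lambda: null.Get_nth_pset(0),
        "MPI_Group_from_session_pset": lambda: MPI.Group.Create_from_session_pset(
            null, "mpi://WORLD"
        ),
        "MPI_Session_get_errhandler": lambda: null.Get_errhandler(),
        "MPI_Session_set_errhandler": lambda: null.Set_errhandler(MPI.ERRORS_RETURN),
    }
    # A value of None means the call returned without an error.
    return {name: error_class_of(call, MPI.Exception) for name, call in calls.items()}
```

Running `print(null_session_report())` under `mpiexec` would show, per routine, either an error class or `None`; any `None` entry corresponds to a routine that accepted the null handle.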

Issues related to other invalid arguments

  • MPI_Session_get_nth_pset: Using an index n that is positive but out of bounds (greater than or equal to the number of process sets) deadlocks rather than returning an error code (I would suggest using the MPI_ERR_ARG error class).
  • MPI_Session_get_pset_info: Requesting the info object for a nonexistent pset name (e.g. something like the pset_name string "@qerty!#$") does not fail. Instead, it succeeds and returns an info object with a single key/value pair ("size", "0").
  • MPI_Group_from_session_pset: Trying to create a group from a nonexistent pset name (e.g. something like the pset_name string "@qerty!#$") errors with MPI_ERR_INTERN. It would be much more informative to users if MPI_ERR_ARG were used instead.
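The two pset-name cases above could be probed along these lines. This is only a sketch: `probe_pset_name` and `main` are hypothetical names, and the session and group factory are passed in as parameters so the probing logic does not depend on a particular MPI build.

```python
def probe_pset_name(session, make_group, pset_name, exc_type):
    """Report what each pset lookup does for pset_name."""
    report = {}
    try:
        info = session.Get_pset_info(pset_name)
    except exc_type as exc:
        report["pset_info"] = ("error", exc.Get_error_class())
    else:
        # The issue reports a single ("size", "0") pair for a bogus name.
        report["pset_info"] = ("ok", info.Get("size"))
        info.Free()
    try:
        group = make_group(session, pset_name)
    except exc_type as exc:
        report["group"] = ("error", exc.Get_error_class())
    else:
        # The group is not freed in this sketch.
        report["group"] = ("ok", group.Get_size())
    return report


def main():
    # Imported lazily; requires mpi4py built against an MPI with Sessions support.
    from mpi4py import MPI

    session = MPI.Session.Init()
    try:
        report = probe_pset_name(
            session, MPI.Group.Create_from_session_pset, "@qerty!#$", MPI.Exception
        )
        print(report, "MPI_ERR_ARG =", MPI.ERR_ARG)
    finally:
        session.Finalize()
```

Calling `main()` under `mpiexec` prints the outcome of both lookups next to the numeric value of MPI_ERR_ARG, so the reported error class can be compared by eye.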

dalcinl avatar Jul 21 '22 08:07 dalcinl

Refs #10589

jsquyres avatar Jul 21 '22 16:07 jsquyres

The MPI_Session_get_pset_info call with an invalid pset name may be a PMIx issue. Checking...

hppritcha avatar Aug 12 '22 21:08 hppritcha

@hppritcha I'm seeing just one issue left, although perhaps I did not report it before. Sorry about that, there were many loose ends, and I just missed the following one.

  • MPI_Session_get_info() with MPI_SESSION_NULL fails with error MPI_ERR_ARG, but should fail with MPI_ERR_SESSION.
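A sketch of how the error class mismatch could be asserted from mpi4py, assuming `MPI.SESSION_NULL` and `MPI.ERR_SESSION` are available (mpi4py 4.x). The names `classify` and `check_get_info_on_null` are hypothetical.

```python
def classify(call, exc_type, expected_class):
    """Return 'expected', 'wrong class N', or 'no error' for call()."""
    try:
        call()
    except exc_type as exc:
        actual = exc.Get_error_class()
        if actual == expected_class:
            return "expected"
        return f"wrong class {actual}"
    return "no error"


def check_get_info_on_null():
    # Imported lazily; requires mpi4py built against an MPI with Sessions support.
    from mpi4py import MPI

    return classify(
        lambda: MPI.SESSION_NULL.Get_info(), MPI.Exception, MPI.ERR_SESSION
    )
```

With the behavior described above, `check_get_info_on_null()` would be expected to return a "wrong class" string carrying the numeric value of MPI_ERR_ARG rather than "expected".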

dalcinl avatar Aug 17 '22 18:08 dalcinl

@hppritcha Another recent issue [logs]: it happened on GitHub Actions with 3 MPI processes (but not with 1 or 2). I could not reproduce it with main (updated a few hours ago). I'll try to restart the build.

The failing test is related to the following reproducer:

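from mpi4py import MPI

session = MPI.Session.Init()
num = session.Get_num_psets()
try:
    # Valid indices are 0..num-1, so index num is out of bounds.
    pset = session.Get_nth_pset(num)
except MPI.Exception:
    pass
else:
    # Reaching this branch means no error was reported.
    print("Exception not raised!")
    MPI.COMM_WORLD.Abort()
session.Finalize()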

As you can see in the reproducer, trying to get an out-of-bounds pset index should fail with an exception. But that's not happening in GitHub Actions when using 3 MPI processes.

Not sure whether this is relevant, but GitHub Actions runners have 2 virtual cores, so I'm running with oversubscription turned on.

dalcinl avatar Aug 31 '22 14:08 dalcinl

@hppritcha Another possibly related issue: when running in singleton init mode, the reproducer above deadlocks at the ~pset = session.Get_nth_pset(num)~ num = session.Get_num_psets() line.

dalcinl avatar Sep 01 '22 15:09 dalcinl

@hppritcha Now I'm not sure this new issue is related to this one. Do you want me to open a new one?

dalcinl avatar Sep 01 '22 16:09 dalcinl

It's a different problem, so please open a different issue.

hppritcha avatar Sep 01 '22 18:09 hppritcha

Closed via #10744 and #10784

hppritcha avatar Sep 23 '22 20:09 hppritcha