mpi4py: Regression in spawn tests
I believe changes over the last week may have introduced issues in spawn support. Two successive runs of the mpi4py test suite both failed at the same point. From the traceback, it looks like the issue happens while the children run MPI_Init_thread.
https://github.com/mpi4py/mpi4py-testing/runs/7703615156?check_suite_focus=true#step:17:1365
Traceback from the link above:
testArgsOnlyAtRootMultiple (test_spawn.TestSpawnSelf) ... [fv-az292-337:164868] *** Process received signal ***
[fv-az292-337:164868] Signal: Segmentation fault (11)
[fv-az292-337:164868] Signal code: Address not mapped (1)
[fv-az292-337:164868] Failing at address: 0x55a66b9ee180
[fv-az292-337:164868] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7fdecf9c8090]
[fv-az292-337:164868] [ 1] /usr/local/lib/libopen-pal.so.0(+0xc8fc4)[0x7fdecebcffc4]
[fv-az292-337:164868] [ 2] /usr/local/lib/libopen-pal.so.0(mca_btl_sm_poll_handle_frag+0x45)[0x7fdecebd1733]
[fv-az292-337:164868] [ 3] /usr/local/lib/libopen-pal.so.0(+0xca9ab)[0x7fdecebd19ab]
[fv-az292-337:164868] [ 4] /usr/local/lib/libopen-pal.so.0(+0xcacab)[0x7fdecebd1cab]
[fv-az292-337:164868] [ 5] /usr/local/lib/libopen-pal.so.0(opal_progress+0x43)[0x7fdeceb3bd6f]
[fv-az292-337:164868] [ 6] /usr/local/lib/libopen-pal.so.0(ompi_sync_wait_mt+0x1ef)[0x7fdecebf1d3f]
[fv-az292-337:164868] [ 7] /usr/local/lib/libmpi.so.0(+0xa813e)[0x7fdececec13e]
[fv-az292-337:164868] [ 8] /usr/local/lib/libmpi.so.0(ompi_request_default_wait+0x2b)[0x7fdececec385]
[fv-az292-337:164868] [ 9] /usr/local/lib/libmpi.so.0(ompi_coll_base_bcast_intra_generic+0x760)[0x7fdecedd304b]
[fv-az292-337:164868] [10] /usr/local/lib/libmpi.so.0(ompi_coll_base_bcast_intra_pipeline+0x1a3)[0x7fdecedd3551]
[fv-az292-337:164868] [11] /usr/local/lib/libmpi.so.0(ompi_coll_tuned_bcast_intra_do_this+0x126)[0x7fdecee0bd76]
[fv-az292-337:164868] [12] /usr/local/lib/libmpi.so.0(ompi_coll_tuned_bcast_intra_dec_fixed+0x43c)[0x7fdecee02832]
[fv-az292-337:164868] [13] /usr/local/lib/libmpi.so.0(ompi_dpm_connect_accept+0x8a8)[0x7fdececbf3b7]
[fv-az292-337:164868] [14] /usr/local/lib/libmpi.so.0(ompi_dpm_dyn_init+0xd6)[0x7fdececccb28]
[fv-az292-337:164868] [15] /usr/local/lib/libmpi.so.0(ompi_mpi_init+0x837)[0x7fdececeeece]
[fv-az292-337:164868] [16] /usr/local/lib/libmpi.so.0(PMPI_Init_thread+0xdd)[0x7fdeced59548]
[fv-az292-337:164868] [17] /opt/hostedtoolcache/Python/3.10.5/x64/lib/python3.10/site-packages/mpi4py/MPI.cpython-310-x86_64-linux-gnu.so(+0x33f67)[0x7fdecf14af67]
[fv-az292-337:164868] [18] /opt/hostedtoolcache/Python/3.10.5/x64/lib/libpython3.10.so.1.0(PyModule_ExecDef+0x73)[0x7fdecfdcc0c3]
[fv-az292-337:164868] [19] /opt/hostedtoolcache/Python/3.10.5/x64/lib/libpython3.10.so.1.0(+0x274460)[0x7fdecfdfa460]
[fv-az292-337:164868] [20] /opt/hostedtoolcache/Python/3.10.5/x64/lib/libpython3.10.so.1.0(+0x19745e)[0x7fdecfd1d45e]
[fv-az292-337:164868] [21] /opt/hostedtoolcache/Python/3.10.5/x64/lib/libpython3.10.so.1.0(PyObject_Call+0x8e)[0x7fdecfceeffe]
[fv-az292-337:164868] [22] /opt/hostedtoolcache/Python/3.10.5/x64/lib/libpython3.10.so.1.0(_PyEval_EvalFrameDefault+0x630b)[0x7fdecfd6bddb]
[fv-az292-337:164868] [23] /opt/hostedtoolcache/Python/3.10.5/x64/lib/libpython3.10.so.1.0(+0x1de4dc)[0x7fdecfd644dc]
[fv-az292-337:164868] [24] /opt/hostedtoolcache/Python/3.10.5/x64/lib/libpython3.10.so.1.0(_PyEval_EvalFrameDefault+0x5021)[0x7fdecfd6aaf1]
[fv-az292-337:164868] [25] /opt/hostedtoolcache/Python/3.10.5/x64/lib/libpython3.10.so.1.0(+0x1de4dc)[0x7fdecfd644dc]
[fv-az292-337:164868] [26] /opt/hostedtoolcache/Python/3.10.5/x64/lib/libpython3.10.so.1.0(_PyEval_EvalFrameDefault+0x773)[0x7fdecfd66243]
[fv-az292-337:164868] [27] /opt/hostedtoolcache/Python/3.10.5/x64/lib/libpython3.10.so.1.0(+0x1de4dc)[0x7fdecfd644dc]
[fv-az292-337:164868] [28] /opt/hostedtoolcache/Python/3.10.5/x64/lib/libpython3.10.so.1.0(_PyEval_EvalFrameDefault+0x33e)[0x7fdecfd65e0e]
[fv-az292-337:164868] [29] /opt/hostedtoolcache/Python/3.10.5/x64/lib/libpython3.10.so.1.0(+0x1de4dc)[0x7fdecfd644dc]
[fv-az292-337:164868] *** End of error message ***
[fv-az292-337:164866] OPAL ERROR: Server not available in file dpm/dpm.c at line 403
[fv-az292-337:164855] OPAL ERROR: Server not available in file dpm/dpm.c at line 403
ERROR
[fv-az292-337:164867] OPAL ERROR: Server not available in file dpm/dpm.c at line 403
testCommSpawn (test_spawn.TestSpawnSelf) ... [fv-az292-337:00000] *** An error occurred in MPI_Init_thread
[fv-az292-337:00000] *** reported by process [1431306243,1]
[fv-az292-337:00000] *** on a NULL communicator
[fv-az292-337:00000] *** Unknown error
[fv-az292-337:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[fv-az292-337:00000] *** and MPI will try to terminate your MPI job as well)
ok
testCommSpawnMultiple (test_spawn.TestSpawnSelf) ... 2 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
Extra bits from valgrind (local run with a debug build):
==1494211== Conditional jump or move depends on uninitialised value(s)
==1494211== at 0x167996FC: pmix_bfrops_base_value_unload (bfrop_base_fns.c:409)
==1494211== by 0x16798687: PMIx_Value_unload (bfrop_base_fns.c:54)
==1494211== by 0x16065326: ompi_dpm_connect_accept (dpm.c:423)
==1494211== by 0x160CCFD5: PMPI_Comm_spawn_multiple (comm_spawn_multiple.c:199)
==1494211== by 0x15F1DD69: __pyx_pf_6mpi4py_3MPI_9Intracomm_38Spawn_multiple (MPI.c:149745)
==1494211== by 0x15F1D6F8: __pyx_pw_6mpi4py_3MPI_9Intracomm_39Spawn_multiple (MPI.c:149423)
==1494211== by 0x4991160: cfunction_call (methodobject.c:543)
==1494211== by 0x498D262: _PyObject_MakeTpCall (call.c:215)
==1494211== by 0x498C590: UnknownInlinedFun (abstract.h:112)
==1494211== by 0x498C590: UnknownInlinedFun (abstract.h:99)
==1494211== by 0x498C590: UnknownInlinedFun (abstract.h:123)
==1494211== by 0x498C590: call_function (ceval.c:5869)
==1494211== by 0x4985D92: _PyEval_EvalFrameDefault (ceval.c:4231)
==1494211== by 0x49838D2: UnknownInlinedFun (pycore_ceval.h:46)
==1494211== by 0x49838D2: _PyEval_Vector (ceval.c:5065)
==1494211== by 0x4998FD7: UnknownInlinedFun (call.c:342)
==1494211== by 0x4998FD7: UnknownInlinedFun (abstract.h:114)
==1494211== by 0x4998FD7: method_vectorcall (classobject.c:53)
@awlauria Is this due to PMIx / PRTE updates?
@jsquyres Looks like that's the case. The two builds below show that the regression comes from 4896db17dda399c6c389408a7dad1395d3094521.
- https://github.com/mpi4py/mpi4py-testing/runs/7706338665?check_suite_focus=true corresponds to 5c302accafaf03d00950d3444fc3d29314bb88b1 (good)
- https://github.com/mpi4py/mpi4py-testing/runs/7706340330?check_suite_focus=true corresponds to 4896db17dda399c6c389408a7dad1395d3094521 (broken)
@awlauria Can you provide an ETA for looking into this?
An additional pointer: using a local debug build, the issue seems to happen only with MPI_Comm_spawn_multiple(). All of my tests involving MPI_Comm_spawn() are successful.
A quick glance at the trace shows the failure is in the btl/sm component. A grep of that code shows the only PMIx dependency is on a modex_recv of PMIX_LOCAL_RANK, with the module subsequently attempting to connect/send to the proc of that local rank.
The problem is clearly that the btl/sm is looking for the wrong value here. It needs to look for PMIX_NODE_RANK. I've told you folks this multiple times, and it has indeed been fixed before - but it seems to keep getting re-broken.
Just curious: am I the only one doing any triage on these issues? I don't look at many nor very often, but when I do look at one, it seems that the reason for the problem is very quick/easy to identify.
A simple print statement is all that is required to immediately show the problem - printing out the backing file:
[Ralphs-iMac-2.local:83492] BACKING FILE /Users/rhc/tmp/prte.Ralphs-iMac-2.1000/dvm.83489/1/sm_segment.Ralphs-iMac-2.1000.12790001.2
[Ralphs-iMac-2.local:83491] BACKING FILE /Users/rhc/tmp/prte.Ralphs-iMac-2.1000/dvm.83489/1/sm_segment.Ralphs-iMac-2.1000.12790001.1
[Ralphs-iMac-2.local:83490] BACKING FILE /Users/rhc/tmp/prte.Ralphs-iMac-2.1000/dvm.83489/1/sm_segment.Ralphs-iMac-2.1000.12790001.0
Parent [pid 83490] about to spawn!
Parent [pid 83492] about to spawn!
Parent [pid 83491] about to spawn!
[Ralphs-iMac-2.local:83494] BACKING FILE /Users/rhc/tmp/prte.Ralphs-iMac-2.1000/dvm.83489/2/sm_segment.Ralphs-iMac-2.1000.12790002.0
[Ralphs-iMac-2.local:83493] BACKING FILE /Users/rhc/tmp/prte.Ralphs-iMac-2.1000/dvm.83489/2/sm_segment.Ralphs-iMac-2.1000.12790002.0
where the last digit of the filename is the local rank. You can see that the spawned procs step on each other's backing file because they use their local rank, which is the same since they come from two app_contexts. The connection to the local rank is made by:
#define MCA_BTL_SM_LOCAL_RANK opal_process_info.my_local_rank
All you need do is change it to the node rank.
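In concrete terms, the suggested change would be along these lines (a minimal sketch only; it assumes the node-rank counterpart of that field in opal_process_info is my_node_rank, the same field printed in the debug output further down this thread):

/* Hedged sketch of the suggestion: key the btl/sm peer lookup off the
 * node rank (never reused on a node) instead of the local rank (reused
 * across app contexts, and across restarts once FT is in play). */
#define MCA_BTL_SM_LOCAL_RANK opal_process_info.my_node_rank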
I created a small test program: here
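(The actual program is behind the link above; the following is only a hedged sketch of what such a test might look like, with the argument handling and output strings inferred from the transcripts below.)

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    MPI_Comm parent;
    MPI_Comm_get_parent(&parent);

    if (MPI_COMM_NULL != parent) {
        /* Child: the single spawn argument names its app context. */
        printf("Hello from a Child (%s)\n", argc > 1 ? argv[1] : "?");
        MPI_Comm_disconnect(&parent);
    } else {
        /* Parent: spawn two app contexts of the executable named on the
         * command line -- 1 proc with arg "A" and 2 procs with arg "B". */
        char *cmds[2]     = { argv[1], argv[1] };
        char *args_a[]    = { "A", NULL };
        char *args_b[]    = { "B", NULL };
        char **argvs[2]   = { args_a, args_b };
        int maxprocs[2]   = { 1, 2 };
        MPI_Info infos[2] = { MPI_INFO_NULL, MPI_INFO_NULL };
        MPI_Comm intercomm;

        printf("Spawning Multiple '%s' ... ", argv[1]);
        fflush(stdout);
        MPI_Comm_spawn_multiple(2, cmds, argvs, maxprocs, infos, 0,
                                MPI_COMM_SELF, &intercomm,
                                MPI_ERRCODES_IGNORE);
        MPI_Comm_disconnect(&intercomm);
        printf("OK\n");
    }

    MPI_Finalize();
    return 0;
}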
If I run with ucx, it passes both of the following:
shell$ export OMPI_MCA_pml=ucx
shell$ mpirun -np 1 ./simple_spawn_multiple ./simple_spawn_multiple
Hello from a Child (A)
Hello from a Child (B)
Hello from a Child (B)
Spawning Multiple './simple_spawn_multiple' ... OK
shell$ ./simple_spawn_multiple ./simple_spawn_multiple
Spawning Multiple './simple_spawn_multiple' ... OK
We don't get the IO from the child processes in the second example (singleton spawn multiple), but that's a separate issue.
However, if I use ob1, then I can reproduce this issue, ending in a segv. Making the change Ralph suggested ended in a hang instead. Investigating the hang further led to the conclusion that there is a bug in PRRTE.
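(For reference, the ob1 run was presumably selected the same way as the ucx run above - a hedged guess at the invocation, not copied from the original report; the resulting segv traceback is the one at the top of this issue:)

shell$ export OMPI_MCA_pml=ob1
shell$ mpirun -np 1 ./simple_spawn_multiple ./simple_spawn_multiple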
I took a look at the suggestion from @rhc54 and I don't think that will work. It will address the backing-file problem, but the processes are now confused because they are getting incorrect values (it seems to me) for local_rank and local_peers. OpenPMIx/PRRTE is returning values relative to their app context, not the single spawn operation.
A bit of debugging to help - I added the following towards the end of ompi/runtime/ompi_rte.c
+ opal_output(0, "JJH DEBUG) %d is [%s:%d] / [%d:%d] local_rank = %d, local_peers = %d, node_rank = %d",
+ getpid(),
+ opal_process_info.myprocid.nspace, opal_process_info.myprocid.rank,
+ opal_process_info.my_name.jobid, opal_process_info.my_name.vpid,
+ opal_process_info.my_local_rank,
+ opal_process_info.num_local_peers,
+ opal_process_info.my_node_rank);
[jjhursey@f5n17 mpi] mpirun -np 1 ./simple_spawn_multiple ./simple_spawn_multiple
[f5n17:3484068] JJH DEBUG) 3484068 is [prterun-f5n17-3484059@1:0] / [1625489409:0] local_rank = 0, local_peers = 0, node_rank = 0
[f5n17:3484072] JJH DEBUG) 3484072 is [prterun-f5n17-3484059@2:1] / [1625489410:1] local_rank = 0, local_peers = 2, node_rank = 2
[f5n17:3484071] JJH DEBUG) 3484071 is [prterun-f5n17-3484059@2:0] / [1625489410:0] local_rank = 0, local_peers = 2, node_rank = 1
[f5n17:3484073] JJH DEBUG) 3484073 is [prterun-f5n17-3484059@2:2] / [1625489410:2] local_rank = 1, local_peers = 2, node_rank = 3
- PID 3484068 is the parent (the one calling MPI_Comm_spawn_multiple)
  - local_rank = 0, local_peers = 0, node_rank = 0
  - Its PMIx namespace (prterun-f5n17-3484059@1) is distinct from the children's (prterun-f5n17-3484059@2), which is expected
- Children:
  - PID 3484071 is the 1 process in the first app context passed to MPI_Comm_spawn_multiple
    - PMIx name: [prterun-f5n17-3484059@2:0]
  - PIDs 3484072 and 3484073 are the 2 processes in the second app context passed to MPI_Comm_spawn_multiple
    - PMIx names: [prterun-f5n17-3484059@2:1] and [prterun-f5n17-3484059@2:2]
- So the rank in the PMIx name is correct, and the namespace is unique for the full set of 3 processes in the namespace.
- However, the local_rank (PMIX_LOCAL_RANK) and local_peers (PMIX_LOCAL_SIZE) values are not relative to the namespace, but relative to the app context.
  - It seems that their values correspond to PMIX_APP_RANK and PMIX_APP_SIZE instead.
From the PMIx standard 4.1:
- PMIX_LOCAL_RANK
  - Rank of the specified process on its node - refers to the numerical location (starting from zero) of the process on its node when counting only those processes from the same job that share the node, ordered by their overall rank within that job.
- PMIX_LOCAL_SIZE
  - Number of processes in the specified job or application realm on the caller’s node. Defaults to job realm unless the PMIX_APP_INFO and the PMIX_APPNUM qualifiers are given.
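(As a sanity check, one could also query these keys directly from PMIx inside a spawned child - a minimal hedged sketch using the standard PMIx client API, not code from this thread:)

#include <stdio.h>
#include <pmix.h>

/* Print what the runtime reports for this process's local and node rank.
 * Error handling is mostly elided for brevity. */
int main(void)
{
    pmix_proc_t myproc;
    pmix_value_t *val = NULL;

    PMIx_Init(&myproc, NULL, 0);

    if (PMIX_SUCCESS == PMIx_Get(&myproc, PMIX_LOCAL_RANK, NULL, 0, &val)) {
        printf("%s.%u: PMIX_LOCAL_RANK = %u\n",
               myproc.nspace, myproc.rank, (unsigned) val->data.uint16);
        PMIX_VALUE_RELEASE(val);
    }
    if (PMIX_SUCCESS == PMIx_Get(&myproc, PMIX_NODE_RANK, NULL, 0, &val)) {
        printf("%s.%u: PMIX_NODE_RANK = %u\n",
               myproc.nspace, myproc.rank, (unsigned) val->data.uint16);
        PMIX_VALUE_RELEASE(val);
    }

    PMIx_Finalize(NULL, 0);
    return 0;
}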
This seems to indicate that there is a bug in PRRTE that needs fixing.
I filed an issue on the PRRTE side to gain visibility: https://github.com/openpmix/prrte/issues/1445
I think I found the problem in PRRTE, but I'll need @rhc54 to help with the fix. See the note here
The fix has been committed to PRRTE master and ported to v3.0. It will fix this immediate problem, but the broader issue remains.
Let me explain my comments about node vs local rank. The rationale behind node rank lies in the fault tolerance area. If a proc from a given app dies on a node, and then a proc from that app (either the one that died or some migration) is restarted on that node, then the local rank gets reused - but the node rank does not. If you are using local rank, the app has the potential to crash on that node as the conflict will take down all the procs that were "connected" via the btl/sm to that local rank. Before you had FT, it didn't really make much difference - now that OMPI is supporting FT, it is problematic.
If you have added logic elsewhere in OMPI to correct the problem, then perhaps this is not as critical as it used to be. Nathan and I had spent a fair amount of time on this issue and concluded that using node rank was the best solution, but perhaps that has changed.
For reference:
- PRRTE master : https://github.com/openpmix/prrte/pull/1446 (merged)
- PRRTE v3.0 : https://github.com/openpmix/prrte/pull/1447 (merged)
@awlauria We will need to pick this PRRTE change up as well.
FYI: I can confirm that the PRRTE fix addresses this issue. I changed my prrte submodule to the v3.0 branch (including 1447) and was able to run successfully without any OMPI modifications:
[jjhursey@f5n17 mpi] ./simple_spawn_multiple ./simple_spawn_multiple
Spawning Multiple './simple_spawn_multiple' ... OK
[jjhursey@f5n17 mpi] mpirun -np 1 ./simple_spawn_multiple ./simple_spawn_multiple
Hello from a Child (B)
Hello from a Child (B)
Hello from a Child (A)
Spawning Multiple './simple_spawn_multiple' ... OK
What Ralph mentions about using the node rank vs. the local rank makes sense. I filed PR #10690 to make that change, but I want someone supporting FT to review it, so I flagged @abouteiller.
I filed Issue #10691 to track the missing IO.
Once the PRRTE submodule is updated, this ticket can be closed.
The fixes were merged into PRRTE v3 and the submodule pointer for Open MPI v5.0.x has been updated. I think we are good to close this issue.