
[5.0.0] MPI communication peer process has unexpectedly disconnected

SimZhou opened this issue 2 years ago · 1 comment

Background information

Open MPI: 5.0.0, built from source

Please describe the system on which you are running

  • Operating system/version: Ubuntu 18.04
  • Network type: k8s

Details of the problem

Training terminated after a while with the following error:

# error message:
... (normal training process)
...
[TR] rank: 5, norm: 1646.42, matches: 115, utts: 256, avg loss: 1.1620, batches: 860
[TR] rank: 2, norm: 1646.42, matches: 119, utts: 256, avg loss: 1.0515, batches: 860
[TR] rank: 0, norm: 1646.42, matches: 130, utts: 256, avg loss: 1.0197, batches: 860
[TR] rank: 7, norm: 1646.42, matches: 123, utts: 256, avg loss: 1.0018, batches: 860
[TR] rank: 3, norm: 1646.42, matches: 120, utts: 256, avg loss: 1.0964, batches: 860
[TR] rank: 4, norm: 1646.42, matches: 131, utts: 256, avg loss: 1.0161, batches: 860
[TR] rank: 6, norm: 1637.52, matches: 133, utts: 256, avg loss: 0.9388, batches: 870
[TR] rank: 1, norm: 1637.52, matches: 140, utts: 256, avg loss: 0.9769, batches: 870
[TR] rank: 2, norm: 1637.52, matches: 130, utts: 256, avg loss: 1.0458, batches: 870
[TR] rank: 4, norm: 1637.52, matches: 138, utts: 256, avg loss: 0.9570, batches: 870
[TR] rank: 5, norm: 1637.52, matches: 118, utts: 256, avg loss: 1.0001, batches: 870
[TR] rank: 0, norm: 1637.52, matches: 109, utts: 256, avg loss: 1.0895, batches: 870
[TR] rank: 3, norm: 1637.52, matches: 128, utts: 256, avg loss: 1.0204, batches: 870
[TR] rank: 7, norm: 1637.52, matches: 134, utts: 256, avg loss: 0.9811, batches: 870
[job-170078988023782404088-yihua-zhou-worker-1:00000] *** An error occurred in Socket closed
[job-170078988023782404088-yihua-zhou-worker-1:00000] *** reported by process [1778253825,2]
[job-170078988023782404088-yihua-zhou-worker-1:00000] *** on a NULL communicator
[job-170078988023782404088-yihua-zhou-worker-1:00000] *** Unknown error
[job-170078988023782404088-yihua-zhou-worker-1:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[job-170078988023782404088-yihua-zhou-worker-1:00000] ***    and MPI will try to terminate your MPI job as well)
^@--------------------------------------------------------------------------
An MPI communication peer process has unexpectedly disconnected.  This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).

Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer -- that is what you need to investigate.  For
example, there may be a core file that you can examine.  More
generally: such peer hangups are frequently caused by application bugs
or other external events.

  Local host: job-170078988023782404088-yihua-zhou-master-0
  Local PID:  62
  Peer host:  job-170078988023782404088-yihua-zhou-worker-2
--------------------------------------------------------------------------
^@/opt/conda/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 3 leaked semaphores to clean up at shutdown
  len(cache))
/opt/conda/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 3 leaked semaphores to clean up at shutdown
  len(cache))
/opt/conda/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 3 leaked semaphores to clean up at shutdown
  len(cache))
/opt/conda/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 3 leaked semaphores to clean up at shutdown
  len(cache))
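
For what it's worth, the abort above comes from the default MPI_ERRORS_ARE_FATAL handler named in the log. A minimal sketch of one way to get more detail at the point of failure, assuming the training script talks to Open MPI through mpi4py (the report does not say which binding is actually in use): switching the communicator's error handler to MPI.ERRORS_RETURN makes the disconnect surface as a catchable exception on the surviving ranks, which at least records which rank observed the failure.

  # Minimal sketch (assumes mpi4py; the actual MPI binding in the job may differ)
  from mpi4py import MPI

  comm = MPI.COMM_WORLD

  # Open MPI defaults to MPI_ERRORS_ARE_FATAL, so a peer disconnect aborts the
  # whole job. MPI.ERRORS_RETURN turns the failure into a catchable exception
  # on the surviving ranks instead.
  comm.Set_errhandler(MPI.ERRORS_RETURN)

  try:
      total = comm.allreduce(1, op=MPI.SUM)   # stand-in for a training collective
  except MPI.Exception as exc:
      # Log which rank saw the failure and what Open MPI reports for it.
      print(f"rank {comm.Get_rank()}: MPI error: {exc.Get_error_string()}",
            flush=True)
      raise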

SimZhou · Nov 24 '23 02:11

Hi @SimZhou, this could very well be outside of Open MPI. Do you have any indication that this is specifically ompi v5.0 related? Does it work with older ompi versions?

janjust · Nov 27 '23 20:11
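
Before comparing against older versions, it can help to confirm which Open MPI build every pod in the k8s job is actually linked against. A minimal sketch, assuming mpi4py is available in the training containers (not confirmed in the report):

  # Minimal sketch: log the linked MPI library version from every rank.
  from mpi4py import MPI

  comm = MPI.COMM_WORLD
  # For Open MPI, the first line of Get_library_version() reads e.g. "Open MPI v5.0.0, ...".
  print(f"rank {comm.Get_rank()} on {MPI.Get_processor_name()}: "
        f"{MPI.Get_library_version().splitlines()[0]}",
        flush=True)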