
Horovod training MPI_Allreduce MPI_ERR_INTERN: internal error, with restart_after_num_net_reinit

albertz opened this issue 3 years ago

It had already trained 10 subepochs, then restart_after_num_net_reinit became effective, and then I got this error:

...
reinit network too often, 1 times after 10 training epochs, restart
...
[cluster-cn-247:28492] *** An error occurred in MPI_Allreduce
[cluster-cn-247:28492] *** reported by process [635830273,3]
[cluster-cn-247:28492] *** on communicator MPI COMMUNICATOR 8 DUP FROM 7
[cluster-cn-247:28492] *** MPI_ERR_INTERN: internal error
[cluster-cn-247:28492] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort, 
[cluster-cn-247:28492] ***    and potentially your MPI job)
...
[cluster-cn-231][[9702,1],1][btl_tcp_frag.c:237:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104) 
...
[cluster-cn-231:08438] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[cluster-cn-231:08438] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

I assume the restart due to restart_after_num_net_reinit was somehow too abrupt and confused some of the other workers.

But I don't exactly understand it. Why are other workers still waiting in MPI_Allreduce when one worker is already doing the restart?

Anyway, we probably need to properly sync or signal the exit before we actually do this restart.
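
For illustration, a minimal sketch of what such a synchronization could look like. This is not RETURNN's actual code; it assumes TF2 eager mode and the horovod.tensorflow backend, and the function name sync_restart_decision is made up here. The idea is that the restart decision itself goes through an allreduce, so every rank reaches the same point before any rank tears down, and no rank is left blocked in a pending MPI_Allreduce.

```python
import tensorflow as tf
import horovod.tensorflow as hvd  # assuming the TF backend

def sync_restart_decision(want_restart: bool) -> bool:
    """Agree across all ranks whether to restart; every rank gets the same answer."""
    flag = tf.constant([1 if want_restart else 0], dtype=tf.int32)
    # The allreduce acts as a barrier: every rank must reach this line before
    # any rank proceeds, so no other collective op can still be pending.
    total = hvd.allreduce(flag, op=hvd.Sum)
    # Restart as soon as any rank asks for it (policy could also be "all ranks").
    return bool(total.numpy()[0] > 0)
```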

albertz · Nov 18 '22 08:11

Happens again:

...
epoch 20 'devtrain' eval, finished after 153 steps, 0:00:17 elapsed (81.9% computing time)
...
reinit network too often, 2 times after 10 training epochs, restart
...
[cluster-cn-286:17972] *** An error occurred in MPI_Allreduce
[cluster-cn-286:17972] *** reported by process [2925658113,0]
[cluster-cn-286:17972] *** on communicator MPI COMMUNICATOR 8 DUP FROM 7
[cluster-cn-286:17972] *** MPI_ERR_INTERN: internal error
[cluster-cn-286:17972] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[cluster-cn-286:17972] ***    and potentially your MPI job)

albertz · Nov 18 '22 10:11

I noticed we never properly call hvd.shutdown. I have now added that, also in the reinit logic, although I'm not totally sure whether this works well, i.e. whether we can reinit it later.
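
For reference, a rough sketch of that pattern; this is an illustration under assumptions, not the actual RETURNN change, and how the process re-executes itself (e.g. via os.execv) is glossed over here.

```python
import horovod.tensorflow as hvd

# In the reinit/restart path, before the process replaces itself:
hvd.shutdown()  # tear down Horovod's background threads and MPI state

# ...the process then re-executes itself (e.g. via os.execv)...

# In the freshly started process:
hvd.init()  # re-initialize Horovod; all ranks have to reach this point again
```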

albertz · Nov 18 '22 10:11

> I noticed we never properly call hvd.shutdown. I have now added that, also in the reinit logic, although I'm not totally sure whether this works well, i.e. whether we can reinit it later.

This does not seem to work well. It restarts and then seems to hang in hvd.init:

reinit network too often, 2 times after 10 training epochs, restart
2022-11-18 13:57:25.306450: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2022-11-18 13:57:26.932290: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1 
Horovod: 0.26.1 /u/zeyer/.local/lib/python3.8/site-packages/horovod/__init__.py
Horovod: 0.26.1 /u/zeyer/.local/lib/python3.8/site-packages/horovod/__init__.py
Horovod: 0.26.1 /u/zeyer/.local/lib/python3.8/site-packages/horovod/__init__.py

Related: https://github.com/horovod/horovod/issues/159 and https://github.com/horovod/horovod/issues/667

albertz · Nov 18 '22 20:11