JieFu Zhu
Same problem here.
We consistently hit this error once the model has been running for several hours (almost always after about 4 hours, after the SL gets stuck waiting for merging),...
We just had another failure with the same setup. A different error message appears, but the behavior is similar. The training had been going well for 4 hours, but one...
We are facing the same problem again, even after extending the SL failure timeout to a sufficiently long value. One of the nodes froze during the merging process and the...
[swarm_logs.zip](https://github.com/HewlettPackard/swarm-learning/files/14535181/swarm_logs.zip)
logs from the failing node: [swarm_logs (1).zip](https://github.com/HewlettPackard/swarm-learning/files/14537758/swarm_logs.1.zip)