Ma JieYue
### What changes were proposed in this pull request? Wrap _get_master_addr_port in a try-except block and retry the call to mitigate the port synchronization problem. ### Why are the...
# Background DLRover is an elastic deep learning framework with fault tolerance for process failures, pod loss, and similar events. Since LLM training runs at large scale and always spans a...
### What changes were proposed in this pull request? Adapt to Ascend NPU cases: stop the workers and make sure no remaining processes are using the NPU. ### Why are...