dlrover
Worker pod stuck in Pending state causing TimeoutError and incorrect handling by master
If one of the workers fails to start, for example by getting stuck in the Pending state, the other pods fail with the error below: TimeoutError: Timeout 5400s to complete rendezvous.
In this case, the master should kill the pending pod instead of repeatedly restarting the other pods.
Description:
One of the worker pods fails to start and gets stuck in the Pending state. Instead of killing the pending pod, the master keeps restarting the other pods, which results in the restarted pods failing again with a TimeoutError.
The logs show messages like 'TimeoutError: Timeout 5400s to complete rendezvous'.
Expected Behavior:
When a worker pod fails to start and gets stuck in the Pending state, the master should kill the pending pod instead of continually restarting the other pods.
Steps to Reproduce:
Launch the worker pods, let one fail and get stuck in the Pending state, and observe the master repeatedly restarting the other pods.
It is essential for the master to handle the failure of worker pods effectively to prevent cascading failures for other pods.
It is not a good idea to directly delete pending Pods. If a Pod is pending because of insufficient resources such as GPU/CPU/memory, the relaunched Pod will pend again. You can set the min_nodes in the dlrover-run command, dlrover-run --nnodes={MIN_NODES}:{MAX_NODES}, to start training even if there are pending Pods. Usually, max_nodes can be the number of worker replicas, which can be acquired from the env NODE_NUM in the worker Pod. So you can use the command dlrover-run --nnodes=$((NODE_NUM - 1)):${NODE_NUM}.
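A minimal sketch of the suggestion above, assuming NODE_NUM is injected into the worker Pod's environment as described (here it is set manually for illustration, and train.py is a hypothetical training script):

```shell
#!/bin/sh
# NODE_NUM would normally be provided in the worker Pod's env; set here for demonstration.
NODE_NUM=4

# Tolerate one pending Pod: train with between NODE_NUM-1 and NODE_NUM nodes.
MIN_NODES=$((NODE_NUM - 1))

# Print the elastic dlrover-run command that would be executed.
echo "dlrover-run --nnodes=${MIN_NODES}:${NODE_NUM} train.py"
```

With NODE_NUM=4 this yields --nnodes=3:4, so rendezvous can complete with three nodes while the fourth Pod is still pending.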
This issue has been automatically marked as stale because it has not had recent activity.
This issue is being automatically closed due to inactivity.