dlrover
Worker pod stuck in Pending state causing TimeoutError and incorrect handling by master
If one of the workers fails to start, for example by getting stuck in the Pending state, the other pods fail with the error below: TimeoutError: Timeout 5400s to complete rendezvous.
In this case, the master should kill the pending pod instead of repeatedly restarting the other pods.
Description:
One of the worker pods fails to start and gets stuck in the Pending state. Instead of killing the pending pod, the master keeps restarting the other pods, which results in the restarted pods failing again with a TimeoutError.
The logs show messages like 'TimeoutError: Timeout 5400s to complete rendezvous'.
Expected Behavior:
When a worker pod fails to start and gets stuck in the Pending state, the master should kill the pending pod instead of continually restarting the other pods.
Steps to Reproduce:
Launch the worker pods, let one fail and get stuck in the Pending state, and observe the master repeatedly restarting the other pods.
It is essential for the master to handle the failure of worker pods effectively to prevent cascading failures for other pods.
It is not a good idea to directly delete pending Pods. If a Pod is pending because of insufficient resources such as GPU/CPU/memory, the relaunched Pod will pend again. You can set the min_nodes in the dlrover-run command, dlrover-run --nnodes={MIN_NODES}:{MAX_NODES}, to start training even if there are pending Pods. Usually, max_nodes can be the number of worker replicas, which can be acquired from the env NODE_NUM in the worker Pod. So you can use the command dlrover-run --nnodes=$((NODE_NUM - 1)):${NODE_NUM}.
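A minimal sketch of the suggestion above, assuming NODE_NUM is injected into the worker Pod's environment as described (here it is set manually for illustration, and train.py is a hypothetical training script):

```shell
#!/bin/sh
# NODE_NUM would normally be provided in the worker Pod's env; set here for demonstration.
NODE_NUM=4

# Tolerate one pending Pod: train with between NODE_NUM-1 and NODE_NUM nodes.
MIN_NODES=$((NODE_NUM - 1))

# Print the elastic dlrover-run command that would be executed.
echo "dlrover-run --nnodes=${MIN_NODES}:${NODE_NUM} train.py"
```

With NODE_NUM=4 this yields --nnodes=3:4, so rendezvous can complete with three nodes while the fourth Pod is still pending.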
This issue has been automatically marked as stale because it has not had recent activity.
This issue is being automatically closed due to inactivity.