
Scaled-down allreduce PyTorch job won't complete and reports an error

Open cocodee opened this issue 1 year ago • 1 comments

cocodee avatar Jul 29 '24 07:07 cocodee

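1. Assume the cluster has 4 units of available CPU.
2. Create two pods that together occupy 2 units of CPU.
3. Create a torch-mnist job with min_node=2, node-unit=2, max_node=$NODE_NUM, and NODE_NUM=4. Each node needs 1 unit of CPU.

3.1 Keep the resource situation unchanged until training ends (modify the code).

The job has two workers in the running state and the other two in the pending state. The two running workers form a rendezvous, complete training, and transition to complete. The other two pending workers then obtain resources, transition to running, and continue training, but they report an error and the training never completes.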
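The resource arithmetic behind this scenario can be checked with a small toy model. This is only a sketch under stated assumptions: `schedule` and `rendezvous_size` are hypothetical helpers written for illustration, not DLRover or PyTorch APIs, and the rule that a rendezvous is rounded down to a multiple of node-unit is an assumption about how node-unit is meant to work.

```python
# Toy model of the scenario in this issue. Hypothetical helpers only;
# this does not reproduce DLRover's actual scheduling or rendezvous logic.

def schedule(free_cpu, waiting_workers, cpu_per_worker=1):
    """Place as many waiting workers as fit; return (placed, still_waiting)."""
    placed = min(waiting_workers, free_cpu // cpu_per_worker)
    return placed, waiting_workers - placed

def rendezvous_size(running, min_node, max_node, node_unit):
    """Assumed rule: largest multiple of node_unit that fits, or 0 if below min_node."""
    size = min(running, max_node) // node_unit * node_unit
    return size if size >= min_node else 0

TOTAL_CPU = 4        # step 1: cluster capacity
OTHER_PODS_CPU = 2   # step 2: CPU held by the two unrelated pods
NODE_NUM = 4         # step 3: workers requested by the job
MIN_NODE, NODE_UNIT, MAX_NODE = 2, 2, NODE_NUM

# First round: only 2 CPU units are free, so 2 workers run and 2 stay pending.
running, pending = schedule(TOTAL_CPU - OTHER_PODS_CPU, NODE_NUM)
first_round = rendezvous_size(running, MIN_NODE, MAX_NODE, NODE_UNIT)

# The first two workers complete and release their CPU; the pending ones start.
late_running, late_pending = schedule(TOTAL_CPU - OTHER_PODS_CPU, pending)
second_round = rendezvous_size(late_running, MIN_NODE, MAX_NODE, NODE_UNIT)

print(running, pending, first_round)            # 2 2 2
print(late_running, late_pending, second_round) # 2 0 2
```

Under these assumptions the two late workers are numerous enough to satisfy min_node on their own, so the model does not explain the error. It only confirms that the reported sequence (two running, two pending, then the pending pair starting after the first pair completes) follows from the stated resource limits.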

cocodee avatar Aug 02 '24 01:08 cocodee

This issue has been automatically marked as stale because it has not had recent activity.

github-actions[bot] avatar Oct 31 '24 01:10 github-actions[bot]

This issue is being automatically closed due to inactivity.

github-actions[bot] avatar Nov 08 '24 01:11 github-actions[bot]