dikeke

Results 7 comments of dikeke

Is arm64 supported now?

> I have an idea to solve it. But it will waste compute capacity if we keep the resources free and don't let low-priority jobs with small requests run. > >...

Any progress on this issue?

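Assume the cluster has 4 units of available CPU. Create two pods that occupy 2 units of CPU. Create a torch-mnist job with min_node=2, node-unit=2, max_node=$NODE_NUM, and NODE_NUM=4. Each node requires 1 unit of CPU. 3.1 Keep the resource situation unchanged until training ends (requires modifying the code): the job has two workers in the running state and the other two workers in the pending state. The two running workers form a rendezvous and finish training, and their status changes to complete. The other two pending workers then obtain resources, switch to running, and continue training, but they report errors and the training never completes.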
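
The resource arithmetic in the scenario above can be sketched roughly as follows. This is only a toy model, not the actual scheduler or training operator: the function `schedule_workers` and its parameters are hypothetical names made up for illustration, and it assumes CPU is the only constrained resource.

```python
def schedule_workers(total_cpu, occupied_cpu, requested_workers, cpu_per_worker=1):
    """Toy model: return (running, pending) worker counts for a CPU-only cluster."""
    free_cpu = total_cpu - occupied_cpu
    # Only as many workers can start as there are free CPU units to hold them.
    running = min(requested_workers, free_cpu // cpu_per_worker)
    pending = requested_workers - running
    return running, pending


# Phase 1: 4 CPU units, 2 held by the two extra pods, job asks for max_node=4 workers.
min_node = 2
running, pending = schedule_workers(total_cpu=4, occupied_cpu=2, requested_workers=4)
can_rendezvous = running >= min_node
print(running, pending, can_rendezvous)  # 2 2 True

# Phase 2: the first two workers complete and free their CPU,
# so the two previously pending workers can now start.
late_running, still_pending = schedule_workers(
    total_cpu=4, occupied_cpu=2, requested_workers=pending
)
print(late_running, still_pending)  # 2 0
```

Under these assumptions the late workers start only after the first rendezvous has already finished training, which matches the point at which the errors are reported.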