k8s how long is the training process?

Open Xingskcs opened this issue 7 years ago • 2 comments

I ran distributed MNIST on k8s with 1 ps and 3 workers. After an hour, the status of the pods is:

 NAME                              READY  STATUS   RESTARTS  AGE
 distributed-mnist-ps-0-fz4gw      1/1    Running  0         1h
 distributed-mnist-worker-0-l4nv5  1/1    Running  0         1h
 distributed-mnist-worker-1-8j8d7  1/1    Running  0         1h
 distributed-mnist-worker-2-0rjbw  1/1    Running  0         1h

It has been training for 1 hour. How long does the training process take? Thanks.

Xingskcs avatar Sep 25 '17 10:09 Xingskcs

@Xingskcs These pods take time to run. To learn more about a pod, run

 $ kubectl describe pod distributed-mnist-ps-0-fz4gw | more

to see the full description of any pod, using its pod name.
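The pod description reports scheduling and container state, not training progress. To see how far training has got, the worker logs are usually more informative. A minimal sketch, assuming the worker pod names contain `-worker-` as in the listing above and that the training script prints its progress to stdout (the helper name `tail_worker_logs` is hypothetical, not part of kubectl):

```shell
# Hypothetical helper: print the last N log lines of every worker pod
# in the current namespace (N defaults to 20).
tail_worker_logs() {
  lines="${1:-20}"
  # `kubectl get pods -o name` prints one "pod/<name>" entry per line.
  kubectl get pods -o name |
    grep -- '-worker-' |
    while read -r pod; do
      echo "== $pod =="
      # `kubectl logs` accepts the "pod/<name>" form directly.
      kubectl logs --tail="$lines" "$pod"
    done
}
```

Calling `tail_worker_logs 50` a few minutes apart shows whether the logs are still advancing. What appears there depends entirely on what the training script prints.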

amitkumarj441 avatar Oct 04 '17 14:10 amitkumarj441

@Xingskcs I'm having the same problem. Mine has been running for more than 2 hours with 1 ps and 2 workers locally using minikube, but I'm not sure whether it has just stalled. The CPU is being used the whole time, though. Did you get any further with the training?

iprocha avatar May 30 '18 14:05 iprocha