ecosystem
k8s: how long is the training process?
I ran distributed MNIST on k8s with 1 ps and 3 workers. After an hour, the status of the pods is:
NAME                               READY   STATUS    RESTARTS   AGE
distributed-mnist-ps-0-fz4gw       1/1     Running   0          1h
distributed-mnist-worker-0-l4nv5   1/1     Running   0          1h
distributed-mnist-worker-1-8j8d7   1/1     Running   0          1h
distributed-mnist-worker-2-0rjbw   1/1     Running   0          1h
It has been training for 1 hour. How long does the training process take? Thanks.
@Xingskcs These pods do take time to run. To learn more about a pod, run
$ kubectl describe pod distributed-mnist-ps-0-fz4gw | more
to see the full description of that pod. Substitute the name of any other pod to inspect it.
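For anyone else checking on a long-running job, here is a minimal sketch that filters the `kubectl get pods` listing down to pods that are no longer Running or that have restarted. `summarize_pods` is a hypothetical helper name, not part of kubectl, and it assumes the default five-column listing shown in the question. If the training script prints its step or loss, `kubectl logs` on a worker would likely be the more direct way to see whether training is still progressing.

```shell
# Summarize `kubectl get pods` output read from stdin: print each pod whose
# STATUS is not Running or whose RESTARTS count is above zero.
# summarize_pods is a hypothetical helper name, not part of kubectl.
summarize_pods() {
  awk 'NR > 1 && ($3 != "Running" || $4 > 0) {
         printf "%s status=%s restarts=%s\n", $1, $3, $4
       }'
}

# Usage against a live cluster (requires kubectl access):
#   kubectl get pods | summarize_pods
#   kubectl logs --tail=20 distributed-mnist-worker-0-l4nv5
```

No output means every listed pod is still Running with zero restarts, which is the case for the listing in the question.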
@Xingskcs I'm having the same problem. Mine ran for more than 2 hours with 1 ps and 2 workers locally using minikube, but I'm not sure whether it has just stalled. The CPU is in use the whole time, though. Did you get any further with the training?