How can the parameter server stop itself?
Since we have the following code in the parameter server part:
server.join()
the parameter server cannot stop itself when training finishes unless we kill the process. Do you have any other suggestions?
Yes, unfortunately this is a known issue. At the moment there is no way to stop parameter servers automatically when the job is done. I'll ping back on this issue when we have a better solution, but it's not a high priority right now.
It is safe to kill the ps process after your training is done (and your checkpoint is saved as well). Do you have a specific concern?
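For a single-machine launcher, the kill-after-training approach could look roughly like the sketch below. The two inline commands are hypothetical stand-ins for the real ps and worker scripts: the ps stand-in blocks forever, as `server.join()` does, and the worker stand-in just prints a message and exits.

```python
import subprocess
import sys

# Hypothetical stand-in for the ps script: blocks forever, like server.join().
PS_CMD = [sys.executable, "-c", "import time\nwhile True: time.sleep(1)"]
# Hypothetical stand-in for the worker script: "trains", saves, exits with 0.
WORKER_CMD = [sys.executable, "-c", "print('checkpoint saved')"]


def run_job(ps_cmd, worker_cmd, timeout=10.0):
    """Start the ps, run the worker to completion, then stop the ps."""
    ps = subprocess.Popen(ps_cmd)
    try:
        # Blocks until the worker process has exited.
        worker_rc = subprocess.run(worker_cmd).returncode
    finally:
        ps.terminate()  # polite stop request (SIGTERM on POSIX)
        try:
            ps.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            ps.kill()  # force the stop if the ps ignored the request
            ps.wait()
    return worker_rc, ps.returncode


if __name__ == "__main__":
    print(run_job(PS_CMD, WORKER_CMD))
```

The ps is only terminated after the worker has returned, which matches the condition above: training is done and the checkpoint has been written.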
There's a work-around in https://github.com/tensorflow/tensorflow/issues/4713#issuecomment-269499287
Stopping the ps server gracefully is a requirement when running distributed training as a Kubernetes batch job. Could anyone write a detailed demo based on mnist_replica?
I wrote a demo for MNIST_data, and it seems to run OK. See dist_fifo.
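The name dist_fifo suggests a FIFO-queue approach: instead of blocking forever, the ps blocks on a queue until every worker has signalled that it is finished, then returns. The sketch below only simulates that idea with threads and the standard-library `queue` module; it is not the demo's code and does not use TensorFlow. My assumption is that in TensorFlow 1.x the queue would instead be a `tf.FIFOQueue` placed on the ps device and shared across sessions through its `shared_name` argument.

```python
import queue
import threading


def parameter_server(done_queue, num_workers, log):
    """Wait for one 'done' token per worker, then return instead of blocking forever."""
    for _ in range(num_workers):
        worker_id = done_queue.get()  # blocks until some worker signals completion
        log.append("ps: worker %d done" % worker_id)
    log.append("ps: exiting")


def worker(worker_id, done_queue):
    # Training steps and checkpointing would happen here.
    done_queue.put(worker_id)  # tell the ps this worker is finished


def run_cluster(num_workers=3):
    """Run one simulated ps and num_workers simulated workers as threads."""
    done_queue = queue.Queue()
    log = []
    ps = threading.Thread(target=parameter_server,
                          args=(done_queue, num_workers, log))
    ps.start()
    workers = [threading.Thread(target=worker, args=(i, done_queue))
               for i in range(num_workers)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    ps.join(timeout=5)
    return ps.is_alive(), log


if __name__ == "__main__":
    still_running, log = run_cluster()
    print(still_running, log)
```

Because the ps returns on its own once all tokens have arrived, its process can exit with status 0, which is what a Kubernetes batch job needs in order to be marked complete.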