
How can the parameter server stop itself?

Open YongCHN opened this issue 8 years ago • 5 comments

Since the parameter server part of our code calls server.join(), the parameter server cannot stop itself when training finishes; we have to kill the process. Do you have any other suggestions?

YongCHN avatar Nov 01 '16 08:11 YongCHN

Yes, unfortunately this is a known issue. There's no way to automatically stop parameter servers when the job is done at the moment. I'll ping back on this issue when we have a better solution, but it's not high-priority at the moment.

jhseu avatar Nov 01 '16 21:11 jhseu

It is safe to kill the ps process after your training is done (and your checkpoint is saved as well). Do you have a specific concern?

yuefengz avatar Nov 01 '16 23:11 yuefengz

There's a work-around in https://github.com/tensorflow/tensorflow/issues/4713#issuecomment-269499287
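
As far as I recall, the work-around there replaces the blocking server.join() with a wait on a shared queue: each worker enqueues a token when it finishes, and the ps exits once it has dequeued one token per worker. Below is a minimal sketch of that pattern only. It uses Python's standard-library threads and queue.Queue as stand-ins, so it is not TensorFlow code; in a real job the queue would presumably be a shared TensorFlow queue placed on the ps device. The names run_ps, run_worker and run_cluster are hypothetical.

```python
import queue
import threading


def run_ps(done_queue, num_workers, log):
    """Stand-in for the ps task: wait for one 'done' token per worker."""
    for _ in range(num_workers):
        # Blocks until some worker signals completion (instead of join()).
        worker_id = done_queue.get()
        log.append("ps saw worker %d finish" % worker_id)
    log.append("ps exiting")


def run_worker(done_queue, worker_id):
    """Stand-in for a worker task: train, then signal completion."""
    # Training steps and checkpoint saving would happen here.
    done_queue.put(worker_id)


def run_cluster(num_workers=3):
    """Start one ps thread and num_workers worker threads (illustration only)."""
    done_queue = queue.Queue()
    log = []
    ps = threading.Thread(target=run_ps, args=(done_queue, num_workers, log))
    ps.start()
    workers = [
        threading.Thread(target=run_worker, args=(done_queue, i))
        for i in range(num_workers)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    # The ps returns on its own once every worker has signalled.
    ps.join(timeout=5)
    return log, ps.is_alive()


if __name__ == "__main__":
    log, still_running = run_cluster(3)
    print("\n".join(log))
    print("ps still running:", still_running)
```

The point of the design is that the ps process returns normally, so a batch scheduler sees a clean exit instead of a killed process.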

yaroslavvb avatar Jan 04 '17 18:01 yaroslavvb

Stopping the ps server gracefully is a requirement when running distributed training as a Kubernetes batch job. Can anyone write a detailed demo based on mnist_replica?

hustcat avatar Feb 17 '17 09:02 hustcat

I wrote a demo for MNIST_data, and it seems to run OK. See dist_fifo.

hustcat avatar Feb 20 '17 11:02 hustcat