Use tf.train.Server directly to test port availability

Open jdlesage opened this issue 7 years ago • 3 comments

Instead of testing ports by opening a socket, launch a tf.train.Server directly, which performs the same check. This avoids having to reconnect to the socket (and all the bugs related to that...)

That's the solution used by dask.tensorflow to create a TensorFlow cluster.
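For context, here is a minimal sketch of the socket-based probing this issue proposes to replace. It is not tf-yarn's actual code: the helper name `reserve_free_port` and the choice of binding to port 0 are assumptions made for illustration. It shows where the problem comes from: the port is only held while the probe socket is open, so there is a window between closing it and starting the real server.

```python
import socket


def reserve_free_port(host="127.0.0.1"):
    """Hypothetical helper: bind to port 0 so the OS picks a free port.

    Returns the still-open socket together with the port it was given.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))
    return sock, sock.getsockname()[1]


sock, port = reserve_free_port()
# The port is only reserved for as long as `sock` stays open.
sock.close()
# From here until the real server binds to `port`, any other process
# may grab it. Starting the server directly removes this window.
```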

jdlesage avatar Oct 13 '18 07:10 jdlesage

I have been considering this, but I am afraid it is not straightforward:

  • dask-tensorflow uses a hardcoded port range [2222, ....) and assumes that all of the ports are free. If this is not the case, it just crashes. A simple fix would be to add a try-except and a while loop. However, for each failed attempt tf.train.Server would emit a message on stderr, confusing the user.
  • An alternative to enumerating a hardcoded range of ports is to bind the server to port 0, but I am not sure this is possible with tf.train.Server.
  • I am also not sure if the cluster spec can be altered after the server has been created (this is needed for the current acquire-broadcast-start scheme).
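The "try-except and a while loop" fix from the first bullet could look roughly like the sketch below. Everything here is hypothetical: `start_on_first_free_port`, its parameters, and the `socket_server` stand-in are illustration names, and a plain listening socket is used in place of tf.train.Server so the example runs without TensorFlow. Which exception tf.train.Server raises on a busy port is not verified here, so the exception types to catch are left as a parameter.

```python
import socket


def start_on_first_free_port(start_server, first_port=2222,
                             max_attempts=100, errors=(OSError,)):
    """Try consecutive ports until `start_server(port)` succeeds.

    `start_server` is any callable that raises one of `errors` when
    the port is already taken. Returns (server, port).
    """
    port = first_port
    while port < first_port + max_attempts:
        try:
            return start_server(port), port
        except errors:
            port += 1  # port busy, move on to the next one
    raise RuntimeError("no free port found in the scanned range")


def socket_server(port):
    """Stand-in for a real server: a listening TCP socket."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind(("127.0.0.1", port))
        sock.listen()
    except OSError:
        sock.close()
        raise
    return sock


first, first_port = start_on_first_free_port(socket_server)
# The first server still holds its port, so the second call must skip it.
second, second_port = start_on_first_free_port(socket_server)
```

Note that this sketch does not address the other two concerns: the stderr noise from failed attempts, and whether the cluster spec can still be changed once the server is running.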

superbobry avatar Oct 13 '18 18:10 superbobry

Related TensorFlow issue created by @superbobry: https://github.com/tensorflow/tensorflow/issues/21492

fhoering avatar Nov 29 '18 14:11 fhoering

Discussed again in https://github.com/tensorflow/tensorflow/issues/35383

fhoering avatar Apr 03 '20 11:04 fhoering