tf-yarn
Use tf.train.Server directly to test port availability
Instead of testing ports by opening a socket, launch a tf.train.Server directly; it performs the same check. This avoids having to reconnect to the socket (and all the bugs related to that...).
That's the solution dask-tensorflow uses to create a TensorFlow cluster.
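For context, the socket-based check being discussed can be sketched roughly as below. This is a minimal illustration, not tf-yarn's actual code: `reserve_port` and `release_port` are hypothetical names. It shows where the problem comes from: the port is only guaranteed to be free while the probe socket is held, and the socket has to be closed before the server can bind to the same port.

```python
import socket


def reserve_port(host="127.0.0.1"):
    """Bind a socket to port 0 so the OS picks a free port (hypothetical helper)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))
    # getsockname() returns (host, port); the port is the one the OS assigned.
    return sock, sock.getsockname()[1]


def release_port(sock):
    """Close the probe socket so that a server can bind to the port afterwards."""
    sock.close()
```

Between `release_port` and the server start, another process may grab the port, which is the gap that starting the server directly would close.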
I have been considering this, but I am afraid it is not straightforward:
- dask-tensorflow uses a hardcoded port range [2222, ...) and assumes that all of the ports are free. If this is not the case, it would just crash. A simple fix would be to add a try-except and a while loop. However, for each failed attempt `tf.train.Server` would emit a message on stderr, confusing the user.
- An alternative to enumerating a hardcoded range of ports is to bind the server to port 0, but I am not sure it is possible with `tf.train.Server`.
- I am also not sure if the cluster spec can be altered after the server has been created (this is needed for the current acquire-broadcast-start scheme).
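The "try-except and a while loop" fix from the first point could look roughly like the sketch below. It is deliberately independent of TensorFlow: `start_on_first_free_port` and `bind_listener` are hypothetical names, and a plain listening socket stands in for the server. Which exception `tf.train.Server` raises on a busy port is not covered here; the sketch assumes `OSError`, which is what a socket raises.

```python
import socket


def start_on_first_free_port(start_server, first_port=2222, max_attempts=100):
    """Try consecutive ports until start_server(port) succeeds.

    start_server is any callable that raises OSError when the port is busy
    (an assumption made for this sketch).
    """
    last_error = None
    # Never go past the highest valid TCP port.
    for port in range(first_port, min(first_port + max_attempts, 65536)):
        try:
            return port, start_server(port)
        except OSError as exc:
            last_error = exc  # port is busy, try the next one
    raise RuntimeError("no free port found in the scanned range") from last_error


def bind_listener(port, host="127.0.0.1"):
    """Stand-in for a server: a listening socket bound to the given port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen(1)
    except OSError:
        sock.close()
        raise
    return sock
```

Note that this only addresses the crash; it does nothing about the stderr noise on each failed attempt, and the chosen port still has to be broadcast to the other tasks before the cluster spec is complete.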
Related TensorFlow issue created by @superbobry: https://github.com/tensorflow/tensorflow/issues/21492
Discussed again in https://github.com/tensorflow/tensorflow/issues/35383