It's necessary to tolerate network bandwidth exhaustion
I created issue #284 about 4 months ago suggesting that we replace tf.train.Supervisor with tf.train.MonitoredTrainingSession, since the latter restarts the session when it hits an OS Error (a communication error between PS and workers caused by the network bandwidth being exhausted), but I got no response. I tried to do it myself, and found that when the communication overhead of one machine exceeds its network bandwidth, a worker throws _PREEMPTION_ERRORS. MonitoredTrainingSession on the worker captures the error and recreates the session, but the new session then blocks in the initialization stage, as the logs below show:
I0523 11:21:32.438139 140497580463872 tf_logging.py:115] An error was raised. This may be due to a preemption in a connected worker or parameter server. The current session will be closed and a new session will be created. This error may also occur due to a gRPC failure caused by high memory or network bandwidth usage in the parameter servers. If this error occurs repeatedly, try increasing the number of parameter servers assigned to the job. Error: OS Error
I0523 11:21:32.580840 140497580463872 tf_logging.py:115] Graph was finalized.
2019-05-23 11:21:32.631653: I tensorflow/core/distributed_runtime/master_session.cc:1161] Start master session 2d2cf0b67505bc5e with config: intra_op_parallelism_threads: 1 inter_op_parallelism_threads: 18 gpu_options { } allow_soft_placement: true experimental { collective_group_leader: "/job:worker/replica:0/task:0" }
I0523 11:21:32.963238 140497580463872 tf_logging.py:115] Running local_init_op.
I followed the suggestion in the logs above and increased the number of parameter servers assigned to the job, but the error still exists. Hope to get some guidance.
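For concreteness, here is a minimal sketch of the Supervisor-to-MonitoredTrainingSession switch described above (TF 1.x API). The cluster addresses, checkpoint directory, and the toy loss are placeholders, not the actual benchmark code:

```python
import tensorflow as tf

# Placeholder cluster spec; real addresses come from your own deployment.
cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# Variables are placed on the PS tasks, ops on the local worker.
with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    global_step = tf.train.get_or_create_global_step()
    w = tf.Variable(0.0)                      # toy model, for illustration only
    loss = tf.square(w - 1.0)
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
        loss, global_step=global_step)

# Unlike tf.train.Supervisor, MonitoredTrainingSession catches
# _PREEMPTION_ERRORS (such as the "OS Error" in the logs above),
# closes the session, and creates a new one instead of crashing.
with tf.train.MonitoredTrainingSession(
        master=server.target,
        is_chief=True,                        # task_index == 0
        checkpoint_dir="/tmp/train_logs",     # placeholder path
        hooks=[tf.train.StopAtStepHook(last_step=100000)]) as sess:
    while not sess.should_stop():
        sess.run(train_op)
```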
I have encountered a similar problem. Has a fix for this ever come up?
No; maybe there is no good way to tolerate network bandwidth exhaustion.
@konnase are you running inside docker?
Yes, Docker on Kubernetes.
Try running the docker containers with --net=host mode.
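A sketch of what that might look like, assuming the containers are launched directly with docker run (the image name is a placeholder; on Kubernetes the equivalent is setting hostNetwork: true in the pod spec):

```shell
# Host networking bypasses the docker bridge/NAT, removing that overhead
# from the PS<->worker traffic.
docker run --net=host my-tf-benchmark-image  # image name is hypothetical
```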
Thanks, I will try it later.