
It's necessary to tolerate the network bandwidth being used up

konnase opened this issue 6 years ago · 6 comments

I created issue #284 about 4 months ago suggesting that we replace tf.train.Supervisor with tf.train.MonitoredTrainingSession, since the latter restarts the session when it hits an OS Error (for example, a communication error between PS and workers caused by the network bandwidth being used up), but I got no response. I tried it myself, and I found that when the communication volume of one machine exceeds its network bandwidth, a worker throws one of the _PREEMPTION_ERRORS. MonitoredTrainingSession on the worker captures the error and restarts the session, but then it blocks in the initialization stage:

```
I0523 11:21:32.438139 140497580463872 tf_logging.py:115] An error was raised. This may be due to a preemption in a connected worker or parameter server. The current session will be closed and a new session will be created. This error may also occur due to a gRPC failure caused by high memory or network bandwidth usage in the parameter servers. If this error occurs repeatedly, try increasing the number of parameter servers assigned to the job. Error: OS Error
I0523 11:21:32.580840 140497580463872 tf_logging.py:115] Graph was finalized.
2019-05-23 11:21:32.631653: I tensorflow/core/distributed_runtime/master_session.cc:1161] Start master session 2d2cf0b67505bc5e with config: intra_op_parallelism_threads: 1 inter_op_parallelism_threads: 18 gpu_options { } allow_soft_placement: true experimental { collective_group_leader: "/job:worker/replica:0/task:0" }
I0523 11:21:32.963238 140497580463872 tf_logging.py:115] Running local_init_op.
```
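
For context, here is a minimal sketch (TF 1.x APIs) of the MonitoredTrainingSession setup described above; the cluster addresses, task index, toy model, and checkpoint path are placeholders, not the reporter's actual configuration:

```python
import tensorflow as tf  # TF 1.x

# Placeholder cluster; real host:port pairs come from the job config.
cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})
task_index = 0  # this worker's index, normally parsed from flags
server = tf.train.Server(cluster, job_name="worker", task_index=task_index)

# replica_device_setter places variables on the PS tasks and ops on the worker.
with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    global_step = tf.train.get_or_create_global_step()
    x = tf.Variable(0.0)              # toy model standing in for the real one
    loss = tf.square(x - 3.0)
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
        loss, global_step=global_step)

# Unlike tf.train.Supervisor, MonitoredTrainingSession catches
# _PREEMPTION_ERRORS, closes the session, and creates a new one --
# which is the restart visible in the logs above.
with tf.train.MonitoredTrainingSession(
        master=server.target,
        is_chief=(task_index == 0),
        checkpoint_dir="/tmp/train_logs") as sess:  # placeholder path
    while not sess.should_stop():
        sess.run(train_op)
```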

I followed the instruction in the logs above and increased the number of parameter servers assigned to the job, but the error still exists. I hope to get some guidance.

konnase avatar May 24 '19 14:05 konnase

I have encountered a similar problem. Has a fix for this ever come up?

aaron276h avatar Jul 26 '19 18:07 aaron276h

No. Maybe there is no good solution for tolerating the network bandwidth being used up.

konnase avatar Jul 31 '19 15:07 konnase

@konnase are you running inside docker?

aaron276h avatar Aug 02 '19 23:08 aaron276h

> @konnase are you running inside docker?

Yes, docker on Kubernetes

konnase avatar Aug 05 '19 01:08 konnase

Try running the docker containers with --net=host mode.

aaron276h avatar Aug 05 '19 01:08 aaron276h
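
For anyone landing here: with plain Docker, that flag looks like the command below, and since konnase's containers run on Kubernetes, the closest equivalent there is `hostNetwork: true` in the pod spec. The image name is a placeholder, not something from this thread:

```
# Plain Docker: share the host's network stack instead of the bridge network
docker run --net=host my-tf-worker-image   # placeholder image name

# Kubernetes equivalent, set in the pod spec:
# spec:
#   hostNetwork: true
```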

> Try running the docker containers with --net=host mode.

Thanks, I will try it later.

konnase avatar Aug 06 '19 05:08 konnase