distributeTensorflowExample

Stuck at "waiting for response"

Open WoNiuHu opened this issue 7 years ago • 12 comments

As shown below, I ran the three commands across two servers, and it stays in this state indefinitely. What could be the cause?

2017-09-28 20:22:28.534732: I tensorflow/core/distributed_runtime/master.cc:209] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
2017-09-28 20:22:28.534903: I tensorflow/core/distributed_runtime/master.cc:209] CreateSession still waiting for response from worker: /job:worker/replica:0/task:1
2017-09-28 20:22:38.535062: I tensorflow/core/distributed_runtime/master.cc:209] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
2017-09-28 20:22:38.535111: I tensorflow/core/distributed_runtime/master.cc:209] CreateSession still waiting for response from worker: /job:worker/replica:0/task:1
2017-09-28 20:22:48.535214: I tensorflow/core/distributed_runtime/master.cc:209] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
2017-09-28 20:22:48.535265: I tensorflow/core/distributed_runtime/master.cc:209] CreateSession still waiting for response from worker: /job:worker/replica:0/task:1

WoNiuHu avatar Sep 28 '17 12:09 WoNiuHu

This means the first ps node (/job:ps/task:0) and the second worker node (/job:worker/task:1) failed to start for some reason.

thewintersun avatar Sep 30 '17 08:09 thewintersun
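
One way to narrow down which tasks never came up is to try a plain TCP connection to every address in the cluster definition. The sketch below is only an illustration: the function name `unreachable_tasks` and the localhost addresses are assumptions, not taken from the repository, and a successful connection only shows that something is listening on that port, not that it is a TensorFlow server.

```python
import socket


def unreachable_tasks(cluster, timeout=2.0):
    """Return the task names whose host:port does not accept a TCP connection.

    `cluster` maps a job name to a list of "host:port" strings, the same
    shape as the dictionary normally passed to tf.train.ClusterSpec.
    """
    failed = []
    for job, addresses in cluster.items():
        for index, address in enumerate(addresses):
            host, port = address.rsplit(":", 1)
            try:
                with socket.create_connection((host, int(port)), timeout=timeout):
                    pass  # connection succeeded, something is listening
            except OSError:
                failed.append("/job:%s/task:%d" % (job, index))
    return failed


if __name__ == "__main__":
    # Hypothetical addresses; replace them with the ones used to start the tasks.
    example_cluster = {
        "ps": ["127.0.0.1:2222"],
        "worker": ["127.0.0.1:2223", "127.0.0.1:2224"],
    }
    print(unreachable_tasks(example_cluster))
```

Running this from the machine that prints the "still waiting" message is the most informative, because it exercises the same network path the master uses.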

Hello, did you ever solve this problem?

weihualiuhupituzi avatar Dec 13 '18 12:12 weihualiuhupituzi

This problem is really nasty: the nodes can't communicate with each other. Does anyone know the cause?

JiayunjieJYJ avatar Jul 03 '19 02:07 JiayunjieJYJ

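Can the machines reach each other's ports? Is any port still occupied by a previous run? Check that first. Also, have you made sure the model structure is identical on every machine?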

thewintersun avatar Jul 03 '19 03:07 thewintersun
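
The "port still occupied by a previous run" case can be checked before starting a task by trying to bind the port locally. This is a minimal sketch under assumptions: `port_is_free` is a hypothetical helper, and port 2222 is only an example value. It does not set SO_REUSEADDR, so a port lingering from a recently killed process may also be reported as busy.

```python
import socket


def port_is_free(port, host=""):
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use": another process holds the port.
            return False
    return True


if __name__ == "__main__":
    # 2222 is an example port, not necessarily the one used in this thread.
    print("port 2222 free:", port_is_free(2222))
```

This only covers the local side; whether another machine can reach the port through a firewall has to be tested from that other machine.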

Hi, in my case it is one ps and two workers on the same machine, and the ports were not in use beforehand.

JiayunjieJYJ avatar Jul 03 '19 03:07 JiayunjieJYJ

That shouldn't happen then. Could it be that you are on a newer TF version and the older approach no longer works?

thewintersun avatar Jul 03 '19 03:07 thewintersun

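The TF versions are 1.10 and 1.12. It works on my own computer, but on the company server (a single machine) it stays stuck at "CreateSession still waiting for response from worker: /job:worker/replica:0/task:1". I also tried device_filter and that didn't help either... sigh.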

JiayunjieJYJ avatar Jul 03 '19 03:07 JiayunjieJYJ
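
For readers unfamiliar with the device filter mentioned above: in TF 1.x, `tf.ConfigProto` accepts a `device_filters` list, and a common pattern is to let each worker see only the ps job and itself so that session creation does not wait on other workers. The helper below is a hypothetical sketch of building that list; how it was actually applied in this thread is not shown, and it does not help when the ps task itself is unreachable.

```python
def device_filters_for(job_name, task_index):
    """Build a device filter list: every ps task plus this task only."""
    return ["/job:ps", "/job:%s/task:%d" % (job_name, task_index)]


if __name__ == "__main__":
    filters = device_filters_for("worker", 1)
    print(filters)
    # With TensorFlow 1.x installed, the list would be passed roughly like this
    # (left as a comment so the sketch runs without TensorFlow):
    #   config = tf.ConfigProto(device_filters=filters)
    #   sess = sv.prepare_or_wait_for_session(server.target, config=config)
```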

Your company's server probably has port restrictions.


thewintersun avatar Jul 03 '19 03:07 thewintersun

That could be it. I'll try in a moment. Another question: training works on my own computer, but after it finishes I get an error:

step: 996000, weight: 2.002096, biase: 10.002196, loss: 0.175542
step: 997000, weight: 2.002465, biase: 10.001983, loss: 0.172635
step: 998000, weight: 2.002988, biase: 10.001656, loss: 0.012112
step: 999000, weight: 2.002266, biase: 10.002346, loss: 0.201377
step: 1000000, weight: 2.002567, biase: 10.002910, loss: 0.007163
ERROR:tensorflow:==================================
Object was never used (type <class 'tensorflow.python.framework.ops.Operation'>):
<tf.Operation 'init' type=NoOp>
If you want to mark it as used call its "mark_used()" method.
It was originally created here:
File "distribute.py2", line 110, in tf.app.run()
File "/home/jiayunjie/.local/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "distribute.py2", line 102, in main
sv.stop()
File "/home/jiayunjie/.local/lib/python3.5/site-packages/tensorflow/python/util/tf_should_use.py", line 193, in wrapped
return _add_should_use_warning(fn(*args, **kwargs))

It looks like the sv.stop() step is what fails. Do you know the cause?

JiayunjieJYJ avatar Jul 03 '19 03:07 JiayunjieJYJ

I don't know about this one either.


thewintersun avatar Jul 03 '19 03:07 thewintersun

Thanks for the reply. If it really can't be made to work, I'll just give up on distributed training...

JiayunjieJYJ avatar Jul 03 '19 04:07 JiayunjieJYJ

Check whether the port is already in use. Switching to a different port fixed it for me.

MrRobotsAA avatar Jun 29 '22 15:06 MrRobotsAA
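
When everything runs on one machine, switching ports can be done by asking the operating system for unused ones instead of guessing. The following is a rough sketch, not the repository's method: `pick_free_ports` and `local_cluster` are hypothetical names, and the resulting addresses must be chosen once and then passed unchanged to every ps and worker process, since all tasks need the same cluster definition.

```python
import socket


def pick_free_ports(count, host="127.0.0.1"):
    """Ask the OS for `count` distinct TCP ports that are unused right now."""
    sockets = []
    try:
        for _ in range(count):
            sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            sock.bind((host, 0))  # port 0 lets the OS choose a free port
            sockets.append(sock)
        # All sockets are still open here, so the chosen ports are distinct.
        return [sock.getsockname()[1] for sock in sockets]
    finally:
        for sock in sockets:
            sock.close()


def local_cluster(num_ps=1, num_workers=2, host="127.0.0.1"):
    """Build a cluster dictionary for a single-machine setup."""
    ports = pick_free_ports(num_ps + num_workers, host)
    addresses = ["%s:%d" % (host, port) for port in ports]
    return {"ps": addresses[:num_ps], "worker": addresses[num_ps:]}


if __name__ == "__main__":
    print(local_cluster())
```

There is a small window between releasing a port here and the task binding it, so another process could still take it in between; on a busy server a fixed, known-free port range may be the safer choice.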