IMPALA-Distributed-Tensorflow
Running the script shown in start.sh fails
My working environment is as follows:

- GTX 1650 ×1
- Core i7-9700 ×1
- tensorflow==1.14.0
- gym[atari]
- numpy
- tensorboardX
- opencv-python
- Windows 10

I ran the script shown in start.sh, but it failed with the following output:

2020-09-22 09:24:12.969790: I tensorflow/core/distributed_runtime/master.cc:268] CreateSession still waiting for response from worker: /job:actor/replica:0/task:1
2020-09-22 09:24:12.972181: I tensorflow/core/distributed_runtime/master.cc:268] CreateSession still waiting for response from worker: /job:actor/replica:0/task:2
2020-09-22 09:24:12.975636: I tensorflow/core/distributed_runtime/master.cc:268] CreateSession still waiting for response from worker: /job:actor/replica:0/task:3
2020-09-22 09:24:12.977599: I tensorflow/core/distributed_runtime/master.cc:268] CreateSession still waiting for response from worker: /job:actor/replica:0/task:4
2020-09-22 09:24:12.979766: I tensorflow/core/distributed_runtime/master.cc:268] CreateSession still waiting for response from worker: /job:actor/replica:0/task:5
2020-09-22 09:24:12.981744: I tensorflow/core/distributed_runtime/master.cc:268] CreateSession still waiting for response from worker: /job:actor/replica:0/task:6
2020-09-22 09:24:12.983991: I tensorflow/core/distributed_runtime/master.cc:268] CreateSession still waiting for response from worker: /job:actor/replica:0/task:7
2020-09-22 09:24:12.986332: I tensorflow/core/distributed_runtime/master.cc:268] CreateSession still waiting for response from worker: /job:actor/replica:0/task:8
2020-09-22 09:24:12.988586: I tensorflow/core/distributed_runtime/master.cc:268] CreateSession still waiting for response from worker: /job:actor/replica:0/task:9
2020-09-22 09:24:12.992850: I tensorflow/core/distributed_runtime/master.cc:268] CreateSession still waiting for response from worker: /job:actor/replica:0/task:10
2020-09-22 09:24:12.995346: I tensorflow/core/distributed_runtime/master.cc:268] CreateSession still waiting for response from worker: /job:actor/replica:0/task:11
2020-09-22 09:24:12.997290: I tensorflow/core/distributed_runtime/master.cc:268] CreateSession still waiting for response from worker: /job:actor/replica:0/task:12
2020-09-22 09:24:12.999624: I tensorflow/core/distributed_runtime/master.cc:268] CreateSession still waiting for response from worker: /job:actor/replica:0/task:13
2020-09-22 09:24:13.002224: I tensorflow/core/distributed_runtime/master.cc:268] CreateSession still waiting for response from worker: /job:actor/replica:0/task:14
2020-09-22 09:24:13.005818: I tensorflow/core/distributed_runtime/master.cc:268] CreateSession still waiting for response from worker: /job:actor/replica:0/task:15
2020-09-22 09:24:13.007926: I tensorflow/core/distributed_runtime/master.cc:268] CreateSession still waiting for response from worker: /job:actor/replica:0/task:16
2020-09-22 09:24:13.013232: I tensorflow/core/distributed_runtime/master.cc:268] CreateSession still waiting for response from worker: /job:actor/replica:0/task:17
2020-09-22 09:24:13.015965: I tensorflow/core/distributed_runtime/master.cc:268] CreateSession still waiting for response from worker: /job:actor/replica:0/task:18
2020-09-22 09:24:13.018205: I tensorflow/core/distributed_runtime/master.cc:268] CreateSession still waiting for response from worker: /job:actor/replica:0/task:19
2020-09-22 09:24:13.020086: I tensorflow/core/distributed_runtime/master.cc:268] CreateSession still waiting for response from worker: /job:actor/replica:0/task:20
2020-09-22 09:24:13.022604: I tensorflow/core/distributed_runtime/master.cc:268] CreateSession still waiting for response from worker: /job:actor/replica:0/task:21
2020-09-22 09:24:13.026224: I tensorflow/core/distributed_runtime/master.cc:268] CreateSession still waiting for response from worker: /job:actor/replica:0/task:22
2020-09-22 09:24:13.028211: I tensorflow/core/distributed_runtime/master.cc:268] CreateSession still waiting for response from worker: /job:actor/replica:0/task:23
2020-09-22 09:24:13.030634: I tensorflow/core/distributed_runtime/master.cc:268] CreateSession still waiting for response from worker: /job:actor/replica:0/task:24
2020-09-22 09:24:13.032666: I tensorflow/core/distributed_runtime/master.cc:268] CreateSession still waiting for response from worker: /job:actor/replica:0/task:25
2020-09-22 09:24:13.035993: I tensorflow/core/distributed_runtime/master.cc:268] CreateSession still waiting for response from worker: /job:actor/replica:0/task:26
2020-09-22 09:24:13.037987: I tensorflow/core/distributed_runtime/master.cc:268] CreateSession still waiting for response from worker: /job:actor/replica:0/task:27
2020-09-22 09:24:13.040183: I tensorflow/core/distributed_runtime/master.cc:268] CreateSession still waiting for response from worker: /job:actor/replica:0/task:28
2020-09-22 09:24:13.042793: I tensorflow/core/distributed_runtime/master.cc:268] CreateSession still waiting for response from worker: /job:actor/replica:0/task:29
2020-09-22 09:24:13.044992: I tensorflow/core/distributed_runtime/master.cc:268] CreateSession still waiting for response from worker: /job:actor/replica:0/task:30
2020-09-22 09:24:13.046805: I tensorflow/core/distributed_runtime/master.cc:268] CreateSession still waiting for response from worker: /job:actor/replica:0/task:31
I think you may have run the command like this:
sh start.sh
But I believe you have to run it like this:
nohup sh start.sh
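
For context, the hang itself is standard TF 1.x distributed behavior rather than a crash: the learner cannot finish CreateSession until every task listed in the ClusterSpec has a live tf.train.Server answering on its port, so a single missing or killed actor process stalls everything. Here is a minimal sketch of that pattern; the hosts, ports, and the "learner" job name are my assumptions, not the actual values from trainer_invader.py.

import tensorflow as tf

NUM_ACTORS = 32

# Hosts and ports are illustrative assumptions; the real ClusterSpec is
# built inside trainer_invader.py.
cluster = tf.train.ClusterSpec({
    "learner": ["localhost:8000"],
    "actor":   ["localhost:%d" % (8001 + i) for i in range(NUM_ACTORS)],
})

def run(job_name, task_index):
    # Every one of the 33 processes (1 learner + 32 actors) must create the
    # server for its own (job_name, task_index) slot in the ClusterSpec.
    server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)
    if job_name == "actor":
        server.join()  # serve forever; a slot that is never started leaves
                       # the learner waiting on it indefinitely
    else:
        with tf.Session(server.target) as sess:
            # The first run triggers CreateSession on the master, which
            # blocks -- logging "CreateSession still waiting for response
            # from worker: /job:actor/replica:0/task:N" -- until every actor
            # slot in the ClusterSpec has a live server.
            print(sess.run(tf.constant("all workers responded")))

So if start.sh is interrupted (for example, by closing the shell) before all actor processes are up, the learner will print those waiting messages forever.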
Thank you.
Hello! I also ran into this problem. I ran the code for 24 hours on a GPU server, and the output "CreateSession still waiting for response from worker: /job:actor/replica:0/task:17..." etc. was present the whole time. Does anyone know how to fix this? Thanks for the kind help!
I think you may not be running the 17th actor task. Rerun the 17th task with this command:
python trainer_invader.py --num_actors=32 --task=17 --batch_size=32 --queue_size=128 --trajectory=20 --learning_frame=1000000000 --start_learning=0.0006 --end_learning=0.0 --discount_factor=0.99 --entropy_coef=0.05 --baseline_loss_coef=1.0 --gradient_clip_norm=40.0 --job_name=actor --reward_clipping=abs_one --lstm_size=256 &
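
If you are not sure which slots are missing, a small diagnostic like the one below can help. It is not from this repo: localhost and the base port are assumptions, so adjust them to match whatever ClusterSpec trainer_invader.py actually builds.

import socket

BASE_PORT = 8001   # assumed port of /job:actor/task:0 -- adjust to your setup
NUM_ACTORS = 32

for task in range(NUM_ACTORS):
    host, port = "localhost", BASE_PORT + task
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(1.0)
    try:
        # A successful connect means some server is listening on this slot.
        sock.connect((host, port))
        print("task %2d: listening on %s:%d" % (task, host, port))
    except (socket.timeout, OSError):
        print("task %2d: nothing listening on %s:%d -- relaunch with --task=%d"
              % (task, host, port, task))
    finally:
        sock.close()

Any task reported as not listening is one the learner is still waiting on; relaunch it with the matching --task index as in the command above.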