
OSError on 128 GPUs for distributed_replicated on AWS P3

Open richardliaw opened this issue 6 years ago • 8 comments

Hi,

I'm trying to run a distributed_replicated benchmark with 128 V100s and I'm getting an OSError.

Some more details:

  • Using AWS P3 instances (16 of them)
  • Batch size is 64
  • Running resnet101

Does anyone know how I can get around this issue, or whether there are any obvious mistakes I'm making? The same commands work fine on 8 machines (64 GPUs).
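
(For context, these flags describe a standard TF 1.x parameter-server topology: every machine runs one PS process and one worker process, and each process is given the full cluster plus its own job name and task index. A simplified sketch of what that corresponds to is below; the hosts, ports, and variable names are placeholders for illustration, not the actual tf_cnn_benchmarks code.)

# Illustrative sketch only: how --ps_hosts/--worker_hosts/--job_name/--task_index
# map onto a TF 1.x cluster. Hosts and ports below are placeholders.
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps":     ["10.0.0.1:50000", "10.0.0.2:50000"],   # one PS task per machine
    "worker": ["10.0.0.1:50001", "10.0.0.2:50001"],   # one worker task per machine
})

# Every process starts a gRPC server for its own (job_name, task_index) slot.
job_name, task_index = "worker", 0   # e.g. taken from --job_name / --task_index
server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

if job_name == "ps":
    server.join()   # PS tasks only serve variables
else:
    # Workers place variables on the PS tasks and compute ops locally.
    with tf.device(tf.train.replica_device_setter(cluster=cluster)):
        pass        # model/graph construction and training loop go here

With 16 machines this gives 16 PS tasks and 16 workers (128 GPUs), which is the cluster the master is trying to reach when CreateSession hangs in the log further down.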

I've pasted the commands run below:

##########
('Run the following commands on', '172.31.89.130')
##########
CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=0 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=cpu --job_name=ps

python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=0 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=gpu --job_name=worker

##########
('Run the following commands on', '172.31.92.187')
##########
CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=1 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=cpu --job_name=ps

python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=1 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=gpu --job_name=worker

##########
('Run the following commands on', '172.31.95.87')
##########
CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=2 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=cpu --job_name=ps

python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=2 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=gpu --job_name=worker

##########
('Run the following commands on', '172.31.91.114')
##########
CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=3 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=cpu --job_name=ps

python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=3 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=gpu --job_name=worker

##########
('Run the following commands on', '172.31.81.43')
##########
CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=4 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=cpu --job_name=ps

python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=4 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=gpu --job_name=worker

##########
('Run the following commands on', '172.31.90.229')
##########
CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=5 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=cpu --job_name=ps

python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=5 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=gpu --job_name=worker

##########
('Run the following commands on', '172.31.91.125')
##########
CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=6 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=cpu --job_name=ps

python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=6 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=gpu --job_name=worker

##########
('Run the following commands on', '172.31.85.199')
##########
CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=7 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=cpu --job_name=ps

python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=7 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=gpu --job_name=worker

##########
('Run the following commands on', '172.31.93.20')
##########
CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=8 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=cpu --job_name=ps

python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=8 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=gpu --job_name=worker

##########
('Run the following commands on', '172.31.87.145')
##########
CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=9 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=cpu --job_name=ps

python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=9 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=gpu --job_name=worker

##########
('Run the following commands on', '172.31.93.84')
##########
CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=10 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=cpu --job_name=ps

python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=10 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=gpu --job_name=worker

##########
('Run the following commands on', '172.31.89.237')
##########
CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=11 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=cpu --job_name=ps

python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=11 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=gpu --job_name=worker

##########
('Run the following commands on', '172.31.83.145')
##########
CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=12 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=cpu --job_name=ps

python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=12 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=gpu --job_name=worker

##########
('Run the following commands on', '172.31.82.121')
##########
CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=13 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=cpu --job_name=ps

python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=13 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=gpu --job_name=worker

##########
('Run the following commands on', '172.31.80.160')
##########
CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=14 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=cpu --job_name=ps

python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=14 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=gpu --job_name=worker

##########
('Run the following commands on', '172.31.85.86')
##########
CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=15 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=cpu --job_name=ps

python tf_cnn_benchmarks.py --worker_hosts=172.31.89.130:50001,172.31.92.187:50001,172.31.95.87:50001,172.31.91.114:50001,172.31.81.43:50001,172.31.90.229:50001,172.31.91.125:50001,172.31.85.199:50001,172.31.93.20:50001,172.31.87.145:50001,172.31.93.84:50001,172.31.89.237:50001,172.31.83.145:50001,172.31.82.121:50001,172.31.80.160:50001,172.31.85.86:50001 --num_gpus=8 --ps_hosts=172.31.89.130:50000,172.31.92.187:50000,172.31.95.87:50000,172.31.91.114:50000,172.31.81.43:50000,172.31.90.229:50000,172.31.91.125:50000,172.31.85.199:50000,172.31.93.20:50000,172.31.87.145:50000,172.31.93.84:50000,172.31.89.237:50000,172.31.83.145:50000,172.31.82.121:50000,172.31.80.160:50000,172.31.85.86:50000 --task_index=15 --batch_size=64 --model=resnet101 --variable_update=distributed_replicated --local_parameter_device=gpu --job_name=worker

Here is the stderr of the run (of one of the workers):

/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/matplotlib/__init__.py:962: UserWarning: Duplicate key in file "/home/ubuntu/.config/matplotlib/matplotlibrc", line #2
  (fname, cnt))
/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/matplotlib/__init__.py:962: UserWarning: Duplicate key in file "/home/ubuntu/.config/matplotlib/matplotlibrc", line #3
  (fname, cnt))
2018-04-17 00:27:36.779965: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-04-17 00:27:37.724333: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-04-17 00:27:37.725452: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:1e.0
totalMemory: 15.77GiB freeMemory: 15.36GiB
2018-04-17 00:27:37.949377: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-04-17 00:27:37.950506: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 1 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:1d.0
totalMemory: 15.77GiB freeMemory: 15.36GiB
2018-04-17 00:27:38.200838: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-04-17 00:27:38.202080: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 2 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:1c.0
totalMemory: 15.77GiB freeMemory: 15.36GiB
2018-04-17 00:27:38.410280: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-04-17 00:27:38.411397: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 3 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:1b.0
totalMemory: 15.77GiB freeMemory: 15.36GiB
2018-04-17 00:27:38.633574: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-04-17 00:27:38.634718: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 4 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:1a.0
totalMemory: 15.77GiB freeMemory: 15.36GiB
2018-04-17 00:27:38.833131: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-04-17 00:27:38.834389: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 5 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:19.0
totalMemory: 15.77GiB freeMemory: 15.36GiB
2018-04-17 00:27:39.027737: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-04-17 00:27:39.029552: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 6 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:18.0
totalMemory: 15.77GiB freeMemory: 15.36GiB
2018-04-17 00:27:39.261369: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-04-17 00:27:39.262446: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 7 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:17.0
totalMemory: 15.77GiB freeMemory: 15.36GiB
2018-04-17 00:27:39.262746: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0, 1, 2, 3, 4, 5, 6, 7
2018-04-17 00:27:42.069669: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-04-17 00:27:42.069719: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929]      0 1 2 3 4 5 6 7
2018-04-17 00:27:42.069731: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0:   N Y Y Y Y N N N
2018-04-17 00:27:42.069738: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 1:   Y N Y Y N Y N N
2018-04-17 00:27:42.069745: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 2:   Y Y N Y N N Y N
2018-04-17 00:27:42.069752: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 3:   Y Y Y N N N N Y
2018-04-17 00:27:42.069758: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 4:   Y N N N N Y Y Y
2018-04-17 00:27:42.069766: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 5:   N Y N N Y N Y Y
2018-04-17 00:27:42.069772: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 6:   N N Y N Y Y N Y
2018-04-17 00:27:42.069779: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 7:   N N N Y Y Y Y N
2018-04-17 00:27:42.072586: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:worker/replica:0/task:0/device:GPU:0 with 14867 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1e.0, compute capability: 7.0)
2018-04-17 00:27:42.248215: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:worker/replica:0/task:0/device:GPU:1 with 14867 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1d.0, compute capability: 7.0)
2018-04-17 00:27:42.400232: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:worker/replica:0/task:0/device:GPU:2 with 14867 MB memory) -> physical GPU (device: 2, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1c.0, compute capability: 7.0)
2018-04-17 00:27:42.582713: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:worker/replica:0/task:0/device:GPU:3 with 14867 MB memory) -> physical GPU (device: 3, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1b.0, compute capability: 7.0)
2018-04-17 00:27:42.775726: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:worker/replica:0/task:0/device:GPU:4 with 14867 MB memory) -> physical GPU (device: 4, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1a.0, compute capability: 7.0)
2018-04-17 00:27:42.933943: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:worker/replica:0/task:0/device:GPU:5 with 14867 MB memory) -> physical GPU (device: 5, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:19.0, compute capability: 7.0)
2018-04-17 00:27:43.115514: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:worker/replica:0/task:0/device:GPU:6 with 14867 MB memory) -> physical GPU (device: 6, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:18.0, compute capability: 7.0)
2018-04-17 00:27:43.309956: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:worker/replica:0/task:0/device:GPU:7 with 14867 MB memory) -> physical GPU (device: 7, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:17.0, compute capability: 7.0)
2018-04-17 00:27:43.501367: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> 172.31.89.130:50000, 1 -> 172.31.92.187:50000, 2 -> 172.31.95.87:50000, 3 -> 172.31.91.114:50000, 4 -> 172.31.81.43:50000, 5 -> 172.31.90.229:50000, 6 -> 172.31.91.125:50000, 7 -> 172.31.85.199:50000, 8 -> 172.31.93.20:50000, 9 -> 172.31.87.145:50000, 10 -> 172.31.93.84:50000, 11 -> 172.31.89.237:50000, 12 -> 172.31.83.145:50000, 13 -> 172.31.82.121:50000, 14 -> 172.31.80.160:50000, 15 -> 172.31.85.86:50000}
2018-04-17 00:27:43.501438: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:50001, 1 -> 172.31.92.187:50001, 2 -> 172.31.95.87:50001, 3 -> 172.31.91.114:50001, 4 -> 172.31.81.43:50001, 5 -> 172.31.90.229:50001, 6 -> 172.31.91.125:50001, 7 -> 172.31.85.199:50001, 8 -> 172.31.93.20:50001, 9 -> 172.31.87.145:50001, 10 -> 172.31.93.84:50001, 11 -> 172.31.89.237:50001, 12 -> 172.31.83.145:50001, 13 -> 172.31.82.121:50001, 14 -> 172.31.80.160:50001, 15 -> 172.31.85.86:50001}
2018-04-17 00:27:43.512903: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:333] Started server with target: grpc://localhost:50001
W0417 00:29:08.790456 139800790255360 tf_logging.py:126] From /home/ubuntu/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py:1504: __init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2018-04-17 00:29:19.709271: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unknown: Call dropped by load balancing policy
2018-04-17 00:29:25.709634: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: OS Error
2018-04-17 00:29:26.707462: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: OS Error
2018-04-17 00:29:27.711554: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: OS Error
2018-04-17 00:29:28.707978: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: OS Error
2018-04-17 00:29:29.704912: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: OS Error
2018-04-17 00:29:29.708827: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: OS Error
2018-04-17 00:29:29.711819: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:1
2018-04-17 00:29:29.711866: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:2
2018-04-17 00:29:29.711883: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:3
2018-04-17 00:29:29.711896: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:4
2018-04-17 00:29:29.711912: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:5
2018-04-17 00:29:29.711925: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:6
2018-04-17 00:29:29.711940: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:7
2018-04-17 00:29:29.711956: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:8
2018-04-17 00:29:29.711970: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:10
2018-04-17 00:29:29.711985: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:11
2018-04-17 00:29:29.712000: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:12
2018-04-17 00:29:29.712044: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:13
2018-04-17 00:29:29.712067: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:14
2018-04-17 00:29:29.712081: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:15
2018-04-17 00:29:29.712093: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:2
2018-04-17 00:29:29.712107: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:4
2018-04-17 00:29:29.712119: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:7
2018-04-17 00:29:29.712132: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:8
2018-04-17 00:29:29.712144: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:10
2018-04-17 00:29:29.712157: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:11
2018-04-17 00:29:29.712170: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:12
2018-04-17 00:29:29.712182: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:13
2018-04-17 00:29:29.712194: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:14
2018-04-17 00:29:32.701156: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: OS Error
2018-04-17 00:29:32.704758: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: OS Error
2018-04-17 00:29:34.707827: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: OS Error
2018-04-17 00:29:37.707906: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: OS Error
2018-04-17 00:29:39.708875: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: OS Error
2018-04-17 00:29:39.712416: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:1
2018-04-17 00:29:39.712459: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:2
2018-04-17 00:29:39.712474: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:4
2018-04-17 00:29:39.712484: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:5
2018-04-17 00:29:39.712495: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:6
2018-04-17 00:29:39.712506: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:7
2018-04-17 00:29:39.712517: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:8
2018-04-17 00:29:39.712528: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:11
2018-04-17 00:29:39.712545: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:12
2018-04-17 00:29:39.712573: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:13
2018-04-17 00:29:39.712586: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:14
2018-04-17 00:29:39.712597: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:15
2018-04-17 00:29:39.712610: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:7
2018-04-17 00:29:39.712623: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:10
2018-04-17 00:29:39.712635: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:11
2018-04-17 00:29:39.712695: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:12
2018-04-17 00:29:39.712711: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:13
2018-04-17 00:29:39.712722: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:14
2018-04-17 00:29:41.699911: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: OS Error
2018-04-17 00:29:49.712912: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:1
2018-04-17 00:29:49.712973: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:2
2018-04-17 00:29:49.712988: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:5
2018-04-17 00:29:49.713000: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:6
2018-04-17 00:29:49.713012: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:7
2018-04-17 00:29:49.713023: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:8
2018-04-17 00:29:49.713035: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:11
2018-04-17 00:29:49.713047: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:12
2018-04-17 00:29:49.713059: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:13
2018-04-17 00:29:49.713074: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:14
2018-04-17 00:29:49.713085: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:15
2018-04-17 00:29:49.713099: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:7
2018-04-17 00:29:49.713112: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:10
2018-04-17 00:29:49.713125: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:11
2018-04-17 00:29:49.713137: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:12
2018-04-17 00:29:49.713150: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:13
2018-04-17 00:29:49.713195: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:14
2018-04-17 00:29:59.713392: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:1
2018-04-17 00:29:59.713449: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:2
2018-04-17 00:29:59.713466: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:5
2018-04-17 00:29:59.713479: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:6
2018-04-17 00:29:59.713491: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:7
2018-04-17 00:29:59.713503: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:8
2018-04-17 00:29:59.713515: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:11
2018-04-17 00:29:59.713527: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:12
2018-04-17 00:29:59.713544: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:13
2018-04-17 00:29:59.713556: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:14
2018-04-17 00:29:59.713570: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:15
2018-04-17 00:29:59.713583: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:7
2018-04-17 00:29:59.713597: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:10
2018-04-17 00:29:59.713610: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:11
2018-04-17 00:29:59.713623: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:12
2018-04-17 00:29:59.713636: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:13
2018-04-17 00:29:59.713650: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:14
2018-04-17 00:30:05.709934: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: OS Error
2018-04-17 00:30:09.713870: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:1
2018-04-17 00:30:09.713939: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:2
2018-04-17 00:30:09.713960: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:5
2018-04-17 00:30:09.713972: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:6
2018-04-17 00:30:09.714000: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:7
2018-04-17 00:30:09.714015: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:8
2018-04-17 00:30:09.714027: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:11
2018-04-17 00:30:09.714060: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:12
2018-04-17 00:30:09.714074: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:13
2018-04-17 00:30:09.714087: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:14
2018-04-17 00:30:09.714099: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:15
2018-04-17 00:30:09.714113: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:7
2018-04-17 00:30:09.714129: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:10
2018-04-17 00:30:09.714141: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:12
2018-04-17 00:30:09.714156: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:13
2018-04-17 00:30:09.714168: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:14
2018-04-17 00:30:19.714395: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:1
2018-04-17 00:30:19.714461: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:2
2018-04-17 00:30:19.714477: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:5
2018-04-17 00:30:19.714490: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:6
2018-04-17 00:30:19.714502: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:7
2018-04-17 00:30:19.714515: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:8
2018-04-17 00:30:19.714528: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:11
2018-04-17 00:30:19.714540: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:12
2018-04-17 00:30:19.714555: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:13
2018-04-17 00:30:19.714568: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:14
2018-04-17 00:30:19.714582: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:15
2018-04-17 00:30:19.714595: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:7
2018-04-17 00:30:19.714620: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:10
2018-04-17 00:30:19.714634: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:12
2018-04-17 00:30:19.714649: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:13
2018-04-17 00:30:19.714662: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:14

richardliaw avatar Apr 17 '18 00:04 richardliaw

I have hit this problem too. Has anyone resolved it?

burness avatar May 11 '18 14:05 burness

I would suggest using Uber's Horovod. We are building an easy multi-node solution into TensorFlow that includes a simple all-reduce setup, but it is not ready yet. I did most of the distributed testing for tf_cnn_benchmarks and have not run it in a long time. Even NVIDIA is using Horovod (albeit modified) around TensorFlow for very large-scale runs.
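
For reference, the Horovod pattern being recommended replaces the ps/worker jobs with one MPI-launched process per GPU; a minimal TF 1.x sketch (illustrative only, with a toy model and placeholder hyperparameters, not code from this repository) looks roughly like this:

# Minimal, illustrative Horovod + TF 1.x sketch; the model, learning rate, and
# step count are placeholders, not tf_cnn_benchmarks code.
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Pin each process to a single local GPU.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Toy stand-in model so the sketch is self-contained.
x = tf.random_normal([64, 10])
w = tf.get_variable("w", [10, 1])
loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))

# Scale the learning rate by the number of workers and wrap the optimizer so
# gradients are averaged with all-reduce across all ranks.
opt = tf.train.MomentumOptimizer(0.01 * hvd.size(), momentum=0.9)
opt = hvd.DistributedOptimizer(opt)
global_step = tf.train.get_or_create_global_step()
train_op = opt.minimize(loss, global_step=global_step)

hooks = [
    hvd.BroadcastGlobalVariablesHook(0),     # sync initial weights from rank 0
    tf.train.StopAtStepHook(last_step=100),
]

with tf.train.MonitoredTrainingSession(hooks=hooks, config=config) as sess:
    while not sess.should_stop():
        sess.run(train_op)

It would be launched with something along the lines of mpirun -np 128 -H host1:8,...,host16:8 python train.py, so there is no separate PS job to keep alive.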

tfboyd avatar May 11 '18 14:05 tfboyd

@tfboyd I would still really like to know what causes the master init error.

burness avatar May 11 '18 15:05 burness

I am not testing this path right now, and I am not aware of anyone who would debug it. Since it is not a path we are currently following, someone would have to make a guess or remember a similar error from the past; I do not recall seeing one when I was testing.

I suggest Horovod because it has the critical mass for getting questions answered. I doubt anyone is still using distributed_replicated; I am just trying to send you down a smoother path.

tfboyd avatar May 11 '18 15:05 tfboyd

Any idea when the TF solution will be ready? (i.e., 3 months or so?)

richardliaw avatar May 11 '18 16:05 richardliaw

@richardliaw It seems to be caused by gRPC. I changed TensorFlow 1.6.0 to TensorFlow 1.5.1, and it now seems to run successfully.

burness avatar May 11 '18 16:05 burness

It is in progress; I expect an early edition in tf.contrib near the end of this quarter (Q2), maybe pushing into Q3. Giving dates is always dangerous, but there is a big push and people have been assigned. We took a step back to put more focus on making local multi-GPU easier to use, e.g. take your model, pass it to Estimator, and done. Multi-node should end up working the same way, although you will still need to coordinate the nodes, which we believe will be done with Kubernetes. This is all off the cuff and we are moving quickly.

tfboyd avatar May 11 '18 16:05 tfboyd

This should not be broken, but I do not have time to look into it. I'm pretty sure Horovod only supports a single node.

@richardliaw, does this work when only 1 GPU is used per worker, or if the CPU is used instead of the GPU? If so, it would be a lot cheaper and easier to reproduce.

reedwm avatar May 14 '18 23:05 reedwm