benchmarks icon indicating copy to clipboard operation
benchmarks copied to clipboard

Distributed Tensorflow fails with "Waiting for model to be ready. Ready_for_local_init_op: Variables not initialized" log msg

Open antoajayraj opened this issue 7 years ago • 7 comments

Tried running distributed Tensorflow between 3 machines, with 1 as PS and the other 2 as workers. The PS and one of the workers start good as expected. However the second worker errors out saying "Waiting for model to be ready. Ready_for_local_init_op: Variables not initialized" and keeps looping. I am using TF 1.5 and the latest benchmarks checkout. Am I missing something here ?

On Machine1:

CUDA_VISIBLE_DEVICES='' python tf_cnn_benchmarks.py --ps_hosts Machine1:20000 --worker_hosts Machine2:20001,Machine3:20001 --batch_size=64 --model=resnet50 --variable_update=parameter_server --local_parameter_device=cpu --job_name=ps --task_index=0 --server_protocol=grpc --data_dir=/data/TF_records/ --data_name=imagenet ... ... 2018-03-05 08:14:50.441827: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:20000 TensorFlow: 1.5 Model: resnet50 Dataset: imagenet Mode: training SingleSess: False Batch size: 128 global 64 per device Num batches: 100 Num epochs: 0.01 Devices: ['/job:worker/task:0/gpu:0'] Data format: NCHW Layout optimizer: False Optimizer: sgd Variables: parameter_server Sync: True

Running parameter server 0 <>

On Machine2:

CUDA_VISIBLE_DEVICES='0,1,2,3' python tf_cnn_benchmarks.py --ps_hosts Machine1:20000 --worker_hosts Machine2:20001,Machine3:20001 --batch_size=64 --model=resnet50 --variable_update=parameter_server --local_parameter_device=cpu --job_name=worker --task_index=0 --server_protocol=grpc --data_dir=/data/TF_records/ --data_name=imagenet ... .... 2018-03-05 08:21:34.146108: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:1 2018-03-05 08:21:44.146177: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:1 2018-03-05 08:21:54.146250: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:1

On Machine3:

CUDA_VISIBLE_DEVICES='0,1,2,3' python tf_cnn_benchmarks.py --ps_hosts Machine1:20000 --worker_hosts Machine2:20001,Machine3:20001 --batch_size=64 --model=resnet50 --variable_update=parameter_server --local_parameter_device=cpu --job_name=worker --task_index=1 --server_protocol=grpc --data_dir=/data/TF_records/ --data_name=imagenet

.... ...

2018-03-05 08:21:05.544111: I tensorflow/core/distributed_runtime/master_session.cc:1017] Start master session 613308a87e96c4e7 with config: intra_op_parallelism_threads: 1 inter_op_parallelism_threads: 158 gpu_options { force_gpu_compatible: true } allow_soft_placement: true I0305 08:21:05.955172 70367372399440 tf_logging.py:110] Waiting for model to be ready. Ready_for_local_init_op: Variables not initialized: global_step, v/cg/conv0/conv2d/kernel, v/cg/conv0/batchnorm0/gamma, v/cg/conv0/batchnorm0/beta, v/cg/conv0/batchnorm0/moving_mean, v/cg/conv0/batchnorm0/moving_variance, v/cg/resnet_v10/conv1/conv2d/kernel, v/cg/resnet_v10/conv1/batchnorm1/gamma, v/cg/resnet_v10/conv1/batchnorm1/beta, v/cg/resnet_v10/conv1/batchnorm1/moving_mean, v/cg/resnet_v10/conv1/batchnorm1/moving_variance, v/cg/resnet_v10/conv2/conv2d/kernel, v/cg/resnet_v10/conv2/batchnorm2/gamma, v/cg/resnet_v10/conv2/batchnorm2/beta, v/cg/resnet_v10/conv2/batchnorm2/moving_mean, v/cg/resnet_v10/conv2/batchnorm2/moving_variance, v/cg/resnet_v10/conv3/conv2d/kernel, v/cg/resnet_v10/conv3/batchnorm3/gamma, v/cg/resnet_v10/conv3/batchnorm3/beta, v/cg/resnet_v10/conv3/batchnorm3/moving_mean, v/cg/resnet_v10/conv3/batchnorm3/moving_variance, v/cg/resnet_v10/conv4/conv2d/kernel, v/cg/resnet_v10/conv4/batchnorm4/gamma, v/cg/resnet_v10/conv4/batchnorm4/beta, v/cg/resnet_v10/conv4/batchnorm4/moving_mean, v/cg/resnet_v10/conv4/batchnorm4/moving_variance, v/cg/resnet_v11/conv5/conv2d/kernel, v/cg/resnet_v11/conv5/batchnorm5/gamma, v/cg/resnet_v11/conv5/batchnorm5/beta, v/cg/resnet_v11/conv5/batchnorm5/moving_mean, v/cg/resnet_v11/conv5/batchnorm5/moving_variance, v/cg/resnet_v11/conv6/conv2d/kernel, v/cg/resnet_v11/conv6/batchnorm6/gamma, v/cg/resnet_v11/conv6/batchnorm6/beta, v/cg/resnet_v11/conv6/batchnorm6/moving_mean, v/cg/resnet_v11/conv6/batchnorm6/moving_variance, v/cg/resnet_v11/conv7/conv2d/kernel, v/cg/resnet_v11/conv7/batchnorm7/gamma, v/cg/resnet_v11/conv7/batchnorm7/beta, v/cg/resnet_v11/conv7/batchnorm7/moving_mean, v/cg/resnet_v11/conv7/batchnorm7/moving_variance, v/cg/resnet_v12/conv8/conv2d/kernel, v/cg/resnet_v12/conv8/batchnorm8/gamma, v/cg/resnet_v12/conv8/batchnorm8/beta, v/cg/resnet_v12/conv8/batchnorm8/moving_mean, v/cg/resnet_v12/conv8/batchnorm8/moving_variance, v/cg/resnet_v12/conv9/conv2d/kernel, v/cg/resnet_v12/conv9/batchnorm9/gamma, v/cg/resnet_v12/conv9/batchnorm9/beta, v/cg/resnet_v12/conv9/batchnorm9/moving_mean, v/cg/resnet_v12/conv9/batchnorm9/moving_variance, v/cg/resnet_v12/conv10/conv2d/kernel, v/cg/resnet_v12/conv10/batchnorm10/gamma, v/cg/resnet_v12/conv10/batchnorm10/beta, v/cg/resnet_v12/conv10/batchnorm10/moving_mean, v/cg/resnet_v12/conv10/batchnorm10/moving_variance, v/cg/resnet_v13/conv11/conv2d/kernel, v/cg/resnet_v13/conv11/batchnorm11/gamma, v/cg/resnet_v13/conv11/batchnorm11/beta, v/cg/resnet_v13/conv11/batchnorm11/moving_mean, v/cg/resnet_v13/conv11/batchnorm11/moving_variance, v/cg/resnet_v13/conv12/conv2d/kernel, v/cg/resnet_v13/conv12/batchnorm12/gamma, v/cg/resnet_v13/conv12/batchnorm12/beta, v/cg/resnet_v13/conv12/batchnorm12/moving_mean, v/cg/resnet_v13/conv12/batchnorm12/moving_variance, v/cg/resnet_v13/conv13/conv2d/kernel, v/cg/resnet_v13/conv13/batchnorm13/gamma, v/cg/resnet_v13/conv13/batchnorm13/beta, v/cg/resnet_v13/conv13/batchnorm13/moving_mean, v/cg/resnet_v13/conv13/batchnorm13/moving_variance, v/cg/resnet_v13/conv14/conv2d/kernel, v/cg/resnet_v13/conv14/batchnorm14/gamma, v/cg/resnet_v13/conv14/batchnorm14/beta, v/cg/resnet_v13/conv14/batchnorm14/moving_mean, v/cg/resnet_v13/conv14/batchnorm14/moving_variance, v/cg/resnet_v14/conv15/conv2d/kernel, v/cg/resnet_v14/conv15/batchnorm15/gamma, v/cg/resnet_v14/conv15/batchnorm15/beta, v/cg/resnet_v14/conv15/batchnorm15/moving_mean, v/cg/resnet_v14/conv15/batchnorm15/moving_variance, v/cg/resnet_v14/conv16/conv2d/kernel, v/cg/resnet_v14/conv16/batchnorm16/gamma, v/cg/resnet_v14/conv16/batchnorm16/beta, v/cg/resnet_v14/conv16/batchnorm16/moving_mean, v/cg/resnet_v14/conv16/batchnorm16/moving_variance, v/cg/resnet_v14/conv17/conv2d/kernel, v/cg/resnet_v14/conv17/batchnorm17/gamma, v/cg/resnet_v14/conv17/batchnorm17/beta, v/cg/resnet_v14/conv17/batchnorm17/moving_mean, v/cg/resnet_v14/conv17/batchnorm17/moving_variance, v/cg/resnet_v15/conv18/conv2d/kernel, v/cg/resnet_v15/conv18/batchnorm18/gamma, v/cg/resnet_v15/conv18/batchnorm18/beta, v/cg/resnet_v15/conv18/batchnorm18/moving_mean, v/cg/resnet_v15/conv18/batchnorm18/moving_variance, v/cg/resnet_v15/conv19/conv2d/kernel, v/cg/resnet_v15/conv19/batchnorm19/gamma, v/cg/resnet_v15/conv19/batchnorm19/beta, v/cg/resnet_v15/conv19/batchnorm19/moving_mean, v/cg/resnet_v15/conv19/batchnorm19/moving_variance, v/cg/resnet_v15/conv20/conv2d/kernel, v/cg/resnet_v15/conv20/batchnorm20/gamma, v/cg/resnet_v15/conv20/batchnorm20/beta, v/cg/resnet_v15/conv20/batchnorm20/moving_mean, v/cg/resnet_v15/conv20/batchnorm20/moving_variance, v/cg/resnet_v16/conv21/conv2d/kernel, v/cg/resnet_v16/conv21/batchnorm21/gamma, v/cg/resnet_v16/conv21/batchnorm21/beta, v/cg/resnet_v16/conv21/batchnorm21/moving_mean, v/cg/resnet_v16/conv21/batchnorm21/moving_variance, v/cg/resnet_v16/conv22/conv2d/kernel, v/cg/resnet_v16/conv22/batchnorm22/gamma, v/cg/resnet_v16/conv22/batchnorm22/beta, v/cg/resnet_v16/conv22/batchnorm22/moving_mean, v/cg/resnet_v16/conv22/batchnorm22/moving_variance, v/cg/resnet_v16/conv23/conv2d/kernel, v/cg/resnet_v16/conv23/batchnorm23/gamma, v/cg/resnet_v16/conv23/batchnorm23/beta, v/cg/resnet_v16/conv23/batchnorm23/moving_mean, v/cg/resnet_v16/conv23/batchnorm23/moving_variance, v/cg/resnet_v17/conv24/conv2d/kernel, v/cg/resnet_v17/conv24/batchnorm24/gamma, v/cg/resnet_v17/conv24/batchnorm24/beta, v/cg/resnet_v17/conv24/batchnorm24/moving_mean, v/cg/resnet_v17/conv24/batchnorm24/moving_variance, v/cg/resnet_v17/conv25/conv2d/kernel, v/cg/resnet_v17/conv25/batchnorm25/gamma, v/cg/resnet_v17/conv25/batchnorm25/beta, v/cg/resnet_v17/conv25/batchnorm25/moving_mean, v/cg/resnet_v17/conv25/batchnorm25/moving_variance, v/cg/resnet_v17/conv26/conv2d/kernel, v/cg/resnet_v17/conv26/batchnorm26/gamma, v/cg/resnet_v17/conv26/batchnorm26/beta, v/cg/resnet_v17/conv26/batchnorm26/moving_mean, v/cg/resnet_v17/conv26/batchnorm26/moving_variance, v/cg/resnet_v17/conv27/conv2d/kernel, v/cg/resnet_v17/conv27/batchnorm27/gamma, v/cg/resnet_v17/conv27/batchnorm27/beta, v/cg/resnet_v17/conv27/batchnorm27/moving_mean, v/cg/resnet_v17/conv27/batchnorm27/moving_variance, v/cg/resnet_v18/conv28/conv2d/kernel, v/cg/resnet_v18/conv28/batchnorm28/gamma, v/cg/resnet_v18/conv28/batchnorm28/beta, v/cg/resnet_v18/conv28/batchnorm28/moving_mean, v/cg/resnet_v18/conv28/batchnorm28/moving_variance, v/cg/resnet_v18/conv29/conv2d/kernel, v/cg/resnet_v18/conv29/batchnorm29/gamma, v/cg/resnet_v18/conv29/batchnorm29/beta, v/cg/resnet_v18/conv29/batchnorm29/moving_mean, v/cg/resnet_v18/conv29/batchnorm29/moving_variance, v/cg/resnet_v18/conv30/conv2d/kernel, v/cg/resnet_v18/conv30/batchnorm30/gamma, v/cg/resnet_v18/conv30/batchnorm30/beta, v/cg/resnet_v18/conv30/batchnorm30/moving_mean, v/cg/resnet_v18/conv30/batchnorm30/moving_variance, v/cg/resnet_v19/conv31/conv2d/kernel, v/cg/resnet_v19/conv31/batchnorm31/gamma, v/cg/resnet_v19/conv31/batchnorm31/beta, v/cg/resnet_v19/conv31/batchnorm31/moving_mean, v/cg/resnet_v19/conv31/batchnorm31/moving_variance, v/cg/resnet_v19/conv32/conv2d/kernel, v/cg/resnet_v19/conv32/batchnorm32/gamma, v/cg/resnet_v19/conv32/batchnorm32/beta, v/cg/resnet_v19/conv32/batchnorm32/moving_mean, v/cg/resnet_v19/conv32/batchnorm32/moving_variance, v/cg/resnet_v19/conv33/conv2d/kernel, v/cg/resnet_v19/conv33/batchnorm33/gamma, v/cg/resnet_v19/conv33/batchnorm33/beta, v/cg/resnet_v19/conv33/batchnorm33/moving_mean, v/cg/resnet_v19/conv33/batchnorm33/moving_variance, v/cg/resnet_v110/conv34/conv2d/kernel, v/cg/resnet_v110/conv34/batchnorm34/gamma, v/cg/resnet_v110/conv34/batchnorm34/beta, v/cg/resnet_v110/conv34/batchnorm34/moving_mean, v/cg/resnet_v110/conv34/batchnorm34/moving_variance, v/cg/resnet_v110/conv35/conv2d/kernel, v/cg/resnet_v110/conv35/batchnorm35/gamma, v/cg/resnet_v110/conv35/batchnorm35/beta, v/cg/resnet_v110/conv35/batchnorm35/moving_mean, v/cg/resnet_v110/conv35/batchnorm35/moving_variance, v/cg/resnet_v110/conv36/conv2d/kernel, v/cg/resnet_v110/conv36/batchnorm36/gamma, v/cg/resnet_v110/conv36/batchnorm36/beta, v/cg/resnet_v110/conv36/batchnorm36/moving_mean, v/cg/resnet_v110/conv36/batchnorm36/moving_variance, v/cg/resnet_v111/conv37/conv2d/kernel, v/cg/resnet_v111/conv37/batchnorm37/gamma, v/cg/resnet_v111/conv37/batchnorm37/beta, v/cg/resnet_v111/conv37/batchnorm37/moving_mean, v/cg/resnet_v111/conv37/batchnorm37/moving_variance, v/cg/resnet_v111/conv38/conv2d/kernel, v/cg/resnet_v111/conv38/batchnorm38/gamma, v/cg/resnet_v111/conv38/batchnorm38/beta, v/cg/resnet_v111/conv38/batchnorm38/moving_mean, v/cg/resnet_v111/conv38/batchnorm38/moving_variance, v/cg/resnet_v111/conv39/conv2d/kernel, v/cg/resnet_v111/conv39/batchnorm39/gamma, v/cg/resnet_v111/conv39/batchnorm39/beta, v/cg/resnet_v111/conv39/batchnorm39/moving_mean, v/cg/resnet_v111/conv39/batchnorm39/moving_variance, v/cg/resnet_v112/conv40/conv2d/kernel, v/cg/resnet_v112/conv40/batchnorm40/gamma, v/cg/resnet_v112/conv40/batchnorm40/beta, v/cg/resnet_v112/conv40/batchnorm40/moving_mean, v/cg/resnet_v112/conv40/batchnorm40/moving_variance, v/cg/resnet_v112/conv41/conv2d/kernel, v/cg/resnet_v112/conv41/batchnorm41/gamma, v/cg/resnet_v112/conv41/batchnorm41/beta, v/cg/resnet_v112/conv41/batchnorm41/moving_mean, v/cg/resnet_v112/conv41/batchnorm41/moving_variance, v/cg/resnet_v112/conv42/conv2d/kernel, v/cg/resnet_v112/conv42/batchnorm42/gamma, v/cg/resnet_v112/conv42/batchnorm42/beta, v/cg/resnet_v112/conv42/batchnorm42/moving_mean, v/cg/resnet_v112/conv42/batchnorm42/moving_variance, v/cg/resnet_v113/conv43/conv2d/kernel, v/cg/resnet_v113/conv43/batchnorm43/gamma, v/cg/resnet_v113/conv43/batchnorm43/beta, v/cg/resnet_v113/conv43/batchnorm43/moving_mean, v/cg/resnet_v113/conv43/batchnorm43/moving_variance, v/cg/resnet_v113/conv44/conv2d/kernel, v/cg/resnet_v113/conv44/batchnorm44/gamma, v/cg/resnet_v113/conv44/batchnorm44/beta, v/cg/resnet_v113/conv44/batchnorm44/moving_mean, v/cg/resnet_v113/conv44/batchnorm44/moving_variance, v/cg/resnet_v113/conv45/conv2d/kernel, v/cg/resnet_v113/conv45/batchnorm45/gamma, v/cg/resnet_v113/conv45/batchnorm45/beta, v/cg/resnet_v113/conv45/batchnorm45/moving_mean, v/cg/resnet_v113/conv45/batchnorm45/moving_variance, v/cg/resnet_v113/conv46/conv2d/kernel, v/cg/resnet_v113/conv46/batchnorm46/gamma, v/cg/resnet_v113/conv46/batchnorm46/beta, v/cg/resnet_v113/conv46/batchnorm46/moving_mean, v/cg/resnet_v113/conv46/batchnorm46/moving_variance, v/cg/resnet_v114/conv47/conv2d/kernel, v/cg/resnet_v114/conv47/batchnorm47/gamma, v/cg/resnet_v114/conv47/batchnorm47/beta, v/cg/resnet_v114/conv47/batchnorm47/moving_mean, v/cg/resnet_v114/conv47/batchnorm47/moving_variance, v/cg/resnet_v114/conv48/conv2d/kernel, v/cg/resnet_v114/conv48/batchnorm48/gamma, v/cg/resnet_v114/conv48/batchnorm48/beta, v/cg/resnet_v114/conv48/batchnorm48/moving_mean, v/cg/resnet_v114/conv48/batchnorm48/moving_variance, v/cg/resnet_v114/conv49/conv2d/kernel, v/cg/resnet_v114/conv49/batchnorm49/gamma, v/cg/resnet_v114/conv49/batchnorm49/beta, v/cg/resnet_v114/conv49/batchnorm49/moving_mean, v/cg/resnet_v114/conv49/batchnorm49/moving_variance, v/cg/resnet_v115/conv50/conv2d/kernel, v/cg/resnet_v115/conv50/batchnorm50/gamma, v/cg/resnet_v115/conv50/batchnorm50/beta, v/cg/resnet_v115/conv50/batchnorm50/moving_mean, v/cg/resnet_v115/conv50/batchnorm50/moving_variance, v/cg/resnet_v115/conv51/conv2d/kernel, v/cg/resnet_v115/conv51/batchnorm51/gamma, v/cg/resnet_v115/conv51/batchnorm51/beta, v/cg/resnet_v115/conv51/batchnorm51/moving_mean, v/cg/resnet_v115/conv51/batchnorm51/moving_variance, v/cg/resnet_v115/conv52/conv2d/kernel, v/cg/resnet_v115/conv52/batchnorm52/gamma, v/cg/resnet_v115/conv52/batchnorm52/beta, v/cg/resnet_v115/conv52/batchnorm52/moving_mean, v/cg/resnet_v115/conv52/batchnorm52/moving_variance, v/cg/affine0/weights, v/cg/affine0/biases, ready: None

antoajayraj avatar Mar 05 '18 14:03 antoajayraj

This seems similar to #142. Did you try waiting at least a minute for the second worker? If so, can you paste the full output of the second worker after waiting a minute?

reedwm avatar Mar 05 '18 19:03 reedwm

@reedwm I had looked into issue #142 and also waited for almost 15 minutes and its still the same. Attaching the logs here as requested after waiting for 3 #minutes. Machine2_moreWait.log Machine3_moreWait.log

antoajayraj avatar Mar 06 '18 04:03 antoajayraj

Hmmm, I cannot reproduce.

@mrry, any ideas? I am not super familiar with the distributed runtime code so it's hard for me to debug this without being able to reproduce. In the logs, the first worker spews out the line

CreateSession still waiting for response from worker: /job:worker/replica:0/task:1

The second worker checks if variables are initialized every 30 seconds, as expected. Each time, it outputs

Start master session f0ebda29b7eb0d31 with config: intra_op_parallelism_threads: 1 inter_op_parallelism_threads: 158 gpu_options { force_gpu_compatible: true } allow_soft_placement: true

I'm not sure why the second worker is able to start when the first worker is not.

reedwm avatar Mar 07 '18 01:03 reedwm

Usually this means the two workers have chosen different devices for their (shared) variables. One useful step is to log the GraphDef for each worker, then compare the chosen devices for one of the variables named in the "Variables not initialized: ..." error message.

(The "CreateSession still waiting for response from worker" error message usually means that the network addresses in the ClusterSpec are configured incorrectly. I don't think @antoajayraj is seeing that problem.)

mrry avatar Mar 07 '18 01:03 mrry

The "CreateSession still waiting for response from worker" are from @antoajayraj's logs. When I run, I get no issues at all.

@antoajayraj, per @mrry's advice, can you double check you passed the correct values to --ps_hosts, --worker_hosts, --task_name, and --job_name? The values look right to me, but perhaps you misspelled something in the hostnames.

reedwm avatar Mar 07 '18 18:03 reedwm

@reedwm I rechecked the values and they are correct. :( Could it be a problem with having both IB and Ethernet adapters on the same box or something ??? just a wild wild guess

antoajayraj avatar Mar 08 '18 13:03 antoajayraj

I had the similar output:

I0416 08:30:39.696003 140631940617984 tf_logging.py:116] Waiting for model to be ready. 
Ready_for_local_init_op:  Variables not initialized: global_step, ps_var/v0/cg/conv0/conv2d/kernel, 
ps_var/v0/cg/conv0/batchnorm0/gamma, ps_var/v0/cg/conv0/batchnorm0/beta,...,ready: None

The problem went away after I sync the docker image version and checkout to compatible branch on every machine. I'm using the following command:

nvidia-docker run -it --net=host tensorflow/tensorflow:1.6.0-gpu /bin/bash
apt-get update && apt-get install git && git clone https://github.com/tensorflow/benchmarks.git
cd benchmarks/scripts/tf_cnn_benchmarks
git checkout cnn_tf_v1.5_compatible

FYI

ChiaraHsieh avatar Apr 17 '18 06:04 ChiaraHsieh