tensor2tensor
Hanging when running transformer in distributed setting
Description
I am trying to run transformer using 1 worker and 1 ps in async mode. The program hangs after printing INFO:tensorflow:Graph was finalized.
Environment information
OS: Ubuntu 16.04
$ pip freeze | grep tensor
-e git+https://github.com/tensorflow/tensor2tensor.git@342e214dea360a7f472fc82f3dd0775d7e224c52#egg=tensor2tensor
tensorboard==1.8.0
tensorflow==1.8.0
tensorflow-tensorboard==1.5.0
$ python -V
Python 3.5.2
For bugs: reproduction and error logs
# Steps to reproduce:
# command to start worker:
TF_CONFIG='{"task": {"type": "master", "index": 0}, "environment": "cloud", "cluster": {"ps": ["10.117.1.30:12000"], "master": ["10.117.1.30:11000"]}}' ./tensor2tensor/bin/t2t-trainer --hparams_set=transformer_base --model=transformer --output_dir=$OUTPUT_DIR
--tmp_dir=$TMP_DIR --data_dir=$DATA_DIR/translate_ende_wmt32k --problem=translate_ende_wmt32k --worker_replicas=1 --ps_gpu=0 --worker_job=/job:master --master=grpc://10.117.1.30:11000 --schedule=train --ps_replicas=1 --worker_gpu=1 --worker_id=0
# command to start ps:
TF_CONFIG='{"task": {"type": "ps", "index": 0}, "environment": "cloud", "cluster": {"ps": ["10.117.1.30:12000"], "master": ["10.117.1.30:11000"]}}' CUDA_VISIBLE_DEVICES='' ./tensor2tensor/bin/t2t-trainer --hparams_set=transformer_base --model=transformer --output_dir=$OUTPUT_DIR
--tmp_dir=$TMP_DIR --data_dir=$DATA_DIR/translate_ende_wmt32k --problem=translate_ende_wmt32k --master=grpc://10.117.1.30:12000 --schedule=run_std_server
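For clarity, here is a minimal sketch (not part of the original commands) of how the TF_CONFIG exported above is read inside the worker process; the cluster and task values are the ones from the worker command:

import json
import os

# Sketch only: parse the TF_CONFIG environment variable set above.
tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
print(tf_config.get("cluster"))  # {"ps": ["10.117.1.30:12000"], "master": ["10.117.1.30:11000"]}
print(tf_config.get("task"))     # {"type": "master", "index": 0}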
# Error logs (the last a couple of lines):
INFO:tensorflow:Setting T2TModel mode to 'train'
INFO:tensorflow:Using variable initializer: uniform_unit_scaling
INFO:tensorflow:Transforming feature 'inputs' with symbol_modality_33945_512.bottom
INFO:tensorflow:Transforming 'targets' with symbol_modality_33945_512.targets_bottom
INFO:tensorflow:Building model body
INFO:tensorflow:Transforming body output with symbol_modality_33945_512.top
INFO:tensorflow:Base learning rate: 2.000000
INFO:tensorflow:Trainable Variables Total size: 61499904
INFO:tensorflow:Using optimizer Adam
/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gradients_impl.py:100: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
"Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
I think I've located where the problem happens: the worker makes an RPC call to CreateSession (in grpc_remote_master.cc), but the call is never handled by any RPC server.
Before that, the worker process creates a gRPC channel to 10.117.1.30:11000 (the master address for this worker process), and that channel is used to create the master gRPC stub. However, no gRPC server was ever started to listen on that address.
So my question is: shouldn't the worker have created a gRPC master service listening on 10.117.1.30:11000? It currently doesn't.
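For illustration, here is a minimal sketch (my assumption of what is missing, not existing T2T code) of a tf.train.Server that would make the worker process listen on its master address:

import tensorflow as tf

# Sketch: cluster taken from the reproduction commands above.
cluster = tf.train.ClusterSpec({
    "master": ["10.117.1.30:11000"],
    "ps": ["10.117.1.30:12000"],
})

# Starting this server in the worker process makes it listen on
# 10.117.1.30:11000, so the CreateSession RPC has a master service to reach.
server = tf.train.Server(cluster, job_name="master", task_index=0, start=True)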
Update:
A tensorflow.train.Server object is created only if the schedule is "run_std_server". I modified the code to create a server object for workers in distributed training. The program doesn't hang anymore and training seems normal now.
It would be great if someone could comment on whether this is a proper fix.
Hi, I ran into a problem when using four machines to train a T2T model for an MT task in a distributed setting. There is 1 master (to update the parameters) and 3 ps (to compute the gradients), and each machine has 8 GPUs. All 4 machines use the same port, 5000, to communicate, but training is too slow: 200 seconds per 100 steps (a single machine takes 55 seconds). I think something must be wrong, but I have no idea what. Should I give each machine a different port number?
@jinliangwei Hello, I met the same problem. Could you tell me how you solved it? Thanks.
@Mack-y Hi, sorry for the late reply. Basically, in trainer_lib.py, I created a tf.train.Server in create_experiment() if it's distributed training and the schedule is not run_std_server. That's all.
Code to create the server:

import tensorflow as tf

def create_tf_server(config):
  # Start a tf.train.Server for this task so the process actually listens
  # on its own address from the cluster spec.
  server = tf.train.Server(
      config.cluster_spec,
      job_name=config.task_type,
      task_index=config.task_id,
      config=config.tf_config,
      start=True)
  return server
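And a rough sketch of how it is wired into create_experiment() (the exact surrounding code is in the gist linked below; the condition and variable names here are my assumptions, not the actual trainer_lib code):

# Sketch: inside create_experiment(), before the estimator is used.
# `run_config` is the RunConfig T2T builds from TF_CONFIG; `schedule` is the
# schedule flag passed to the trainer.
if run_config.cluster_spec and schedule != "run_std_server":
  create_tf_server(run_config)  # keep the gRPC service for this task alive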
@jinliangwei Could you please give more details about your tensor2tensor distributed training solution, in particular create_experiment() and tf.train.Server?
@harvey1994 See here for my patch that gets distributed tensor2tensor working: https://gist.github.com/jinliangwei/eed8fd564a35deae3b892092f3171866
I ran into similar problems. I also found that with T2T 1.10.0, setting the schedule to run_std_server makes it crash; details here: https://github.com/kubeflow/examples/issues/208#issuecomment-436720653
Here's some info on how I worked around this: https://github.com/kubeflow/examples/issues/208#issuecomment-436846074
@jinliangwei Thanks for your solution, but I encountered an OOM error even though there is plenty of free memory. TensorFlow and tensor2tensor are both 1.8. Can you offer some suggestions?
@upwindflys One possible cause is that you are running on GPUs but your T2T model is using float16 instead of float32 (check hparams like activation_dtype). Not all TensorFlow operations have a float16 implementation on GPU and those operations are allocated on a special device called XLA_GPU, which has little memory.
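If that is the cause, a quick thing to try (hedged: activation_dtype is the hparam mentioned above, and --hparams is the standard t2t-trainer override flag) is to pin activations to float32 by adding this override to the trainer command:

--hparams='activation_dtype=float32'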
@jinliangwei Thanks, but that doesn't seem to be the problem here. Thanks anyway.
@jinliangwei Sorry to bother you with another question: what kind of physical environment do you use, e.g. which cuDNN and CUDA versions?
I met the same problem in the current version of tensor2tensor and fixed it with @jinliangwei's method. Could the tensor2tensor team fix this officially?
I met the same problem as well. Interestingly, the Estimator is supposed to handle starting the std server in tf.estimator.training, but for some reason it does not in the current T2T.
@harvey1994 See here for my patch that gets distributed tensor2tensor working: https://gist.github.com/jinliangwei/eed8fd564a35deae3b892092f3171866
- self._server = server
Where is self._server used? Can you share the complete file?