tensor2tensor
Hanging when running transformer in distributed setting
Description
I am trying to run transformer using 1 worker and 1 ps in async mode. The program hangs after printing INFO:tensorflow:Graph was finalized.
Environment information
OS: Ubuntu 16.04
$ pip freeze | grep tensor
-e git+https://github.com/tensorflow/tensor2tensor.git@342e214dea360a7f472fc82f3dd0775d7e224c52#egg=tensor2tensor
tensorboard==1.8.0
tensorflow==1.8.0
tensorflow-tensorboard==1.5.0
$ python -V
Python 3.5.2
For bugs: reproduction and error logs
# Steps to reproduce:
# command to start worker:
TF_CONFIG='{"task": {"type": "master", "index": 0}, "environment": "cloud", "cluster": {"ps": ["10.117.1.30:12000"], "master": ["10.117.1.30:11000"]}}' ./tensor2tensor/bin/t2t-trainer --hparams_set=transformer_base --model=transformer --output_dir=$OUTPUT_DIR
--tmp_dir=$TMP_DIR --data_dir=$DATA_DIR/translate_ende_wmt32k --problem=translate_ende_wmt32k --worker_replicas=1 --ps_gpu=0 --worker_job=/job:master --master=grpc://10.117.1.30:11000 --schedule=train --ps_replicas=1 --worker_gpu=1 --worker_id=0
# command to start ps:
TF_CONFIG='{"task": {"type": "ps", "index": 0}, "environment": "cloud", "cluster": {"ps": ["10.117.1.30:12000"], "master": ["10.117.1.30:11000"]}}' CUDA_VISIBLE_DEVICES='' ./tensor2tensor/bin/t2t-trainer --hparams_set=transformer_base --model=transformer --output_dir=$OUTPUT_DIR
--tmp_dir=$TMP_DIR --data_dir=$DATA_DIR/translate_ende_wmt32k --problem=translate_ende_wmt32k --master=grpc://10.117.1.30:12000 --schedule=run_std_server
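For clarity, here is a minimal sketch (not part of the original commands) of how the TF_CONFIG exported above is read inside the worker process; the cluster and task values are the ones from the worker command:

import json
import os

# Sketch only: parse the TF_CONFIG environment variable set above.
tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
print(tf_config.get("cluster"))  # {"ps": ["10.117.1.30:12000"], "master": ["10.117.1.30:11000"]}
print(tf_config.get("task"))     # {"type": "master", "index": 0}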
# Error logs (the last a couple of lines):
INFO:tensorflow:Setting T2TModel mode to 'train'
INFO:tensorflow:Using variable initializer: uniform_unit_scaling
INFO:tensorflow:Transforming feature 'inputs' with symbol_modality_33945_512.bottom
INFO:tensorflow:Transforming 'targets' with symbol_modality_33945_512.targets_bottom
INFO:tensorflow:Building model body
INFO:tensorflow:Transforming body output with symbol_modality_33945_512.top
INFO:tensorflow:Base learning rate: 2.000000
INFO:tensorflow:Trainable Variables Total size: 61499904
INFO:tensorflow:Using optimizer Adam
/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gradients_impl.py:100: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
"Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
I think I've located where the problem happens: the worker makes an RPC call to CreateSession (in grpc_remote_master.cc), but the call is never handled by any RPC server.
Before that, the worker process creates a gRPC channel to 10.117.1.30:11000 (the master address for this worker process), and that channel is used to create the master gRPC stub. However, no gRPC server was ever started to listen on that address.
So my question is: shouldn't the worker have created a gRPC master service listening on 10.117.1.30:11000? It currently doesn't.
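For illustration, here is a minimal sketch (my assumption of what is missing, not existing T2T code) of a tf.train.Server that would make the worker process listen on its master address:

import tensorflow as tf

# Sketch: cluster taken from the reproduction commands above.
cluster = tf.train.ClusterSpec({
    "master": ["10.117.1.30:11000"],
    "ps": ["10.117.1.30:12000"],
})

# Starting this server in the worker process makes it listen on
# 10.117.1.30:11000, so the CreateSession RPC has a master service to reach.
server = tf.train.Server(cluster, job_name="master", task_index=0, start=True)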
Update:
A tensorflow.train.Server object is created only if the schedule is "run_std_server". I modified the code to create a server object for workers in distributed training. The program doesn't hang anymore and training seems normal now.
It would be great if someone could comment on whether this is a proper fix.
Hi, I ran into a problem when using four machines to train a T2T model for an MT task in a distributed setting. There is 1 master (to update the parameters) and 3 ps (to compute the gradients), and each machine has 8 GPUs. All 4 machines use the same port, 5000, to communicate, but training is too slow: 200 seconds per 100 steps (a single machine takes 55 seconds). I think something must be wrong, but I have no idea what. Should I give each machine a different port number?
@jinliangwei Hello, I met the same problem. Could you tell me how you solved it? Thanks.
@Mack-y Hi, sorry for the late reply. Basically, in trainer_lib.py, I created a tf.train.Server in create_experiment() if it's distributed training and the schedule is not run_std_server. That's all.
Code to create the server:

import tensorflow as tf

def create_tf_server(config):
  # Start a tf.train.Server for this task so the process actually listens
  # on its own address from the cluster spec.
  server = tf.train.Server(
      config.cluster_spec,
      job_name=config.task_type,
      task_index=config.task_id,
      config=config.tf_config,
      start=True)
  return server
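And a rough sketch of how it is wired into create_experiment() (the exact surrounding code is in the gist linked below; the condition and variable names here are my assumptions, not the actual trainer_lib code):

# Sketch: inside create_experiment(), before the estimator is used.
# `run_config` is the RunConfig T2T builds from TF_CONFIG; `schedule` is the
# schedule flag passed to the trainer.
if run_config.cluster_spec and schedule != "run_std_server":
  create_tf_server(run_config)  # keep the gRPC service for this task alive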
@jinliangwei Could you please give more details about your tensor2tensor distributed training solution, in particular create_experiment() and tf.train.Server?
@harvey1994 See here for my patch that gets distributed tensor2tensor working: https://gist.github.com/jinliangwei/eed8fd564a35deae3b892092f3171866
I ran into similar problems. I also found that with T2T 1.10.0, setting the schedule to run_std_server makes it crash; details here: https://github.com/kubeflow/examples/issues/208#issuecomment-436720653
Here's some info on how I worked around this: https://github.com/kubeflow/examples/issues/208#issuecomment-436846074
@jinliangwei Thanks for your solution, but I encountered an OOM error even though there is plenty of free memory. TensorFlow and tensor2tensor are both 1.8. Can you offer some suggestions?
@upwindflys One possible cause is that you are running on GPUs but your T2T model is using float16 instead of float32 (check hparams like activation_dtype). Not all TensorFlow operations have a float16 implementation on GPU and those operations are allocated on a special device called XLA_GPU, which has little memory.
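If that is the cause, a quick thing to try (hedged: activation_dtype is the hparam mentioned above, and --hparams is the standard t2t-trainer override flag) is to pin activations to float32 by adding this override to the trainer command:

--hparams='activation_dtype=float32'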
@jinliangwei Thanks, but that doesn't seem to be the problem here. Thanks anyway.
@jinliangwei Sorry to bother you with another question: what kind of physical environment do you use, e.g. which cuDNN and CUDA versions?
I met the same problem in the current version of tensor2tensor and fixed it with @jinliangwei's method. Could the tensor2tensor team fix this officially?
I met the same problem as well. Interestingly, the Estimator is supposed to handle starting the std server in tf.estimator.training, but for some reason it does not in the current T2T.
@harvey1994 See here for my patch that gets distributed tensor2tensor working: https://gist.github.com/jinliangwei/eed8fd564a35deae3b892092f3171866
- self._server = server
Where is self._server used? Can you share the complete file?