seq2seq How to use Multiple GPUs?

How to use Multiple GPUs?

Open papajohn opened this issue 7 years ago • 38 comments

I think seq2seq training is not using multiple GPUs. The tokens/sec metric is the same as when I was training on a VM with only 1 GPU or 4 GPUs.

Can someone provide a demo of how to use 4 GPUs on a single machine? All I found in the docs was https://google.github.io/seq2seq/training/#distributed-training . That links to an example of how to use multiple devices using tf.device and how to use a cluster with tf.learn, but I couldn't figure out how to proceed with either approach. Thanks!

Running python -m bin.train as specified in https://google.github.io/seq2seq/nmt/ ...

Four devices are found (from logs):

I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 1 2 3 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y N N N 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 1:   N Y N N 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 2:   N N Y N 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 3:   N N N Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: a370:00:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla K80, pci bus id: 9f8e:00:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:2) -> (device: 2, name: Tesla K80, pci bus id: b265:00:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:3) -> (device: 3, name: Tesla K80, pci bus id: 8743:00:00.0)

Memory is allocated to all 4, but only one GPU has non-zero utilization.

$ nvidia-smi 
Tue Mar 14 19:42:15 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26                 Driver Version: 375.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 8743:00:00.0     Off |                    0 |
| N/A   50C    P0    74W / 149W |  10363MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 9F8E:00:00.0     Off |                    0 |
| N/A   78C    P0    67W / 149W |  10363MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           Off  | A370:00:00.0     Off |                    0 |
| N/A   74C    P0    94W / 149W |  10402MiB / 11439MiB |     46%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           Off  | B265:00:00.0     Off |                    0 |
| N/A   62C    P0    64W / 149W |  10363MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

Mar 14 '17 21:03 papajohn

There is nothing about GPU device placement hardcoded, so Tensorflow should handle the device placement. I usually train with 1 GPU only (but multiple workers), so I haven't tried the multi-GPU case.

Can you try running a larger model? It could be that TF decides that the small model is not worth splitting across GPUs. Hopefully TF will put the computation on separate devices. E.g. use nmt_large.yml instead of nmt_small.yml as your config. If it doesn't work, we may need to add tf.device statements to put different RNN layers on different GPUs.

Mar 14 '17 21:03 dennybritz

I was using nmt_large.yml above. Thanks for the quick response!

python -m bin.train \
  --config_paths="./example_configs/nmt_large.yml,./example_configs/train_seq2seq.yml" \
  --model_params "
      vocab_source: $VOCAB_SOURCE
      vocab_target: $VOCAB_TARGET" \
  --input_pipeline_train "
    class: ParallelTextInputPipeline
    params:
      source_files:
        - $TRAIN_SOURCES
      target_files:
        - $TRAIN_TARGETS" \
  --input_pipeline_dev "
    class: ParallelTextInputPipeline
    params:
       source_files:
        - $DEV_SOURCES
       target_files:
        - $DEV_TARGETS" \
  --batch_size 32 \
  --buckets 8,12,16,20,24,28,32,36,40 \
  --train_steps $TRAIN_STEPS \
  --output_dir $MODEL_DIR

Mar 14 '17 21:03 papajohn

By the way, I was just expecting data parallelism — that different batches would be processed on different GPUs. Sounds very similar to your multiple worker set-up, just on one machine. (But I still don't know how to invoke that, if it's even possible.)

Mar 14 '17 21:03 papajohn

I see. I think it is not too common to have data parallelism on the same machine for seq2seq models, but people have found that putting different RNN layers on separate devices speed up things, and we should do that if more than 1 GPU is available.

I will need to look into data parallelism on multiple GPUs. In the best case all we need is instantiate the model multiple times on a separate GPU and average the losses. In that case it may only require a few lines of code change. But maybe it's more complex than that.

Thanks for reporting, I'll take a look at this soon (may take 2-3 days).

Mar 14 '17 21:03 dennybritz

Great! Thanks for taking a look.

I think the use case is reasonably common among academics: launch a fresh 8-GPU instance on some public cloud, install/configure software, download data, & run an experiment.

OpenNMT follows this model, I believe.

Mar 14 '17 21:03 papajohn

Sounds reasonable. Will add this in the next few days.

Mar 14 '17 21:03 dennybritz

@dennybritz, may I ask what's the state of this issue? I'm currently trying to train a conversational dialogue system using this tool and would like to train the model using multiple GPUs since our (desired) model is rather huge, with 4096 hidden units in the encoder/decoder each, and I currently run into OOM problems when the size of my model exceeds 2048 hidden units.

I'm willed to invest some time to help you implementing this feature (if needed). I already took a quick look at the code and couldn't find an obvious place where put the with tf.device(...) wrapper. As far as I understand it, the computational graph must be splitted into multiple parts if I want to leverage the computational power of multiple GPUs (not only RAM). Due to the nature of seq2seq models, this could for example be done by putting the encoder on the first GPU and the decoder on another, right? But I also see some problems, for example does the attention mechanism still work "out of the box" if the encoder is placed on different GPU than the decoder?

Mar 23 '17 09:03 vongruenigen

The original issue of parallelizing training across multiple GPUs through data parallelism is very high on my priority list and I will add that ASAP.

However, that seems different from your issue, @vongruenigen. What you want is split the model across multiple GPUs. You're not going to fit a model that big into a single GPU. Just to do a back of the envelope calculation, if you have a ~30k vocab and 4096 units, then your softmax matrix will be 4096 * 3* 30,000 *32 = 11.7GB alone. So, you're not going to fit that model onto a single GPU, no matter what code you use. To make this work you'd need to modify the model code and use something like sampled softmax, or implement a sharded softmax yourself.

Due to the nature of seq2seq models, this could for example be done by putting the encoder on the first GPU and the decoder on another, right? But I also see some problems, for example does the attention mechanism still work "out of the box" if the encoder is placed on different GPU than the decoder?

It will still work, but it's not going to help you. The vast majority of parameters/memory are usually in the softmax and embeddings/inputs. That's what you need to split (or use an alternative) and there is no "obvious" way to do that, other than maybe using the sampled_softmax_loss in Tensorflow. I haven't used that myself though, and it will only help you for training, not inference.

Mar 23 '17 18:03 dennybritz

@dennybritz, I was aware that a large number of parameters is placed in the softmax, but I didn't realize that it's that huge. I'm going to investigate into using sampled/sharded softmax and try to find a solution. Thanks a lot for the quick response and the clarifying explanation!

Mar 23 '17 23:03 vongruenigen

Distributed Training is supported out of the box using tf.learn. Cluster Configurations can be specified using the TF_CONFIG environment variable, which is parsed by the RunConfig. Refer to the Distributed Tensorflow Guide for more information.

Any example of how this works?

Mar 27 '17 20:03 skyw

Any example of how this works?

For a general introduction to distributed training settings check out the Tensorflow tutorial: https://www.tensorflow.org/deploy/distributed

I haven't seen any example of using TF_CONFIG, but check out the documentation in this file: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/learn/python/learn/estimators/run_config.py

So instead of needing to change the code I believe you should be able to set all required options via the environment variable.

Mar 28 '17 17:03 dennybritz

Hi @dennybritz

any news on this topic? I was trying to train a nmt_large model on 8 GPUs machine but I confirm that only one was actually used.

Here's the output of the nvidia-smi command:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26                 Driver Version: 375.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000:00:17.0     Off |                    0 |
| N/A   70C    P0    75W / 149W |  10417MiB / 11439MiB |     71%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 0000:00:18.0     Off |                    0 |
| N/A   52C    P0    81W / 149W |  10378MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           Off  | 0000:00:19.0     Off |                    0 |
| N/A   63C    P0    65W / 149W |  10378MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           Off  | 0000:00:1A.0     Off |                    0 |
| N/A   55C    P0    79W / 149W |  10378MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla K80           Off  | 0000:00:1B.0     Off |                    0 |
| N/A   65C    P0    64W / 149W |  10378MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla K80           Off  | 0000:00:1C.0     Off |                    0 |
| N/A   50C    P0    77W / 149W |  10378MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla K80           Off  | 0000:00:1D.0     Off |                    0 |
| N/A   66C    P0    67W / 149W |  10378MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla K80           Off  | 0000:00:1E.0     Off |                    0 |
| N/A   54C    P0    81W / 149W |  10378MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      2316    C   python                                       10407MiB |
|    1      2316    C   python                                       10368MiB |
|    2      2316    C   python                                       10368MiB |
|    3      2316    C   python                                       10368MiB |
|    4      2316    C   python                                       10368MiB |
|    5      2316    C   python                                       10368MiB |
|    6      2316    C   python                                       10368MiB |
|    7      2316    C   python                                       10368MiB |
+-----------------------------------------------------------------------------+

btw, it seems that tensorflow is actually using the memory of all the GPUs, but only one of them is actually used. Is this something expected?

Apr 05 '17 09:04 davidecaroselli

Interesting...

Apr 12 '17 20:04 Boggartfly

@davidecaroselli I have the same problem.

Apr 13 '17 06:04 wolfshow

@dennybritz : wanted to know if there are any updates on this.

May 10 '17 15:05 hrishikeshvganu

have the same issue.

May 11 '17 17:05 kirtisbhandari

Is there any update or ideas ? I also want to train a model with multiple gpu. It seem @dennybritz is busy with other stuffs.

May 15 '17 13:05 liyi193328

@davidecaroselli @wolfshow face the same problems. How do you smart gays solve the problem? much thanks

May 16 '17 03:05 liyi193328

waiting

Jun 28 '17 11:06 NingHongKe

I would recommend the tensor2tensor library, support of multiple gpus is working pretty well: https://github.com/tensorflow/tensor2tensor

Jun 28 '17 11:06 stefan-it

@davidecaroselli About using all GPU memory problem, TF provides gpu_options.allow_growth option on session config. If it's True, TF will start with small memory & allocate more when it requires. If it's False (default), TF will allocate all of the memory at the beginning. That's why you have seen all of your GPU mem is allocated. Ref: https://www.tensorflow.org/tutorials/using_gpu

I don't use seq2seq yet, but look at its bin/train.py, I found the flag named gpu_allow_growth which actually provides value for the original gpu_options.allow_growth option. It's clearly set to False as default. I guess that you can set this flag to True to request TF to allocate memory on demand.

Jul 02 '17 02:07 nptdat

@nptdat In fact, that doesn't solve those problems. I think the only way make fully use of the gpu is 1. data parallelism 2. allocate each gpu to each layer/some layers manually. However, this library seems to be abandoned....

Jul 04 '17 05:07 yanghoonkim

I see. I think it is not too common to have data parallelism on the same machine for seq2seq models, but people have found that putting different RNN layers on separate devices speed up things, and we should do that if more than 1 GPU is available.

But The results page said that @dennybritz used 8 gpus: https://google.github.io/seq2seq/results/

Jul 04 '17 05:07 yanghoonkim

@ad26kt Yeah, I just mentioned about all memory allocation problem, not about how to make all GPU work.

Jul 04 '17 07:07 nptdat

guys the answer to all your problems is cudann https://developer.nvidia.com/cudnn install cudnn from above link install instructions https://stackoverflow.com/questions/42013316/after-building-tensorflow-from-source-seeing-libcudart-so-and-libcudnn-errors

MY TOKENS AFTER I USE THIS: 2017-07-12 15:42:33.111509: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0 1 2 3 2017-07-12 15:42:33.111533: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: Y Y Y Y 2017-07-12 15:42:33.111539: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 1: Y Y Y Y 2017-07-12 15:42:33.111543: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 2: Y Y Y Y 2017-07-12 15:42:33.111547: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 3: Y Y Y Y

BUT AFTER TRAINING THE MODEL, I CAN SEE THE UTILIZATION ONLY FOR ONE GPU THAT MEANS IT IS USING 4 GPUS WHILE TRAINING BUT AFTER TRAINING IT IS JUST COMING BACK TO ONE GPU WE MUST QUERY NVIDIA-SMI WHILE WE TRAIN USING DIFFERENT CONNECTION I FEEL I WILL TRY AND UPDATE

Jul 12 '17 15:07 imranshaikmuma

I used MXNET and solved the issue

Jul 19 '17 18:07 imranshaikmuma

@ad26kt No, you can use data parallelism too, in TensorFlow. Refer to the following cifar10 example provided.

As @nptdat mentioned, I also suspect that allow_growth is the reason for using up all the memory available. Even if you are using only a single GPU model, tensorflow by default allocates full memory on all the GPUs it can see.

If you are not aware of this previously, visibility of GPUs to a certain application can be controlled by prepending the run command with 'CUDA_VISIBLE_DEVICES=<gpu_numbers_to_be_made_visible>'.

Jul 20 '17 19:07 sampathchanda

@sampathchanda I run my model with multi GPUs and data parallelism but all GPU's memory is located while only one GPU is used to calculate. And I also run code at cifar10 example but it is same as situation describing above. Can you explain why?

Aug 14 '17 07:08 DucVuMinh

@DucVuMinh Tensorflow by default uses memory power of all GPUs as it allocates maximum memory for your job but not processing speed. To utilize the processing power of all GPUs as well you need to specify tf.device statements where ever you want to do parallel processing in your code. In Tensorflow, you have to manually assign devices on your own and also calculate the overall gradients by collecting output from all devices on your own. But MXNET does this thing automatically and you just need to specify CONTEXT statement indicating list of GPUs available. You dont have to calculate the average loss of your model by yourself. It will do it on its own. Let me know if you have any more questions

Aug 14 '17 12:08 imranshaikmuma

@imranshaikmuma In my model I also use tf.device statements to do parallel processing. I implement as scenario describing in cifar10 multi gpu train but when training I see that only one GPU is using. And when I run the example: cifar10 multi gpu train, it still use only one GPU while all memory of all GPUs are located.

Aug 14 '17 14:08 DucVuMinh

seq2seq seq2seq copied to clipboard

How to use Multiple GPUs?

seq2seq
seq2seq copied to clipboard