seq2seq
How to use Multiple GPUs?
I think seq2seq training is not using multiple GPUs. The tokens/sec metric is the same whether I train on a VM with 1 GPU or with 4 GPUs.
Can someone provide a demo of how to use 4 GPUs on a single machine? All I found in the docs was https://google.github.io/seq2seq/training/#distributed-training . That links to an example of how to use multiple devices using tf.device and how to use a cluster with tf.learn, but I couldn't figure out how to proceed with either approach. Thanks!
Running python -m bin.train as specified in https://google.github.io/seq2seq/nmt/ ...
Four devices are found (from logs):
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 1 2 3
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y N N N
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 1: N Y N N
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 2: N N Y N
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 3: N N N Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: a370:00:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla K80, pci bus id: 9f8e:00:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:2) -> (device: 2, name: Tesla K80, pci bus id: b265:00:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:3) -> (device: 3, name: Tesla K80, pci bus id: 8743:00:00.0)
Memory is allocated to all 4, but only one GPU has non-zero utilization.
$ nvidia-smi
Tue Mar 14 19:42:15 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26 Driver Version: 375.26 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 8743:00:00.0 Off | 0 |
| N/A 50C P0 74W / 149W | 10363MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 Off | 9F8E:00:00.0 Off | 0 |
| N/A 78C P0 67W / 149W | 10363MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K80 Off | A370:00:00.0 Off | 0 |
| N/A 74C P0 94W / 149W | 10402MiB / 11439MiB | 46% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K80 Off | B265:00:00.0 Off | 0 |
| N/A 62C P0 64W / 149W | 10363MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
There is nothing about GPU device placement hardcoded, so Tensorflow should handle the device placement. I usually train with 1 GPU only (but multiple workers), so I haven't tried the multi-GPU case.
Can you try running a larger model? It could be that TF decides that the small model is not worth splitting across GPUs. Hopefully TF will put the computation on separate devices. E.g. use nmt_large.yml instead of nmt_small.yml as your config. If it doesn't work, we may need to add tf.device statements to put different RNN layers on different GPUs.
I was using nmt_large.yml above. Thanks for the quick response!
python -m bin.train \
  --config_paths="./example_configs/nmt_large.yml,./example_configs/train_seq2seq.yml" \
  --model_params "
      vocab_source: $VOCAB_SOURCE
      vocab_target: $VOCAB_TARGET" \
  --input_pipeline_train "
    class: ParallelTextInputPipeline
    params:
      source_files:
        - $TRAIN_SOURCES
      target_files:
        - $TRAIN_TARGETS" \
  --input_pipeline_dev "
    class: ParallelTextInputPipeline
    params:
      source_files:
        - $DEV_SOURCES
      target_files:
        - $DEV_TARGETS" \
  --batch_size 32 \
  --buckets 8,12,16,20,24,28,32,36,40 \
  --train_steps $TRAIN_STEPS \
  --output_dir $MODEL_DIR
By the way, I was just expecting data parallelism — that different batches would be processed on different GPUs. Sounds very similar to your multiple worker set-up, just on one machine. (But I still don't know how to invoke that, if it's even possible.)
I see. I think it is not too common to have data parallelism on the same machine for seq2seq models, but people have found that putting different RNN layers on separate devices speeds things up, and we should do that if more than 1 GPU is available.
I will need to look into data parallelism on multiple GPUs. In the best case all we need is to instantiate the model once per GPU and average the losses. In that case it may only require a few lines of code change. But maybe it's more complex than that.
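For readers unfamiliar with the idea, here is a minimal, framework-free sketch of "one model copy per GPU, then average". It is not seq2seq or TensorFlow code: the toy linear model, the function names, and the learning rate are all made up for illustration, and the per-replica gradients are computed in a plain loop where a real setup would run them concurrently on separate devices.

```python
# Toy data-parallel training step for a one-parameter model y = w * x
# with squared-error loss. Each "replica" stands in for a model copy on one GPU.
# All names here are hypothetical; nothing below is part of seq2seq.

def grad_on_shard(w, shard):
    """Mean gradient of 0.5 * (w*x - y)**2 with respect to w over one shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)

def data_parallel_step(w, batch, num_replicas, lr=0.1):
    """Split the batch evenly, compute one gradient per replica, average, update.

    Assumes len(batch) is divisible by num_replicas.
    """
    shard_size = len(batch) // num_replicas
    shards = [batch[i * shard_size:(i + 1) * shard_size]
              for i in range(num_replicas)]
    grads = [grad_on_shard(w, s) for s in shards]  # one per GPU in a real setup
    avg_grad = sum(grads) / num_replicas
    return w - lr * avg_grad

batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w_parallel = data_parallel_step(0.0, batch, num_replicas=4)
w_single = data_parallel_step(0.0, batch, num_replicas=1)
print(w_parallel, w_single)
```

With equally sized shards, averaging the per-replica gradients gives the same update as processing the whole batch on one device, which is why the approach can speed up training without changing what is being optimized.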
Thanks for reporting, I'll take a look at this soon (may take 2-3 days).
Great! Thanks for taking a look.
I think the use case is reasonably common among academics: launch a fresh 8-GPU instance on some public cloud, install/configure software, download data, & run an experiment.
OpenNMT follows this model, I believe.
Sounds reasonable. Will add this in the next few days.
@dennybritz, may I ask what's the state of this issue? I'm currently trying to train a conversational dialogue system using this tool and would like to train the model using multiple GPUs since our (desired) model is rather huge, with 4096 hidden units in the encoder/decoder each, and I currently run into OOM problems when the size of my model exceeds 2048 hidden units.
I'm willing to invest some time to help you implement this feature (if needed). I already took a quick look at the code and couldn't find an obvious place to put the with tf.device(...) wrapper. As far as I understand it, the computational graph must be split into multiple parts if I want to leverage the computational power of multiple GPUs (not only their RAM). Due to the nature of seq2seq models, this could for example be done by putting the encoder on the first GPU and the decoder on another, right? But I also see some problems, for example does the attention mechanism still work "out of the box" if the encoder is placed on a different GPU than the decoder?
The original issue of parallelizing training across multiple GPUs through data parallelism is very high on my priority list and I will add that ASAP.
However, that seems different from your issue, @vongruenigen. What you want is to split the model across multiple GPUs. You're not going to fit a model that big into a single GPU. As a back-of-the-envelope calculation, if you have a ~30k vocab and 4096 units, then your softmax matrix will be 4096 * 3 * 30,000 * 32 = 11.7GB alone. So you're not going to fit that model onto a single GPU, no matter what code you use. To make this work you'd need to modify the model code and use something like sampled softmax, or implement a sharded softmax yourself.
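The arithmetic in the estimate above can be checked directly. The comment does not say what each factor stands for or in which unit the product is expressed, so the snippet below only multiplies the numbers as written. Reading the result as bytes and comparing it with the 11439MiB reported by nvidia-smi for a K80 is an assumption made here for illustration.

```python
# Product of the factors quoted in the estimate, taken at face value.
factors = [4096, 3, 30000, 32]
product = 1
for f in factors:
    product *= f
print(product)        # 11796480000
print(product / 1e9)  # 11.79648

# Assumption: interpret the product as bytes and compare it with the
# 11439MiB of memory that nvidia-smi reports for one Tesla K80 in this thread.
k80_bytes = 11439 * 1024 * 1024
fraction_of_card = product / k80_bytes
print(fraction_of_card)
```

Under that reading, the softmax alone would take up nearly the whole card before embeddings, activations, or optimizer state are counted.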
Due to the nature of seq2seq models, this could for example be done by putting the encoder on the first GPU and the decoder on another, right? But I also see some problems, for example does the attention mechanism still work "out of the box" if the encoder is placed on a different GPU than the decoder?
It will still work, but it's not going to help you. The vast majority of parameters/memory are usually in the softmax and embeddings/inputs. That's what you need to split (or use an alternative) and there is no "obvious" way to do that, other than maybe using the sampled_softmax_loss in Tensorflow. I haven't used that myself though, and it will only help you for training, not inference.
@dennybritz, I was aware that a large number of parameters sit in the softmax, but I didn't realize that it's that huge. I'm going to look into using sampled/sharded softmax and try to find a solution. Thanks a lot for the quick response and the clarifying explanation!
Distributed Training is supported out of the box using tf.learn. Cluster Configurations can be specified using the TF_CONFIG environment variable, which is parsed by the RunConfig. Refer to the Distributed Tensorflow Guide for more information.
Any example of how this works?
For a general introduction to distributed training settings check out the Tensorflow tutorial: https://www.tensorflow.org/deploy/distributed
I haven't seen any example of using TF_CONFIG, but check out the documentation in this file: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/learn/python/learn/estimators/run_config.py
So instead of needing to change the code I believe you should be able to set all required options via the environment variable.
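As a rough illustration of what setting the variable could look like, the sketch below builds a TF_CONFIG value with the `cluster` and `task` keys described in the run_config.py documentation. The host names, ports, cluster layout, and the helper name `make_tf_config` are placeholders invented for this example. Other keys may be required depending on the TensorFlow version, and this has not been tested against seq2seq.

```python
import json
import os

# Hypothetical cluster: one parameter server and two workers on localhost.
# Host names and ports are placeholders, not values from this thread.
cluster = {
    "ps": ["localhost:2222"],
    "worker": ["localhost:2223", "localhost:2224"],
}

def make_tf_config(task_type, task_index):
    """Build the JSON string for TF_CONFIG for one process in the cluster."""
    return json.dumps({
        "cluster": cluster,
        "task": {"type": task_type, "index": task_index},
    })

# Each process sets its own TF_CONFIG before training starts;
# this one would be the first worker.
os.environ["TF_CONFIG"] = make_tf_config("worker", 0)
print(os.environ["TF_CONFIG"])
```

Every process in the cluster gets the same `cluster` description but its own `task` entry, so the same training command can be started once per process with a different environment.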
Hi @dennybritz, any news on this topic? I was trying to train an nmt_large model on an 8-GPU machine, but I can confirm that only one GPU was actually used.
Here's the output of the nvidia-smi command:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26 Driver Version: 375.26 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 0000:00:17.0 Off | 0 |
| N/A 70C P0 75W / 149W | 10417MiB / 11439MiB | 71% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 Off | 0000:00:18.0 Off | 0 |
| N/A 52C P0 81W / 149W | 10378MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K80 Off | 0000:00:19.0 Off | 0 |
| N/A 63C P0 65W / 149W | 10378MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K80 Off | 0000:00:1A.0 Off | 0 |
| N/A 55C P0 79W / 149W | 10378MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 Tesla K80 Off | 0000:00:1B.0 Off | 0 |
| N/A 65C P0 64W / 149W | 10378MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 Tesla K80 Off | 0000:00:1C.0 Off | 0 |
| N/A 50C P0 77W / 149W | 10378MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 Tesla K80 Off | 0000:00:1D.0 Off | 0 |
| N/A 66C P0 67W / 149W | 10378MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 Tesla K80 Off | 0000:00:1E.0 Off | 0 |
| N/A 54C P0 81W / 149W | 10378MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 2316 C python 10407MiB |
| 1 2316 C python 10368MiB |
| 2 2316 C python 10368MiB |
| 3 2316 C python 10368MiB |
| 4 2316 C python 10368MiB |
| 5 2316 C python 10368MiB |
| 6 2316 C python 10368MiB |
| 7 2316 C python 10368MiB |
+-----------------------------------------------------------------------------+
btw, it seems that TensorFlow is allocating memory on all the GPUs, but only one of them is actually doing any computation. Is this expected?
Interesting...
@davidecaroselli I have the same problem.
@dennybritz : wanted to know if there are any updates on this.
have the same issue.
Are there any updates or ideas? I also want to train a model with multiple GPUs. It seems @dennybritz is busy with other stuff.
@davidecaroselli @wolfshow I face the same problem. How did you smart guys solve it? Many thanks.
waiting
I would recommend the tensor2tensor library; its multi-GPU support works pretty well: https://github.com/tensorflow/tensor2tensor
@davidecaroselli About the problem of all GPU memory being used: TF provides the gpu_options.allow_growth option in the session config. If it's True, TF starts with a small amount of memory and allocates more as required. If it's False (the default), TF allocates all of the memory at the beginning. That's why you see all of your GPU memory allocated.
Ref: https://www.tensorflow.org/tutorials/using_gpu
I don't use seq2seq yet, but looking at its bin/train.py, I found a flag named gpu_allow_growth, which supplies the value for the underlying gpu_options.allow_growth option. It is set to False by default. I guess you can set this flag to True to make TF allocate memory on demand.
@nptdat In fact, that doesn't solve those problems. I think the only ways to make full use of the GPUs are 1. data parallelism, or 2. manually assigning each layer (or group of layers) to a GPU. However, this library seems to be abandoned....
I see. I think it is not too common to have data parallelism on the same machine for seq2seq models, but people have found that putting different RNN layers on separate devices speeds things up, and we should do that if more than 1 GPU is available.
But the results page says that @dennybritz used 8 GPUs: https://google.github.io/seq2seq/results/
@ad26kt Yeah, I was only talking about the problem of all memory being allocated, not about how to make all GPUs do work.
guys the answer to all your problems is cuDNN: https://developer.nvidia.com/cudnn . Install cuDNN from the link above. Install instructions: https://stackoverflow.com/questions/42013316/after-building-tensorflow-from-source-seeing-libcudart-so-and-libcudnn-errors
MY TOKENS AFTER I USE THIS:
2017-07-12 15:42:33.111509: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0 1 2 3
2017-07-12 15:42:33.111533: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: Y Y Y Y
2017-07-12 15:42:33.111539: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 1: Y Y Y Y
2017-07-12 15:42:33.111543: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 2: Y Y Y Y
2017-07-12 15:42:33.111547: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 3: Y Y Y Y
BUT AFTER TRAINING THE MODEL, I CAN SEE UTILIZATION ON ONLY ONE GPU. THAT MEANS IT IS USING 4 GPUS WHILE TRAINING, BUT AFTER TRAINING IT GOES BACK TO ONE GPU. I THINK WE MUST QUERY NVIDIA-SMI FROM A DIFFERENT CONNECTION WHILE TRAINING IS RUNNING. I WILL TRY AND UPDATE.
UPDATE:
NO I AM WRONG
one gpu is being used i guess
netsvs123@instance-1:~$ nvidia-smi
Wed Jul 12 16:47:27 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66 Driver Version: 375.66 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 0000:00:04.0 Off | 0 |
| N/A 57C P0 76W / 149W | 10915MiB / 11439MiB | 18% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 Off | 0000:00:05.0 Off | 0 |
| N/A 71C P0 76W / 149W | 10873MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K80 Off | 0000:00:06.0 Off | 0 |
| N/A 49C P0 59W / 149W | 10873MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K80 Off | 0000:00:07.0 Off | 0 |
| N/A 68C P0 72W / 149W | 10871MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 2006 G /usr/lib/xorg/Xorg 15MiB |
| 0 2398 C python3 10894MiB |
| 1 2398 C python3 10867MiB |
| 2 2398 C python3 10867MiB |
| 3 2398 C python3 10865MiB |
+-----------------------------------------------------------------------------+
netsvs123@instance-1:~$ nvidia-smi
Wed Jul 12 16:47:44 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66 Driver Version: 375.66 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 0000:00:04.0 Off | 0 |
| N/A 60C P0 98W / 149W | 10915MiB / 11439MiB | 51% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 Off | 0000:00:05.0 Off | 0 |
| N/A 73C P0 76W / 149W | 10873MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K80 Off | 0000:00:06.0 Off | 0 |
| N/A 50C P0 60W / 149W | 10873MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K80 Off | 0000:00:07.0 Off | 0 |
| N/A 69C P0 72W / 149W | 10871MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 2006 G /usr/lib/xorg/Xorg 15MiB |
| 0 2398 C python3 10894MiB |
| 1 2398 C python3 10867MiB |
| 2 2398 C python3 10867MiB |
| 3 2398 C python3 10865MiB |
+-----------------------------------------------------------------------------+
I used MXNET and solved the issue
@ad26kt No, you can use data parallelism too, in TensorFlow. Refer to the following cifar10 example provided.
As @nptdat mentioned, I also suspect that allow_growth (off by default) is the reason all the available memory is used up. Even if your model uses only a single GPU, TensorFlow by default allocates the full memory on all the GPUs it can see.
In case you were not aware, the visibility of GPUs to an application can be controlled by prepending the run command with 'CUDA_VISIBLE_DEVICES=<gpu_numbers_to_be_made_visible>'.
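The same restriction can be applied from Python when launching a process. The sketch below only prepares the environment. The helper name `env_with_visible_gpus` is made up for this example, and the actual launch is left as a comment because the training arguments depend on your setup.

```python
import os

def env_with_visible_gpus(gpu_ids):
    """Return a copy of the current environment exposing only the given GPUs."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env

# Expose only GPU 2 to a child process.
env = env_with_visible_gpus([2])
print(env["CUDA_VISIBLE_DEVICES"])  # 2

# A child process started with this environment, for example via
# subprocess.run(command, env=env), would then see a single GPU,
# which it addresses as device 0.
```

The variable has to be in place before the process initializes CUDA, which is why it is set on the environment of the child process and not changed later from inside a running job.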
@sampathchanda I run my model with multiple GPUs and data parallelism, but the memory of all GPUs is allocated while only one GPU is used for computation. I also ran the code from the cifar10 example and saw the same situation as described above. Can you explain why?
@DucVuMinh By default TensorFlow allocates the maximum memory on every visible GPU for your job, but that does not mean it uses their processing power. To use the processing power of all GPUs as well, you need to add tf.device statements wherever you want parallel processing in your code. In TensorFlow you have to assign devices manually and also compute the overall gradients yourself by collecting the output from all devices. MXNet does this automatically: you just specify a context listing the available GPUs, and you don't have to average the loss of your model yourself. Let me know if you have any more questions.
@imranshaikmuma In my model I also use tf.device statements for parallel processing. I implemented it following the cifar10 multi-GPU training example, but during training I see that only one GPU is in use. And when I run the cifar10 multi-GPU training example itself, it still uses only one GPU while the memory of all GPUs is allocated.