tacotron
Low GPU usage
I am training the model on my two Tesla M40 GPUs. When I use the nvidia-smi command to check GPU usage, utilization stays low and only one GPU is used.
How can I make full use of both GPUs?
I've tried increasing the queue capacity and the number of threads, but it helps little.
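For reference, in a TF 1.x queue-based input pipeline those two knobs usually live in the batching call, roughly as in the sketch below. This is a generic, made-up example (the tensor names and shapes are invented), not the repo's actual data loader; raising capacity and num_threads only helps when CPU-side enqueueing really is the bottleneck.

import numpy as np
import tensorflow as tf

# Fake corpus standing in for the real data (shapes and names are invented).
texts = tf.constant(np.random.randint(0, 30, size=(1000, 50)), dtype=tf.int32)
mels = tf.constant(np.random.randn(1000, 200, 80), dtype=tf.float32)

# Produce one (text, mel) example at a time from the in-memory tensors.
text, mel = tf.train.slice_input_producer([texts, mels], shuffle=True)

# capacity and num_threads are the knobs mentioned above.
text_batch, mel_batch = tf.train.batch(
    [text, mel],
    batch_size=32,
    num_threads=8,       # more enqueueing threads
    capacity=32 * 64)    # deeper prefetch queue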
@candlewill Are you running it with Python 3, or which Python version? I had a problem where I couldn't use the GPUs, and I'm guessing it was because I was using Python 2.7; it is discussed in the closed issue #5.
Try to see if you can find something useful there. Otherwise, if you manage to solve it, please let us know how you did it.
@basuam I am using Python 3.6.0 with Anaconda 4.3.1 (64-bit) and the GPU build of TensorFlow (1.1).
During training, both GPUs hold memory, but only one does the computation.
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    2     19759    C   /home/train01/heyunchao/anaconda3/bin/python 21912MiB |
|    3     19759    C   /home/train01/heyunchao/anaconda3/bin/python 21794MiB |
+-----------------------------------------------------------------------------+
Only GPU 2 is used for computation, and its GPU-Util stays at 0% for long periods.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.39                 Driver Version: 375.39                     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M40 24GB      On   | 0000:02:00.0     Off |                    0 |
| N/A   19C    P8    18W / 250W |      0MiB / 22939MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M40 24GB      On   | 0000:03:00.0     Off |                    0 |
| N/A   21C    P8    17W / 250W |      0MiB / 22939MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla M40 24GB      On   | 0000:83:00.0     Off |                    0 |
| N/A   37C    P0    65W / 250W |  21916MiB / 22939MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla M40 24GB      On   | 0000:84:00.0     Off |                    0 |
| N/A   32C    P0    57W / 250W |  21798MiB / 22939MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
@candlewill I believe the code needs a slight modification to use both GPUs, because multi-GPU placement is not TensorFlow's default. I can't confirm this, since I've never trained it with more than one GPU, but I believe you have to assign work to both GPUs manually; otherwise tensorflow-gpu simply computes on one GPU. Have you trained other networks before without explicitly declaring which GPUs to use, and checked whether they used them all?
To my understanding, this may be the reason: if you don't explicitly specify how ops should be placed across multiple GPUs, TensorFlow puts all computation on the first GPU by default, but allocates memory on all visible GPUs.
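A quick way to see this behavior is to log device placement. The snippet below is generic TF 1.x, not code from this repo, and the matmul is just a stand-in for the real graph.

import tensorflow as tf

# Print where each op is placed; with no tf.device() annotations the compute
# ops land on /gpu:0, while memory is still reserved on every visible GPU.
config = tf.ConfigProto(log_device_placement=True)
config.gpu_options.allow_growth = True   # optional: don't grab all GPU memory up front

a = tf.random_normal([1024, 1024])
b = tf.matmul(a, a)

with tf.Session(config=config) as sess:
    sess.run(b)

# Alternatively, hide the GPUs you don't want TensorFlow to touch:
#   CUDA_VISIBLE_DEVICES=2 python train.py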
candlewill's explanation is correct. I added train_multi_gpus.py for using multiple GPUs.
@basuam @candlewill Would you run and check the file? In my environment (3 x GTX 1080), the time per epoch has dropped to roughly one third. But I'm not sure it's error-free, because this is the first time I've written code for multiple GPUs.
@Kyubyong In the current train.py code, training runs entirely on the CPU (see here). I commented out this line to allow the use of one GPU.
Then I compared the time per epoch between train_multi_gpus.py and train.py. I found that the multi-GPU version takes longer, about 220 seconds per epoch, while the single-GPU version takes about 110 seconds.
My experiment environment is four Tesla K40m GPUs.
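For anyone following along, the change candlewill describes amounts to dropping the CPU pin around the graph. The snippet below is a simplified stand-in (the ops are fake), not the actual train.py code:

import tensorflow as tf

# Before (simplified): the whole training graph was pinned to the CPU,
# so the GPUs only held memory but did no work.
#
#   with tf.device('/cpu:0'):
#       build_training_graph()
#
# After: drop the pin, or place the graph on a GPU explicitly.
with tf.device('/gpu:0'):
    x = tf.random_normal([256, 256])          # stand-in for the real model ops
    loss = tf.reduce_mean(tf.matmul(x, x))

with tf.Session() as sess:
    sess.run(loss)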
Did you run train_multi_gpus.py?
Yes. It takes longer, about 220 seconds per epoch.
You changed the value of num_gpus in hyperparams.py, didn't you?
Yes, I changed the value to 4.
One possibility is the batch size. If you have 4 GPUs, you have to multiply hp.batch_size by 4 for a fair comparison. If you look at the code, each mini-batch is split into 4 pieces, and each piece is fed to one GPU tower.
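For context, the tower pattern described here looks roughly like the sketch below. It is a generic data-parallel example (the model and sizes are invented), not the actual train_multi_gpus.py:

import tensorflow as tf

NUM_GPUS = 4
BATCH_SIZE = 32 * NUM_GPUS   # scale the global batch so each tower still gets 32

def tower_loss(x):
    # Stand-in for the per-tower model; the real code builds the Tacotron graph.
    w = tf.get_variable("w", [256, 256])
    return tf.reduce_mean(tf.square(tf.matmul(x, w)))

inputs = tf.random_normal([BATCH_SIZE, 256])   # stand-in for the input batch
shards = tf.split(inputs, NUM_GPUS, axis=0)    # one shard per GPU tower

opt = tf.train.AdamOptimizer(1e-3)
tower_grads = []
with tf.variable_scope(tf.get_variable_scope()):
    for i in range(NUM_GPUS):
        with tf.device('/gpu:%d' % i):
            loss = tower_loss(shards[i])
            tf.get_variable_scope().reuse_variables()   # share weights across towers
            tower_grads.append(opt.compute_gradients(loss))

# Average the gradients across towers, then apply them once.
avg_grads = []
for grads_and_vars in zip(*tower_grads):
    grads = tf.stack([g for g, _ in grads_and_vars])
    avg_grads.append((tf.reduce_mean(grads, axis=0), grads_and_vars[0][1]))
train_op = opt.apply_gradients(avg_grads)

With this layout, the single-GPU hp.batch_size corresponds to BATCH_SIZE / NUM_GPUS per tower, which is why the global batch size has to be scaled up for a fair timing comparison.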
@candlewill Oh, and I removed the tf.device('/cpu:0') line. I had forgotten to remove it. Thanks.
@candlewill Did you find out why the multi-GPU version is slower than the single-GPU one? For me, the former is definitely much faster than the latter.
I had forgotten to multiply batch_size by num_gpus when training.
I am running into a similar issue: GPU usage (AWS p2.xlarge, Deep Learning CUDA 9 Ubuntu AMI) is low on average, while CPU usage is always at its peak. Using Python 2.7 or 3 makes no difference. It's as if the GPU were used only for some particular task that is seldom invoked.
ubuntu@ip-172-31-13-191:~$ nvidia-smi
Sat Nov 4 08:20:19 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81                 Driver Version: 384.81                     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:00:1E.0 Off |                    0 |
| N/A   75C    P0    93W / 149W |  10984MiB / 11439MiB |     50%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     10238    C   python3                                      10971MiB |
+-----------------------------------------------------------------------------+
ubuntu@ip-172-31-13-191:~$ nvidia-smi
Sat Nov 4 08:20:24 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81                 Driver Version: 384.81                     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:00:1E.0 Off |                    0 |
| N/A   75C    P0    73W / 149W |  10984MiB / 11439MiB |      4%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     10238    C   python3                                      10971MiB |
+-----------------------------------------------------------------------------+
ubuntu@ip-172-31-13-191:~$
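A couple of manual snapshots taken seconds apart don't show much. A throwaway poller like the one below (not part of the repo) logs utilization once per second, which makes it easy to see whether the GPU is mostly idle between short bursts of work:

import subprocess
import time

# Print GPU utilization and memory use once per second (Ctrl-C to stop).
while True:
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=index,utilization.gpu,memory.used",
         "--format=csv,noheader"])
    print(out.decode().strip())
    time.sleep(1)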
@candlewill - I am facing the same low GPU usage issue that @aijanai describes. Could you please indicate which line you commented out to get full utilization on a single GPU? The line mentioned in your comment is already commented out in the code as it stands now, and yet performance is still slow, hence the confusion.
candlewill's explanation is correct. I added train_multi_gpus.py for using multiple GPUs. @basuam @candlewill Would you run and check the file? In my environment (3 x GTX 1080), the time per epoch has dropped to roughly one third. But I'm not sure it's error-free, because this is the first time I've written code for multiple GPUs.
Can you please share train_multi_gpus.py? It is not available now, and with train.py I can't train on the GPUs.