THRED
GPU-Util is low when using multiple GPUs
Hello,
I want to train on multiple GPUs and have tried 8, 4, and 2 GPUs. However, the GPU-Util of some GPUs is very low, almost 0%, and an epoch on 8 GPUs takes almost 20 minutes longer than on a single GPU.
Your code sets the default number of GPUs to 4, but when I train on 4 cards, one card's GPU-Util is always 0%. With 2 cards, no GPU sits at 0%, but one of them still only reaches about 20%.
This is the GPU usage when training on 4 cards:
I am not very clear about the sharding. Do I need to modify the code to train on multiple GPUs and speed up training?
Looking forward to your reply!
You mentioned you ran the code with 1 or 2 GPUs. Did you have this problem in those runs too? I suggest turning on log_device
in the config file and comparing the single-GPU run with the 4/8-GPU runs.
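For reference, a minimal sketch of what log_device most likely toggles under the hood, assuming THRED builds a standard TensorFlow 1.x session and that the option maps to log_device_placement (the exact wiring in THRED's config may differ):

```python
# Illustrative only: how device-placement logging is enabled in TF 1.x.
import tensorflow as tf

sess_config = tf.ConfigProto(
    log_device_placement=True,   # print, for every op, the device it was placed on
    allow_soft_placement=True,   # fall back to CPU if an op has no GPU kernel
)

with tf.Session(config=sess_config) as sess:
    a = tf.constant([1.0, 2.0], name="a")
    b = tf.constant([3.0, 4.0], name="b")
    print(sess.run(a + b))  # op-to-device assignments (e.g. /gpu:0, /gpu:3) go to stderr
```

Comparing which /gpu:N devices show up in that log for the single-GPU run versus the 4/8-GPU runs should reveal whether any GPU is left with no ops assigned to it.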
I haven't had this problem before, although GPU-util was around 50-60% for all GPUs.
Thanks for your reply!
- The GPU-util was 70-80% when running with 1 GPU, and 50% and 20% respectively when running with 2 GPUs. But there is always one GPU whose GPU-util stays at 0% the whole time. I turned on log_device to get the device mapping, and I have sent you an email.
- Moreover, I also want to ask whether the experimental results in the paper are averaged over the 3 datasets (3/4/5-turn Reddit)? I ran all the epochs, but my results differ from those in the paper. Could you please provide your results on each dataset?
Sorry for the late reply.
- Have you set CUDA_VISIBLE_DEVICES? Based on the log you sent, no tensor was assigned to one of the GPUs. (See the sketch after this list.)
- All the results in the paper are reported on the 3-turn dataset.
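For reference, a minimal sketch of setting CUDA_VISIBLE_DEVICES from Python before TensorFlow initializes CUDA; the GPU indices here are just an example, adjust them to your machine:

```python
# Illustrative only: restrict TensorFlow to specific physical GPUs.
# CUDA_VISIBLE_DEVICES must be set before TensorFlow touches the driver,
# so it goes before the first tensorflow import.
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"  # expose GPUs 0-3; they become /gpu:0../gpu:3

import tensorflow as tf
from tensorflow.python.client import device_lib

# Verify that TensorFlow actually sees the expected number of GPUs.
gpus = [d.name for d in device_lib.list_local_devices() if d.device_type == "GPU"]
print(gpus)  # expect ['/device:GPU:0', ..., '/device:GPU:3']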