THRED
GPU-Util is low when using multiple GPUs
Hello,
I want to train on multiple GPUs and have tried 8, 4, and 2 GPUs. However, the GPU-Util of some GPUs is very low, almost 0%, and an epoch on 8 GPUs takes almost 20 minutes longer than on a single GPU.
Your code sets the default number of GPUs to 4, but when I train on 4 cards, one card's GPU-Util is always 0%. With 2 cards, no GPU sits at 0%, but one of them still only reaches about 20%.
This is the GPU usage when training on 4 cards:
I am not very clear about the sharding. Do I need to modify the code to train on multiple GPUs and speed up training?
Looking forward to your reply!
You mentioned you ran the code with 1 or 2 GPUs. Did you have this problem in those runs too? I suggest turning on log_device
in the config file and comparing the single-GPU run with the 4/8-GPU runs.
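For reference, a minimal sketch of what log_device most likely toggles under the hood, assuming THRED builds a standard TensorFlow 1.x session and that the option maps to log_device_placement (the exact wiring in THRED's config may differ):

```python
# Illustrative only: how device-placement logging is enabled in TF 1.x.
import tensorflow as tf

sess_config = tf.ConfigProto(
    log_device_placement=True,   # print, for every op, the device it was placed on
    allow_soft_placement=True,   # fall back to CPU if an op has no GPU kernel
)

with tf.Session(config=sess_config) as sess:
    a = tf.constant([1.0, 2.0], name="a")
    b = tf.constant([3.0, 4.0], name="b")
    print(sess.run(a + b))  # op-to-device assignments (e.g. /gpu:0, /gpu:3) go to stderr
```

Comparing which /gpu:N devices show up in that log for the single-GPU run versus the 4/8-GPU runs should reveal whether any GPU is left with no ops assigned to it.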
I haven't had this problem before, although GPU-util was around 50-60% for all GPUs.
Thanks for your reply!
- The GPU-util was 70-80% when running with 1 GPU, and 50% and 20% respectively when running with 2 GPUs. But there is always one GPU whose GPU-util stays at 0% the whole time. I turned on log_device to get the device mapping, and I have sent you an email.
- Moreover, I also want to ask whether the experimental results in the paper are averaged over the 3 datasets (3/4/5-turn Reddit)? I ran all the epochs, but my results differ from those in the paper. Could you please provide your results on each dataset?
Sorry for the late reply.
- Have you set CUDA_VISIBLE_DEVICES? Based on the log you sent, no tensor was assigned to one of the GPUs. (See the sketch after this list.)
- All the results in the paper are reported on the 3-turn dataset.
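For reference, a minimal sketch of setting CUDA_VISIBLE_DEVICES from Python before TensorFlow initializes CUDA; the GPU indices here are just an example, adjust them to your machine:

```python
# Illustrative only: restrict TensorFlow to specific physical GPUs.
# CUDA_VISIBLE_DEVICES must be set before TensorFlow touches the driver,
# so it goes before the first tensorflow import.
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"  # expose GPUs 0-3; they become /gpu:0../gpu:3

import tensorflow as tf
from tensorflow.python.client import device_lib

# Verify that TensorFlow actually sees the expected number of GPUs.
gpus = [d.name for d in device_lib.list_local_devices() if d.device_type == "GPU"]
print(gpus)  # expect ['/device:GPU:0', ..., '/device:GPU:3']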