[BUG] multi gpu training without --single_gpu

Open menatallh opened this issue 4 years ago • 6 comments

Describe the bug: There is a problem with multi-GPU training when I remove the --single_gpu flag.

Expected behavior: It should detect the available GPUs. (image)

Screenshots image

System Info:

  • OS: [e.g. Ubuntu 18.04]
  • Python version [e.g. 3.8.6]
  • PyTorch version [e.g. 1.7.0+cu11]

menatallh avatar Feb 26 '21 21:02 menatallh

If you have solved it, please consider posting your fix for others.

simon-ging avatar Feb 28 '21 12:02 simon-ging

Did you solve this problem?

menggehe avatar Apr 08 '21 13:04 menggehe

Does it still happen? If yes, please post a complete bug report: the command you ran, the complete error message, the output of the system command "nvidia-smi", and your OS, Python, and PyTorch versions. Then I will look into it.
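A minimal sketch for collecting most of that information in one go, in case it helps. The function name collect_report and the dictionary keys are made up for illustration; the script only assumes that nvidia-smi may be on the PATH and that PyTorch may be installed, and it reports placeholders when they are not. The command you ran and the full error message still need to be pasted by hand.

```python
import os
import platform
import shutil
import subprocess
import sys


def collect_report() -> dict:
    """Gather OS, Python, PyTorch, and GPU details for a bug report."""
    report = {
        "os": platform.platform(),
        "python": sys.version.split()[0],
        # If this variable is set, it limits which GPUs CUDA programs can see.
        "cuda_visible_devices": os.environ.get("CUDA_VISIBLE_DEVICES", "<not set>"),
    }

    # PyTorch is optional here so the script also runs where it is missing.
    try:
        import torch

        report["pytorch"] = torch.__version__
        report["gpus_seen_by_pytorch"] = torch.cuda.device_count()
    except ImportError:
        report["pytorch"] = "<not installed>"
        report["gpus_seen_by_pytorch"] = 0

    # Capture the raw nvidia-smi output if the tool is available.
    smi = shutil.which("nvidia-smi")
    if smi is None:
        report["nvidia_smi"] = "<nvidia-smi not found on PATH>"
    else:
        try:
            result = subprocess.run(
                [smi], capture_output=True, text=True, timeout=30
            )
            report["nvidia_smi"] = result.stdout.strip() or result.stderr.strip()
        except (subprocess.SubprocessError, OSError) as exc:
            report["nvidia_smi"] = f"<nvidia-smi failed: {exc}>"

    return report


if __name__ == "__main__":
    for key, value in collect_report().items():
        print(f"{key}: {value}")
```

Comparing gpus_seen_by_pytorch with the GPUs listed by nvidia-smi is a quick way to see whether PyTorch is being restricted to a subset of devices.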

simon-ging avatar Apr 09 '21 12:04 simon-ging

command : image

message: image image

output of system command "nvidia-smi": image

System Info:

  • OS: Ubuntu 18.04
  • Python version 3.8.5
  • PyTorch version 1.8.1

menggehe avatar Apr 09 '21 13:04 menggehe

I changed some code in utils_torch.py. Before: image After: image

But the model still uses only one GPU (device 0).
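For comparison, here is a generic sketch of how a model is commonly spread over several GPUs in PyTorch with torch.nn.DataParallel. This is not the code from utils_torch.py, and the helper name wrap_for_multi_gpu is hypothetical; it only illustrates the general pattern under the assumption that all GPUs are visible to the process.

```python
def wrap_for_multi_gpu(model):
    """Move a model to GPU and wrap it in DataParallel if several GPUs are visible.

    Hypothetical helper for illustration only, not part of the repository.
    """
    # Imported inside the function so the file can be loaded without PyTorch.
    import torch

    n_gpus = torch.cuda.device_count()
    if n_gpus == 0:
        # No CUDA device visible: leave the model on the CPU.
        return model

    # DataParallel expects the parameters to live on the first device.
    model = model.to("cuda:0")
    if n_gpus > 1:
        # By default DataParallel uses every visible GPU and splits each
        # input batch along dimension 0 across them.
        model = torch.nn.DataParallel(model)
    return model
```

If torch.cuda.device_count() returns 1 even though nvidia-smi lists several GPUs, the wrapper above would also stay on a single device, so that count is worth checking first.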

menggehe avatar Apr 09 '21 13:04 menggehe

I will check this problem. It should be possible to train on multiple GPUs. Other than that, unless you increase the model size or batch size, a single 12GB GPU is more than enough to train retrieval.

simon-ging avatar Apr 09 '21 13:04 simon-ging