
EMA's param and new_param on different devices when using multiple GPUs


I was training Uni-Mol with Uni-Core on multiple GPUs (a single node) and hit the following error. Every rank, cuda:0 through cuda:7, raised the same RuntimeError, with their stderr output interleaved; a representative traceback:

    diff = self.param - new_param
           ~~~~~~~~~~~^~~~~~~~~~~
    RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

The direct cause is clear.

https://github.com/dptech-corp/Uni-Core/blob/ec396a79da8d9ee1b2b93a07b34675ac71c92fc7/unicore/ema.py#L47

This line assumes self.param and new_param are on the same device, but they are not.

A workaround is to move them onto the same device manually in the update() function. However, that might hide the root cause, which is worth digging into.
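
For illustration, a minimal sketch of that workaround. The EMA class below is a hypothetical stand-in (the class name, update() signature, and decay attribute are assumptions, not the actual Uni-Core implementation); only the diff = self.param - new_param line mirrors the linked code:

    import torch

    class EMA:
        # Minimal EMA buffer, simplified from the idea in unicore/ema.py.
        def __init__(self, param: torch.Tensor, decay: float = 0.999):
            self.param = param.detach().clone()
            self.decay = decay

        def update(self, new_param: torch.Tensor) -> None:
            # Workaround: align devices before the subtraction that crashes,
            # so self.param - new_param never mixes cuda:N and cpu tensors.
            new_param = new_param.to(self.param.device)
            diff = self.param - new_param
            self.param -= (1.0 - self.decay) * diff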

lucifer1004 commented Feb 28 '24 17:02

I encountered the same problem when training UniMol+.

My Solution:

I fixed this problem by moving the initialization of self.param to CUDA within the flatten_parameters method in ema.py: change the line at https://github.com/dptech-corp/Uni-Core/blob/ec396a79da8d9ee1b2b93a07b34675ac71c92fc7/unicore/ema.py#L39

to:

    flatten_param = torch.nn.Parameter(flatten_param).cuda()
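
For context, a hedged sketch of where that change sits. This flatten_parameters is an assumption about the structure, not a verbatim copy of unicore/ema.py, which differs in detail:

    import torch

    def flatten_parameters(model: torch.nn.Module) -> torch.Tensor:
        # Concatenate all model parameters into one flat buffer
        # (simplified stand-in for the linked flatten_parameters).
        flatten_param = torch.cat(
            [p.detach().reshape(-1) for p in model.parameters()]
        )
        # The proposed fix: create the EMA buffer on the GPU so the later
        # diff = self.param - new_param never mixes cuda and cpu tensors.
        return torch.nn.Parameter(flatten_param).cuda()

Note that a bare .cuda() targets the current default CUDA device, so each rank needs its default device set correctly (e.g. via torch.cuda.set_device) for this to behave in multi-GPU training.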

However, I'm not entirely certain if this is the most appropriate way to address the issue. It would be helpful to get feedback from the maintainers or the community to ensure that this fix is correct and doesn't introduce any unintended consequences.

xwxztq commented Mar 20 '24 07:03

This is fixed in https://github.com/dptech-corp/Uni-Core/commit/89fcb4b85165c4171916bee2e616be2826dd7585.

guolinke commented Jun 24 '24 05:06