Uni-Core
EMA's param and new_param on different devices when using multiple GPUs
I was training Uni-Mol with Uni-Core on multiple GPUs (one node) and ran into the following error:
diff = self.param - new_param
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
(the same RuntimeError was raised on every rank, cuda:0 through cuda:7)
The direct cause is clear. The subtraction at
https://github.com/dptech-corp/Uni-Core/blob/ec396a79da8d9ee1b2b93a07b34675ac71c92fc7/unicore/ema.py#L47
assumes self.param and new_param are on the same device, but they are not. A workaround is to manually move them onto the same device in the update() function, as sketched below. However, that might hide the root cause, which is worth digging into.
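For illustration, here is a minimal sketch of that workaround. The update() method name comes from the issue; the rest of the class is a simplified stand-in, not the actual Uni-Core EMA implementation:

import torch

class ExponentialMovingAverage:
    """Simplified stand-in for the EMA helper in unicore/ema.py (illustrative only)."""

    def __init__(self, param: torch.Tensor, decay: float = 0.999):
        self.decay = decay
        self.param = param.detach().clone()

    def update(self, new_param: torch.Tensor) -> None:
        # Workaround: make sure both tensors live on the same device before
        # subtracting, so the update no longer mixes cuda:N and cpu tensors.
        new_param = new_param.detach().to(self.param.device)
        diff = self.param - new_param
        self.param -= (1.0 - self.decay) * diff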
I encountered the same problem when training UniMol+.
My Solution:
I fixed this problem by moving the initialization of self.param to CUDA within the flatten_parameters method in ema.py. Change https://github.com/dptech-corp/Uni-Core/blob/ec396a79da8d9ee1b2b93a07b34675ac71c92fc7/unicore/ema.py#L39
to
flatten_param = torch.nn.Parameter(flatten_param).cuda()
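For context, a rough sketch of where that change sits. This is a simplified flatten_parameters-style helper, not the exact Uni-Core code, so everything except the flatten_param line is illustrative and assumes the model itself lives on a CUDA device:

import torch

def flatten_parameters(model: torch.nn.Module) -> torch.Tensor:
    # Concatenate all model parameters into a single flat tensor that the
    # EMA tracks as self.param.
    flatten_param = torch.cat([p.detach().reshape(-1) for p in model.parameters()])
    # Original: flatten_param = torch.nn.Parameter(flatten_param)
    # Proposed fix: move the flattened copy onto the GPU so it matches the
    # device of new_param during update().
    flatten_param = torch.nn.Parameter(flatten_param).cuda()
    return flatten_param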
However, I'm not entirely certain if this is the most appropriate way to address the issue. It would be helpful to get feedback from the maintainers or the community to ensure that this fix is correct and doesn't introduce any unintended consequences.
This is fixed in https://github.com/dptech-corp/Uni-Core/commit/89fcb4b85165c4171916bee2e616be2826dd7585.