
[BUG]: ZeRO not Working with SGD Optimizer

Open FrankLeeeee opened this issue 2 years ago • 3 comments

🐛 Describe the bug

ZeRO keeps reporting overflow when used together with momentum SGD in the resnet example. The same code works fine with all of the AMP modes.

[Screenshot 2022-06-02 at 4 23 26 PM]

Environment

No response

FrankLeeeee avatar Jun 02 '22 08:06 FrankLeeeee

ZeRO is typically used with Adam or second-order optimizers, whose optimizer states take up a lot of memory. Generally, a DNN trained with SGD does not run into memory shortages. We can throw an error if the user uses SGD with ZeRO.
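To make the memory argument concrete, here is a rough sketch of the per-optimizer state cost and of the kind of guard suggested above. It assumes fp32 state tensors and counts only optimizer state (Adam keeps two extra tensors per parameter, momentum SGD keeps one, plain SGD keeps none). The names `optimizer_state_bytes` and `reject_sgd_for_zero` are hypothetical and are not part of ColossalAI; the guard matches on the class name only so that the example stays dependency-free.

```python
# Number of extra state tensors each optimizer keeps per parameter.
# Adam: first and second moment estimates; momentum SGD: one momentum buffer.
STATE_TENSORS = {"adam": 2, "sgd_momentum": 1, "sgd": 0}


def optimizer_state_bytes(num_params, optimizer, bytes_per_value=4):
    """Estimate optimizer state size, assuming fp32 (4-byte) state values."""
    return num_params * STATE_TENSORS[optimizer] * bytes_per_value


def reject_sgd_for_zero(optimizer):
    """Hypothetical guard: refuse SGD before wrapping an optimizer with ZeRO.

    Matches on the class name as an illustration only; a real check would
    more likely use isinstance against the framework's SGD class.
    """
    if type(optimizer).__name__ == "SGD":
        raise ValueError("ZeRO is not supported with SGD in this sketch")
    return optimizer


if __name__ == "__main__":
    n = 25_000_000  # an illustrative model size, not taken from the issue
    for name in STATE_TENSORS:
        print(name, optimizer_state_bytes(n, name) / 2**20, "MiB")
```

For a model of that size, Adam's state is twice that of momentum SGD, which is why sharding optimizer state pays off mostly for Adam-style optimizers.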

feifeibear avatar Jun 03 '22 04:06 feifeibear

It is understood that ZeRO is not needed for SGD from the memory perspective, but this overflow might suggest a bug in the current implementation.
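For readers unfamiliar with the term, "overflow" in mixed-precision training usually means that inf or NaN values were found in the gradients, after which the step is skipped and the loss scale is lowered. The sketch below shows that generic dynamic loss scaling logic in plain Python. It is an illustration of the general technique, not ColossalAI's ZeRO implementation, and the function names and default values are assumptions.

```python
import math


def has_overflow(grads):
    """Return True if any gradient value is inf or NaN."""
    return any(math.isinf(g) or math.isnan(g) for g in grads)


def update_loss_scale(scale, overflow, good_steps,
                      growth_interval=1000, factor=2.0, min_scale=1.0):
    """Generic dynamic loss scaling update (assumed defaults).

    On overflow: shrink the scale and reset the clean-step counter.
    After `growth_interval` consecutive clean steps: grow the scale.
    Returns the new (scale, good_steps) pair.
    """
    if overflow:
        return max(scale / factor, min_scale), 0
    good_steps += 1
    if good_steps >= growth_interval:
        return scale * factor, 0
    return scale, good_steps
```

Under this scheme an occasional overflow is expected and recovers once the scale has dropped far enough. Overflow on every step, even as the scale approaches its minimum, would point to the gradients themselves being invalid, which is consistent with suspecting a bug.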

FrankLeeeee avatar Jun 03 '22 08:06 FrankLeeeee

I see. We will check it later.

feifeibear avatar Jun 03 '22 09:06 feifeibear

We have made many updates since this was reported. This issue was closed due to inactivity. Thanks.

binmakeswell avatar Apr 13 '23 04:04 binmakeswell