DeepSpeed stage-3 student+teacher crash
Hi, I have a 1.5B-parameter GPT-XL pretrained teacher network in fp16 with requires_grad=False. The student network is a small GPT with 142M parameters. I use PyTorch Lightning; in the training step I call the teacher first, then the student. The build_net method returns only the student network, so the optimizer should contain only the student's weights.
ZeRO stage 2 works for me, but stage 3 crashes.
Is there any way to partition the weights of the student only, or will DeepSpeed stage 3 try to partition the teacher's weights too?
Going forward, I am also interested in reducing the memory footprint of the teacher; can DeepSpeed be used to partition the teacher's weights in this case? I'd really appreciate your guidance, thanks!
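For reference, here is a rough sketch of the setup as I understand it (frozen fp16 teacher, small trainable student, optimizer over student weights only). The helpers build_teacher, build_student and distillation_loss are placeholders, not the actual code from this thread:

```python
# Minimal sketch of the described setup; helper names are hypothetical.
import torch
import pytorch_lightning as pl

class DistillModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.teacher = build_teacher().half().eval()   # hypothetical 1.5B GPT-XL
        for p in self.teacher.parameters():
            p.requires_grad = False                    # teacher is frozen
        self.student = self.build_net()                # 142M-param student

    def build_net(self):
        return build_student()                         # hypothetical constructor

    def training_step(self, batch, batch_idx):
        with torch.no_grad():
            teacher_logits = self.teacher(batch["input_ids"])
        student_logits = self.student(batch["input_ids"])
        return distillation_loss(student_logits, teacher_logits)  # placeholder loss

    def configure_optimizers(self):
        # Only the student's weights go into the optimizer.
        return torch.optim.AdamW(self.student.parameters(), lr=1e-4)
```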
In my case, when using ZeRO-3 and zero.Init in a distillation scenario, I have observed that a memory leak can occur.
@andrasiani, yes, it is possible to use ZeRO-3 for only the teacher, only the student, or both. Additionally, you can have separate DeepSpeed engines for the different models. I am not sure whether the PyTorch Lightning integration exposes these features to client code. If you share more details of your code, I will be able to provide more suggestions.
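To illustrate the separate-engines idea outside of Lightning, here is a minimal sketch under my own assumptions: the student gets a training engine with a ZeRO-3 config, and the frozen teacher gets its own engine without an optimizer so its weights can also be partitioned. The config file names, model constructors, and loss helper are assumptions for illustration, not DeepSpeed APIs:

```python
# Sketch of two separate DeepSpeed engines for distillation; names are hypothetical.
import torch
import deepspeed

teacher = build_teacher()            # hypothetical frozen 1.5B GPT-XL
student = build_student()            # hypothetical 142M student GPT
for p in teacher.parameters():
    p.requires_grad = False

# Student engine: ZeRO stage 3 partitions the student's weights, gradients,
# and optimizer states across data-parallel ranks.
student_engine, optimizer, _, _ = deepspeed.initialize(
    model=student,
    model_parameters=[p for p in student.parameters() if p.requires_grad],
    config="student_zero3_config.json",   # assumed config with ZeRO stage 3
)

# Teacher engine: a separate engine with no optimizer, so the frozen teacher's
# weights can be partitioned with a ZeRO-3 style config as well.
teacher_engine, _, _, _ = deepspeed.initialize(
    model=teacher,
    config="teacher_zero3_config.json",   # assumed ZeRO-3 config without optimizer
)
teacher_engine.module.eval()

for batch in dataloader:                  # dataloader assumed to exist
    with torch.no_grad():
        teacher_logits = teacher_engine(batch["input_ids"])
    student_logits = student_engine(batch["input_ids"])
    loss = distillation_loss(student_logits, teacher_logits)  # hypothetical helper
    student_engine.backward(loss)
    student_engine.step()
```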
@Quan-Sun, do you mind opening an issue regarding this memory leak so it can be fixed? Thanks!
Hi @tjruwase, I have opened issue #3286.
@andrasiani, do you still need this issue opened?
No, I managed to fix it, thanks.
User error, closing.