DeepSpeed stage-3 student+teacher crash
Hi, I have a 1.5B-parameter GPT-XL pretrained teacher network in fp16 with requires_grad=False. The student network is a small GPT with 142M parameters. I use PyTorch Lightning; in the training step I call the teacher first, then the student. The build_net method returns only the student network, so the optimizer should contain only the student's weights.
ZeRO stage 2 works for me, but stage 3 crashes.
Is there any way to partition the weights of the student only, or will DeepSpeed stage 3 try to partition the teacher's weights too?
Going forward, I am also interested in reducing the memory footprint of the teacher; can DeepSpeed be used to partition the teacher's weights in this case? I'd really appreciate your guidance, thanks!
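For reference, here is a rough sketch of the setup as I understand it (frozen fp16 teacher, small trainable student, optimizer over student weights only). The helpers build_teacher, build_student and distillation_loss are placeholders, not the actual code from this thread:

```python
# Minimal sketch of the described setup; helper names are hypothetical.
import torch
import pytorch_lightning as pl

class DistillModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.teacher = build_teacher().half().eval()   # hypothetical 1.5B GPT-XL
        for p in self.teacher.parameters():
            p.requires_grad = False                    # teacher is frozen
        self.student = self.build_net()                # 142M-param student

    def build_net(self):
        return build_student()                         # hypothetical constructor

    def training_step(self, batch, batch_idx):
        with torch.no_grad():
            teacher_logits = self.teacher(batch["input_ids"])
        student_logits = self.student(batch["input_ids"])
        return distillation_loss(student_logits, teacher_logits)  # placeholder loss

    def configure_optimizers(self):
        # Only the student's weights go into the optimizer.
        return torch.optim.AdamW(self.student.parameters(), lr=1e-4)
```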
In my case, when using ZeRO-3 and zero.Init in a distillation scenario, I have observed that a memory leak can occur.
@andrasiani, yes, it is possible to use ZeRO-3 for only the teacher, only the student, or both. Additionally, you can have separate DeepSpeed engines for the different models. I am not sure whether the PyTorch Lightning integration exposes these features to client code. If you share more details of your code, I will be able to provide more suggestions.
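To illustrate the separate-engines idea outside of Lightning, here is a minimal sketch under my own assumptions: the student gets a training engine with a ZeRO-3 config, and the frozen teacher gets its own engine without an optimizer so its weights can also be partitioned. The config file names, model constructors, and loss helper are assumptions for illustration, not DeepSpeed APIs:

```python
# Sketch of two separate DeepSpeed engines for distillation; names are hypothetical.
import torch
import deepspeed

teacher = build_teacher()            # hypothetical frozen 1.5B GPT-XL
student = build_student()            # hypothetical 142M student GPT
for p in teacher.parameters():
    p.requires_grad = False

# Student engine: ZeRO stage 3 partitions the student's weights, gradients,
# and optimizer states across data-parallel ranks.
student_engine, optimizer, _, _ = deepspeed.initialize(
    model=student,
    model_parameters=[p for p in student.parameters() if p.requires_grad],
    config="student_zero3_config.json",   # assumed config with ZeRO stage 3
)

# Teacher engine: a separate engine with no optimizer, so the frozen teacher's
# weights can be partitioned with a ZeRO-3 style config as well.
teacher_engine, _, _, _ = deepspeed.initialize(
    model=teacher,
    config="teacher_zero3_config.json",   # assumed ZeRO-3 config without optimizer
)
teacher_engine.module.eval()

for batch in dataloader:                  # dataloader assumed to exist
    with torch.no_grad():
        teacher_logits = teacher_engine(batch["input_ids"])
    student_logits = student_engine(batch["input_ids"])
    loss = distillation_loss(student_logits, teacher_logits)  # hypothetical helper
    student_engine.backward(loss)
    student_engine.step()
```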
@Quan-Sun, do you mind opening an issue regarding this memory leak so it can be fixed? Thanks!
Hi @tjruwase, I have opened issue #3286.
@andrasiani, do you still need this issue opened?
No, I managed to fix it, thanks.
User error, closing.