ShijieZZZZ

Results 14 comments of ShijieZZZZ

Hi @griff4692, your deepspeed repo structure looks odd. The line that throws the error is ```File "/home/griffin/.local/lib/python3.8/site-packages/op_builder/builder.py", line 230, in load```. Shouldn't it be in ```/home/griffin/.local/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py``` instead? My guess is your...
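One way to check which copy of `op_builder` Python actually resolves is to ask the import system which file each module name maps to. This is a minimal sketch using only the standard library; the helper name `module_origin` is made up for illustration, and the two module names are taken from the paths quoted above.

```python
import importlib.util


def module_origin(name):
    """Return the file a module name resolves to, or None if it cannot be found."""
    try:
        # For a dotted name, find_spec imports the parent package first.
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        # A missing parent package raises ModuleNotFoundError (an ImportError).
        return None
    return spec.origin if spec is not None else None


if __name__ == "__main__":
    # Compare the top-level module with the one nested inside the deepspeed package.
    for name in ("op_builder", "deepspeed.ops.op_builder"):
        print(name, "->", module_origin(name))
```

If the top-level `op_builder` resolves to a file outside the `deepspeed` package directory, that would be consistent with a stray or mixed install.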

Hi @griff4692, I will close this issue for now. Feel free to re-open if you're still seeing it.

Hi @TingchenFu, @mayank31398, @linhdvu14, @ZeyiLiao, @alexanderswerdlow, @rohitdwivedula, could you please try [this PR](https://github.com/microsoft/DeepSpeed/pull/3033) to see if it fixes this issue?

Hi @leiwen83, @lw3259111, thank you for reporting this issue. Do you have a small example to reproduce it?

Hi @mayank31398, thanks for sharing this. I made one more change [here](https://github.com/microsoft/DeepSpeed/pull/3149/files#:~:text=for%20share_param%20in%20%5B*param_names%2C%20*buffer_names%5D) when populating shared_params.

Looks like a similar issue was observed [here](https://github.com/microsoft/DeepSpeed/pull/3295#issuecomment-1513800129). Will take a look.

Hi @shaankhosla, with this [PR](https://github.com/microsoft/DeepSpeed/pull/3149) merged, could you try again?

Hi @linyubupa, could you share more details on how to reproduce this issue, especially how you measured _cpu memory used_ and _model_size_?
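For reference, here is one way such numbers could be obtained, as a rough sketch rather than the method used in the report. It assumes a Unix system (the `resource` module is not available on Windows), and the function names and parameter shapes are hypothetical. Peak RSS covers only the current process, so each worker process would need to report its own value.

```python
import math
import resource  # Unix-only standard library module
import sys


def model_size_bytes(param_shapes, bytes_per_element):
    """Sum the element counts of all parameter shapes and convert to bytes."""
    total_elements = sum(math.prod(shape) for shape in param_shapes.values())
    return total_elements * bytes_per_element


def peak_rss_bytes():
    """Peak resident set size of the current process, in bytes."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    return peak if sys.platform == "darwin" else peak * 1024


if __name__ == "__main__":
    # Hypothetical shapes for illustration; 2 bytes per element corresponds to fp16.
    shapes = {"linear.weight": (1024, 1024), "linear.bias": (1024,)}
    print("model size (bytes):", model_size_bytes(shapes, 2))
    print("peak CPU memory (bytes):", peak_rss_bytes())
```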

Hi @linyubupa, I am not able to run your code. Do you have a smaller test (with a complete setup) that reproduces the issue? Your suspicion about `mp.spawn` makes sense. See this https://github.com/pytorch/pytorch/issues/38645....