Jeff Rasley

Results 85 comments of Jeff Rasley

Hi @chrjxj, can you try changing `"stage": 1,` in your config json to `"stage": 2,`? I want to confirm whether your issue occurs with both stages of ZeRO. I am...

Actually @chrjxj can you set these both to false in your config? I suspect this will fix your issues.

```json
"contiguous_gradients": false,
"overlap_comm": false,
```
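For anyone applying the suggestion above programmatically, here is a minimal sketch. It assumes the two flags live under the `zero_optimization` section of the DeepSpeed config, alongside `stage`; the helper name `disable_zero_comm_opts` and the example config values are hypothetical and only for illustration.

```python
import copy
from typing import Optional


def disable_zero_comm_opts(config: dict, stage: Optional[int] = None) -> dict:
    """Return a copy of `config` with the two ZeRO flags set to False.

    Hypothetical helper: assumes the flags sit under "zero_optimization".
    """
    patched = copy.deepcopy(config)  # leave the caller's dict untouched
    zero = patched.setdefault("zero_optimization", {})
    zero["contiguous_gradients"] = False
    zero["overlap_comm"] = False
    if stage is not None:
        zero["stage"] = stage  # optionally switch ZeRO stage as well
    return patched


# Illustrative config only; real configs will carry more settings.
example = {
    "train_batch_size": 8,
    "zero_optimization": {"stage": 1, "overlap_comm": True},
}
patched = disable_zero_comm_opts(example, stage=2)
```

The patched dict can then be written back out with `json.dump` to whichever config file the training script reads.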

@chrjxj, can you provide the stack trace for the new error message?

@antoiloui, are you also seeing this error? Can you share the deepspeed version you are using and the stack trace? Did you also try turning off `contiguous_gradients` and `overlap_comm`?

Gotcha, I see. Thank you @antoiloui. What version of deepspeed are you running? Is it possible to provide a repro for this error that you're seeing?

@tomekrut are you able to sign the contributor license agreement? Would love to merge your contribution soon! :)

> Hi @jeffra, merge failed for test_pipe.py line 246, in _helper:
>
> ```
> assert rel_diff(base_avg, test_avg) < 0.03
> AssertionError: assert 0.03816467364302331 < 0.03
> + where 0.03816467364302331...
> ```
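To make the failing check in the quoted report concrete, here is a rough sketch of a relative-difference tolerance test. The definition of `rel_diff` below (absolute difference divided by the baseline magnitude) is an assumption, not copied from `test_pipe.py`; `within_tolerance` and the sample values are hypothetical.

```python
def rel_diff(a: float, b: float) -> float:
    """Relative difference of b from baseline a (assumed definition)."""
    return abs(a - b) / abs(a)


def within_tolerance(base_avg: float, test_avg: float, tol: float = 0.03) -> bool:
    """True when the test average is within `tol` of the baseline average."""
    return rel_diff(base_avg, test_avg) < tol


# Hypothetical averages: a 4% gap fails a 3% tolerance, a 2% gap passes.
fails = within_tolerance(1.0, 1.04)
passes = within_tolerance(1.0, 1.02)
```

Under this definition, the reported value of roughly 0.038 means the two averages differed by about 3.8% of the baseline, just over the 3% threshold in the assert.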

It appears this fixed the previous issue in one spot. However, now I am seeing the same error in a new assert: https://github.com/microsoft/DeepSpeed/blob/da71a8975d7387c903c32abd4ec0ff6f174980e0/deepspeed/runtime/zero/stage3.py#L2251-L2253 ![image](https://user-images.githubusercontent.com/645595/110510588-349d6480-80b8-11eb-9dc4-706057e17054.png)

> I expected DeepSpeed calls `__getstate__` of the optimizer to be used, or state_dict() but neither worked for me. What methods of the optimizer do get called when one saves...

Thanks @stas00, I see the issue now. I think I know how we can address this, will give it a try. I was able to repro the issue with your...