Results: 39 comments of botbw

Hey @Fallqs, thanks for reporting the bug; I will look into this. By the way, would it be possible to share the code you are using, or a minimal repro for...

Hey @281LinChenjian, regarding the problem you've got:

### Code snippet 1

[Here](https://github.com/hpcaitech/ColossalAI/blob/8020f4263095373e4c7ad1b15e54b966a8ccb683/colossalai/zero/low_level/low_level_optim.py#L601), after the optimizer updates the params it clears the `_grad_store`, so you can no longer access the gradient,...
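To illustrate the timing issue, here is a minimal sketch assuming the `LowLevelZeroOptimizer` wrapper; `_grad_store` is an internal, version-dependent attribute, so treat this purely as an illustration:

```python
# Sketch only: `model`, `inputs`, and `optimizer` (the ZeRO wrapper) are
# assumed to already exist; the internal _grad_store layout may change.
optimizer.zero_grad()
loss = model(inputs).mean()
optimizer.backward(loss)  # the ZeRO wrapper exposes its own backward()

# Gradients are only reachable here, between backward() and step():
# _grads_of_params maps param_group_idx -> {id(param): gradient shards}.
for group_idx, grads in optimizer._grad_store._grads_of_params.items():
    for param_id, shards in grads.items():
        print(group_idx, param_id, [s.shape for s in shards])

optimizer.step()  # after this call, _grad_store is cleared
```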

@281LinChenjian I guess you'll have to manually do `torch.distributed.all_gather`. For your [reference](https://pytorch.org/docs/stable/distributed.html#torch.distributed.all_gather).
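A minimal sketch of gathering a local gradient shard with `torch.distributed.all_gather` (the helper name and the assumption that every rank holds an equally sized 1-D shard are mine, not part of ColossalAI):

```python
import torch
import torch.distributed as dist

# Hypothetical helper: gather this rank's gradient shard from every rank
# so each process ends up with the full flattened gradient.
def gather_full_grad(local_shard: torch.Tensor) -> torch.Tensor:
    world_size = dist.get_world_size()
    shards = [torch.empty_like(local_shard) for _ in range(world_size)]
    dist.all_gather(shards, local_shard)
    return torch.cat(shards)
```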

@flymin Thanks for reporting this! Would it be possible to share the config/code snippet you are using? If not, could you try setting `overlap_communication=False` in `LowLevelZeroPlugin` and check if the...
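For reference, a minimal sketch of that setting (the exact argument names may differ slightly across ColossalAI versions, so please check your installed release):

```python
from colossalai.booster import Booster
from colossalai.booster.plugin import LowLevelZeroPlugin

# model and optimizer are your existing objects, defined elsewhere.
plugin = LowLevelZeroPlugin(
    stage=2,
    precision="fp16",
    overlap_communication=False,  # disable comm/compute overlap to rule it out
)
booster = Booster(plugin=plugin)
model, optimizer, *_ = booster.boost(model, optimizer)
```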

Hey @ArnaudFickinger @B-Soul, could you please share the settings of your scripts?

> My code is related to my own ongoing research, so it is not convenient to share. But I just changed the distributed framework to Hugging Face Accelerate, and gradients...

> @botbw thank you, the [low-level](https://github.com/hpcaitech/ColossalAI/blob/74f4a297342f17e6d9447633cc26ce45db33b59d/tests/test_zero/test_low_level/test_zero1_2.py#L164) snippet is working! By the way, which of gemini or low-level should I use for the best performance with 1 to 8 A100 GPUs and...

> @botbw when I define 2 param_groups the `id()` of the parameters of the second group do not match any keys of `optimizer._grad_store._grads_of_params[1]`

@ArnaudFickinger I guess it's unexpected since each...
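To check this concretely, a repro sketch along these lines could help; `HybridAdam` and the two-group split are just example choices, `zero_optim` stands for the wrapped ZeRO optimizer, and `_grads_of_params` is internal and version-dependent:

```python
from colossalai.nn.optimizer import HybridAdam

# Example split into two param groups (decay / no-decay).
decay, no_decay = [], []
for name, p in model.named_parameters():
    (no_decay if name.endswith("bias") else decay).append(p)

base_optim = HybridAdam(
    [{"params": decay, "weight_decay": 0.01},
     {"params": no_decay, "weight_decay": 0.0}],
    lr=1e-4,
)

# zero_optim is the LowLevelZeroOptimizer wrapper obtained from the
# plugin/booster after wrapping base_optim and running backward() once.
for group_idx, params in enumerate([decay, no_decay]):
    keys = zero_optim._grad_store._grads_of_params.get(group_idx, {})
    missing = [id(p) for p in params if id(p) not in keys]
    print(f"group {group_idx}: {len(missing)} params missing from _grads_of_params")
```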