Results: 39 comments of botbw

Hey @Fallqs, thanks for reporting the bug; I will look into this. By the way, would it be possible to share the code you are using, or a minimal repro for...

Hey @281LinChenjian, regarding the problem you've got:

### Code snippet 1

[Here](https://github.com/hpcaitech/ColossalAI/blob/8020f4263095373e4c7ad1b15e54b966a8ccb683/colossalai/zero/low_level/low_level_optim.py#L601), after the optimizer updates the params it clears the `_grad_store`, so you can no longer access the gradient,...
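To illustrate the timing issue, here is a minimal sketch assuming the `LowLevelZeroOptimizer` wrapper; `_grad_store` is an internal, version-dependent attribute, so treat this purely as an illustration:

```python
# Sketch only: `model`, `inputs`, and `optimizer` (the ZeRO wrapper) are
# assumed to already exist; the internal _grad_store layout may change.
optimizer.zero_grad()
loss = model(inputs).mean()
optimizer.backward(loss)  # the ZeRO wrapper exposes its own backward()

# Gradients are only reachable here, between backward() and step():
# _grads_of_params maps param_group_idx -> {id(param): gradient shards}.
for group_idx, grads in optimizer._grad_store._grads_of_params.items():
    for param_id, shards in grads.items():
        print(group_idx, param_id, [s.shape for s in shards])

optimizer.step()  # after this call, _grad_store is cleared
```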

@281LinChenjian I guess you'll have to manually do `torch.distributed.all_gather`. For your [reference](https://pytorch.org/docs/stable/distributed.html#torch.distributed.all_gather).
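A minimal sketch of gathering a local gradient shard with `torch.distributed.all_gather` (the helper name and the assumption that every rank holds an equally sized 1-D shard are mine, not part of ColossalAI):

```python
import torch
import torch.distributed as dist

# Hypothetical helper: gather this rank's gradient shard from every rank
# so each process ends up with the full flattened gradient.
def gather_full_grad(local_shard: torch.Tensor) -> torch.Tensor:
    world_size = dist.get_world_size()
    shards = [torch.empty_like(local_shard) for _ in range(world_size)]
    dist.all_gather(shards, local_shard)
    return torch.cat(shards)
```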

@flymin Thanks for reporting this! Would it be possible to share the config/code snippet you are using? If not, could you try setting `overlap_communication=False` in `LowLevelZeroPlugin` and check if the...
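For reference, a minimal sketch of that setting (the exact argument names may differ slightly across ColossalAI versions, so please check your installed release):

```python
from colossalai.booster import Booster
from colossalai.booster.plugin import LowLevelZeroPlugin

# model and optimizer are your existing objects, defined elsewhere.
plugin = LowLevelZeroPlugin(
    stage=2,
    precision="fp16",
    overlap_communication=False,  # disable comm/compute overlap to rule it out
)
booster = Booster(plugin=plugin)
model, optimizer, *_ = booster.boost(model, optimizer)
```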

Hey @ArnaudFickinger @B-Soul, could you please share the settings of your scripts?

> My code is related to my own ongoing research, so it is not convenient to share. But I just changed the distributed framework to Hugging Face Accelerate, and gradients...

> @botbw thank you, the [low-level](https://github.com/hpcaitech/ColossalAI/blob/74f4a297342f17e6d9447633cc26ce45db33b59d/tests/test_zero/test_low_level/test_zero1_2.py#L164) snippet is working! By the way, which of gemini or low-level should I use for the best performance with 1 to 8 A100 GPUs and...

> @botbw when I define 2 param_groups the `id()` of the parameters of the second group do not match any keys of `optimizer._grad_store._grads_of_params[1]`

@ArnaudFickinger I guess it's unexpected since each...
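To check this concretely, a repro sketch along these lines could help; `HybridAdam` and the two-group split are just example choices, `zero_optim` stands for the wrapped ZeRO optimizer, and `_grads_of_params` is internal and version-dependent:

```python
from colossalai.nn.optimizer import HybridAdam

# Example split into two param groups (decay / no-decay).
decay, no_decay = [], []
for name, p in model.named_parameters():
    (no_decay if name.endswith("bias") else decay).append(p)

base_optim = HybridAdam(
    [{"params": decay, "weight_decay": 0.01},
     {"params": no_decay, "weight_decay": 0.0}],
    lr=1e-4,
)

# zero_optim is the LowLevelZeroOptimizer wrapper obtained from the
# plugin/booster after wrapping base_optim and running backward() once.
for group_idx, params in enumerate([decay, no_decay]):
    keys = zero_optim._grad_store._grads_of_params.get(group_idx, {})
    missing = [id(p) for p in params if id(p) not in keys]
    print(f"group {group_idx}: {len(missing)} params missing from _grads_of_params")
```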