botbw comments

Results: 39 comments of botbw

Gradients are None after booster.backward

> @botbw I have written a min repro with a simple network and in this case the keys actually match! I will take a closer look at my code and...

Gradients are None after booster.backward

Hey folks, I'm closing this issue since it's been stalled for a while. Feel free to reopen it or submit a new issue if you still have any doubts.

[BUG]: Failed to run lora_finetune.py to fine-tune DS

@mahaocong90 Thanks for reporting this! I tested the script with 24xH20 GPUs (2304 GiB mem in total), and the script works fine on my side (at least for dozens of...

[BUG]: Hybrid Parallel Plugin，zero_stage=1，zero_cpu_offload=true，terminate called after throwing an instance of 'c10::Error' what() Cuda error: unspecified launch failure cuda kernel errors might be asynchronously reported at some other API call

Hi @happynaruto, will it be possible to provide the script you ran? Assuming that you are doing some fine-tuning, I tested using `examples/language/llama/benchmark.py` by adding config ```python "qwen": Qwen2Config( hidden_act="silu",...

[BUG]: Unable to figure out how to pass env variables for each node

@Gautam-Rajeev I think master node env vars will be synced to the rest: https://github.com/hpcaitech/ColossalAI/blob/6d676ee0e95d54df90b4ee640dee0e0a198ab8f3/colossalai/cli/launcher/run.py#L280-L287 https://github.com/hpcaitech/ColossalAI/blob/6d676ee0e95d54df90b4ee640dee0e0a198ab8f3/colossalai/cli/launcher/multinode_runner.py#L47-L53 You might want to change the code a bit to allow different NCCL_SOCKET_IFNAME, or simply...

[BUG]: can not save model in pipeline training mode

Datasize metadata consistency with stderr

Datasize metadata consistency with stderr

[checkpoint_io] Fix gather_state_dict_fast