autotrain-advanced
[BUG] System gets stuck when using multiple GPUs
Prerequisites
- [X] I have read the documentation.
- [X] I have checked other issues for similar problems.
Backend
Local
Interface Used
CLI
CLI Command
autotrain llm \
--train \
--model MediaTek-Research/Breeze-7B-Instruct-64k-v0_1 \
--data-path dataset/ \
--project-name transss \
--text-column text \
--lr 2e-4 \
--batch-size 32 \
--epochs 2 \
--lora-r 16 \
--lora-alpha 32 \
--lora-dropout 0.05 \
--logging-steps 1 \
--use-peft \
--log "tensorboard"
UI Screenshots & Parameters
No response
Error Logs
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
You are using an old version of the checkpointing format that is deprecated (We will also silently ignore gradient_checkpointing_kwargs in case you passed it).Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method _set_gradient_checkpointing in your model.
You are using an old version of the checkpointing format that is deprecated (We will also silently ignore gradient_checkpointing_kwargs in case you passed it).Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method _set_gradient_checkpointing in your model.
You are using an old version of the checkpointing format that is deprecated (We will also silently ignore gradient_checkpointing_kwargs in case you passed it).Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method _set_gradient_checkpointing in your model.
Parameter Offload: Total persistent parameters: 7081984 in 193 params
[rank2]:[E ProcessGroupNCCL.cpp:523] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=253440000, NumelOut=253440000, Timeout(ms)=600000) ran for 600155 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:523] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=253440000, NumelOut=253440000, Timeout(ms)=600000) ran for 600555 milliseconds before timing out.
[rank1]:[E ProcessGroupNCCL.cpp:523] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=253440000, NumelOut=253440000, Timeout(ms)=600000) ran for 600939 milliseconds before timing out.
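For anyone triaging similar logs, here is a minimal sketch that pulls the fields out of the watchdog lines above. The regular expression is written against the exact line format shown in this report and is an assumption about that format, not something PyTorch guarantees; `parse_watchdog_line` is a hypothetical helper name.

```python
import re

# Pattern matching the NCCL watchdog timeout lines quoted in this report.
# The layout is inferred from those lines only; other PyTorch versions may differ.
WATCHDOG_RE = re.compile(
    r"\[Rank (?P<rank>\d+)\] Watchdog caught collective operation timeout: "
    r"WorkNCCL\(SeqNum=(?P<seq>\d+), OpType=(?P<op>\w+), "
    r"NumelIn=(?P<numel_in>\d+), NumelOut=(?P<numel_out>\d+), "
    r"Timeout\(ms\)=(?P<timeout_ms>\d+)\) "
    r"ran for (?P<elapsed_ms>\d+) milliseconds"
)


def parse_watchdog_line(line):
    """Return the timeout fields as a dict, or None if the line does not match."""
    match = WATCHDOG_RE.search(line)
    if match is None:
        return None
    # Every captured field is numeric except the operation type.
    return {
        key: (value if key == "op" else int(value))
        for key, value in match.groupdict().items()
    }


# The rank 2 line from the error log, copied verbatim.
sample = (
    "[rank2]:[E ProcessGroupNCCL.cpp:523] [Rank 2] Watchdog caught collective "
    "operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=253440000, "
    "NumelOut=253440000, Timeout(ms)=600000) ran for 600155 milliseconds "
    "before timing out."
)
info = parse_watchdog_line(sample)
print(info["rank"], info["op"], info["seq"], info["timeout_ms"] / 60000, "min")
```

All three ranks report `SeqNum=1` and `OpType=BROADCAST` with a 600000 ms (10 minute) timeout, which suggests the hang happens on the first collective operation rather than partway through training.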
Additional Information
Hello! Do you have any idea what is causing this issue? The program always gets stuck at the "creating trainer" stage during execution. I am using 3x A6000 GPUs on a single machine. Thank you!
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
Can you try updating the Linux kernel?
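A quick way to check whether a machine is affected by the kernel warning quoted above is sketched below. It only compares the running kernel against the 5.5.0 minimum named in that warning; `parse_kernel` and `kernel_is_supported` are hypothetical helper names, and this is not the check the library itself performs.

```python
import platform
import re

# Minimum version quoted in the warning message.
MIN_KERNEL = (5, 5, 0)


def parse_kernel(release):
    """Extract the leading numeric version, e.g. '5.4.0-150-generic' -> (5, 4, 0)."""
    match = re.match(r"(\d+)(?:\.(\d+))?(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    # Missing minor or patch components are treated as 0.
    return tuple(int(part or 0) for part in match.groups())


def kernel_is_supported(release, minimum=MIN_KERNEL):
    """True when the release is at or above the minimum version."""
    return parse_kernel(release) >= minimum


if __name__ == "__main__":
    release = platform.release()  # on Linux, something like "5.4.0-150-generic"
    status = "meets" if kernel_is_supported(release) else "is below"
    print(f"kernel {release} {status} the 5.5.0 minimum")
```

On the reporter's machine (kernel 5.4.0) this would report that the kernel is below the minimum. Passing the check does not rule out other causes of the NCCL timeout.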
This issue is stale because it has been open for 15 days with no activity.
This issue was closed because it has been inactive for 2 days since being marked as stale.