
[BUG] System gets stuck when using multiple GPUs

yengogo opened this issue 1 year ago • 2 comments

Prerequisites

  • [X] I have read the documentation.
  • [X] I have checked other issues for similar problems.

Backend

Local

Interface Used

CLI

CLI Command

autotrain llm --train \
  --model MediaTek-Research/Breeze-7B-Instruct-64k-v0_1 \
  --data-path dataset/ \
  --project-name transss \
  --text-column text \
  --lr 2e-4 \
  --batch-size 32 \
  --epochs 2 \
  --lora-r 16 \
  --lora-alpha 32 \
  --lora-dropout 0.05 \
  --logging-steps 1 \
  --use-peft \
  --log "tensorboard"
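
A possible first diagnostic step (my suggestion, not something tried in this thread) is to rerun the same command with NCCL's debug logging enabled; NCCL_DEBUG and NCCL_DEBUG_SUBSYS are standard NCCL environment variables, and the autotrain flags are unchanged from the command above:

# Each rank logs its NCCL initialization and collective activity to stderr.
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,COLL \
autotrain llm --train \
  --model MediaTek-Research/Breeze-7B-Instruct-64k-v0_1 \
  --data-path dataset/ --project-name transss --text-column text \
  --lr 2e-4 --batch-size 32 --epochs 2 \
  --lora-r 16 --lora-alpha 32 --lora-dropout 0.05 \
  --logging-steps 1 --use-peft --log "tensorboard"

If the per-rank logs stop right at the first broadcast, retrying with NCCL_P2P_DISABLE=1 (another standard NCCL variable, often suggested as a test when collectives hang) can help separate a peer-to-peer transport problem from a general setup problem.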

UI Screenshots & Parameters

No response

Error Logs

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.

You are using an old version of the checkpointing format that is deprecated (We will also silently ignore gradient_checkpointing_kwargs in case you passed it). Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method _set_gradient_checkpointing in your model.
(The warning above is printed three times, once per process.)

Parameter Offload: Total persistent parameters: 7081984 in 193 params

[rank2]:[E ProcessGroupNCCL.cpp:523] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=253440000, NumelOut=253440000, Timeout(ms)=600000) ran for 600155 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:523] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=253440000, NumelOut=253440000, Timeout(ms)=600000) ran for 600555 milliseconds before timing out.
[rank1]:[E ProcessGroupNCCL.cpp:523] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=253440000, NumelOut=253440000, Timeout(ms)=600000) ran for 600939 milliseconds before timing out.
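
The watchdog fires on the very first collective (SeqNum=1, OpType=BROADCAST), which is consistent with a hang while the trainer broadcasts the initial weights. A minimal smoke test, run separately from autotrain, can check whether an NCCL broadcast completes at all across the three GPUs. This is a sketch under my own assumptions: the file name nccl_smoke_test.py is hypothetical, and it assumes a single-node run launched with torchrun.

# Write a tiny standalone broadcast test (nccl_smoke_test.py is my own name).
cat > nccl_smoke_test.py <<'EOF'
import torch
import torch.distributed as dist

dist.init_process_group("nccl")  # torchrun supplies RANK/WORLD_SIZE/MASTER_ADDR
rank = dist.get_rank()
torch.cuda.set_device(rank)      # single node, so global rank == local GPU index

# Each rank starts with a different value; after broadcasting from rank 0,
# every rank should hold zeros. A ten-minute stall here would reproduce the
# watchdog timeout outside of autotrain, pointing at NCCL or the platform.
t = torch.full((1 << 20,), float(rank), device="cuda")
dist.broadcast(t, src=0)
print(f"rank {rank}: broadcast done, t[0] = {t[0].item()}")
dist.destroy_process_group()
EOF

# One process per GPU, matching the 3x A6000 setup from the report.
torchrun --nproc_per_node=3 nccl_smoke_test.py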

Additional Information

Hello! Do you have any ideas about this issue? The program always gets stuck at the "creating trainer" stage during execution. I am running on a single machine with three A6000 GPUs. Thank you!

yengogo avatar Feb 07 '24 06:02 yengogo

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.

Can you try updating the Linux kernel?
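
For completeness, a sketch of what that upgrade could look like, assuming an Ubuntu 20.04 host (the distribution is not stated in the thread; kernel 5.4 is 20.04's default GA kernel):

uname -r   # confirm the running kernel is 5.4.x
sudo apt update
# The 20.04 HWE (hardware enablement) stack ships a kernel newer than 5.4.
sudo apt install --install-recommends linux-generic-hwe-20.04
sudo reboot
uname -r   # after the reboot, this should report a newer kernel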

abhishekkrthakur avatar Feb 07 '24 09:02 abhishekkrthakur

This issue is stale because it has been open for 15 days with no activity.

github-actions[bot] avatar Feb 27 '24 15:02 github-actions[bot]

This issue was closed because it has been inactive for 2 days since being marked as stale.

github-actions[bot] avatar Mar 08 '24 15:03 github-actions[bot]