autotrain-advanced icon indicating copy to clipboard operation
autotrain-advanced copied to clipboard

[BUG] Distant computer (4 GPUs 10GiB of VRAM each) crashes the second i launch the finetuning using AutoTrain localy of mistral 7B

Open hichambht32 opened this issue 9 months ago • 9 comments

Prerequisites

  • [X] I have read the documentation.
  • [X] I have checked other issues for similar problems.

Backend

Local

Interface Used

CLI

CLI Command

autotrain --config config.yml

UI Screenshots & Parameters

This is my CLI when the distant computer crashes : Capture d’écran (61) This is my Config.yml file : image

Error Logs

This is the error in BitVise when the computer crashes Capture d'écran 2024-05-10 100356

Additional Information

Hello Everyone, im a newbie to AutoTrain, im trying to finetune a 7B mistral param using AutoTrain localy (since i got 4 gpus with 10 GiB of Vram each) in a distant computer for which im using BitVise to connect via SSh. i was using the SFT Trainer of Huggingface and faced many issues related to CUDA memory for that reason i decided to move to Autotrain since its easier. N.B : I have done everything i know to optimize (LoRa, 4bit model loading, lower batch size, etc...) i also tried the DDP before (Distributed data paralel). I have found some memory usage estimation of fine tuning this same model in BF16 precision that says the peak of Adam is 56gb of Vram which i dont have for the moment. Do you think renting some GPUs from RunPod or Huggingface would fix the issue for me ? Thank you for your guidance.

hichambht32 avatar May 21 '24 13:05 hichambht32