alignment-handbook
Training with LoRA on multi-GPU gives constant loss
I am trying to train the Yi-34B model with a LoRA setup on multi-GPU, but I am getting a constant loss (around 2) throughout my SFT training over 4 epochs, and inference with the trained model gives useless responses.
Could you please provide the code used for training, and some info about the GPUs you used?
@DRXD1000 I am using the SFT trainer script from this repo: https://github.com/huggingface/alignment-handbook/blob/main/scripts/run_sft.py
Regarding GPUs, I am using 4×A100 (80 GB) GPUs from RunPod.
Hm... I guess there is either a CUDA problem (if you are doing 4-bit or 8-bit training) or something wrong with your training data or script.
@DRXD1000 I am sure there is no problem with the data or the script: using this same script I have been able to fine-tune models up to 34B, but with Mixtral 8x7B I hit CUDA out-of-memory errors, so I wanted to try LoRA. Also, the same script with LoRA works fine on a single GPU. I am not doing 4-bit or 8-bit training, though I have enabled bfloat16.
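For reference, this is roughly what I mean by LoRA on a bfloat16 base model (sketched with peft; the LoRA hyperparameters and target modules here are illustrative placeholders, not the exact values from my config):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "01-ai/Yi-34B"  # the base model mentioned above

# Load the base model in bfloat16 -- no 4-bit/8-bit quantization.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Attach a LoRA adapter; r, alpha, dropout, and target_modules are placeholder values.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()  # only the adapter weights should be trainable
```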
Maybe you could try wrapping the training call in an autocast context: `with torch.autocast("cuda"): trainer.train()`
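Roughly like this, assuming `trainer` is the trainer object already built inside `scripts/run_sft.py` (the explicit `dtype=torch.bfloat16` is my assumption since you mentioned bf16; adjust as needed):

```python
import torch

# Run the training forward passes under bfloat16 autocast; `trainer` is assumed
# to be the already-constructed SFT trainer from scripts/run_sft.py.
with torch.autocast("cuda", dtype=torch.bfloat16):
    trainer.train()
```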
If this does not work, you could try the script from the Mixtral blog post on Hugging Face: https://huggingface.co/blog/mixtral#fine-tuning-with-%F0%9F%A4%97-trl
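If I remember right, that post fine-tunes with QLoRA-style 4-bit loading via bitsandbytes before attaching the LoRA adapter; a rough sketch of that loading step (the quantization settings shown are typical QLoRA defaults, not necessarily the blog's exact values):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit (QLoRA-style) loading; the actual hyperparameters should follow the blog post.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1",  # Mixtral base model discussed above
    quantization_config=bnb_config,
    device_map="auto",
)
```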