
Training with LoRA on multi-GPU gives constant loss

Open sids07 opened this issue 1 year ago • 5 comments

I am trying to train the Yi-34B model with a LoRA setup on multiple GPUs, but I am getting a constant loss of around 2 throughout my SFT training over 4 epochs, and inference with the trained model gives useless responses.

sids07 avatar Dec 11 '23 13:12 sids07

Could you please provide the code you used for training?

And some info about the GPUs you used?

DRXD1000 avatar Dec 18 '23 09:12 DRXD1000

@DRXD1000 I am using the SFT trainer script from this repo: https://github.com/huggingface/alignment-handbook/blob/main/scripts/run_sft.py

Regarding the GPUs, I am using 4x A100 (80 GB) GPUs from RunPod.

sids07 avatar Dec 21 '23 16:12 sids07

Hm... I guess there is either a CUDA problem (if you are doing 4-bit or 8-bit training) or something wrong with your training data or script.

DRXD1000 avatar Jan 04 '24 08:01 DRXD1000

@DRXD1000 I am sure there is no problem with the data or the script: following the same script, I was able to fine-tune models up to 34B, but for Mixtral 8x7B I ran into CUDA out-of-memory errors, so I wanted to try LoRA. Also, with LoRA the same script works fine on a single GPU. And I am not doing 4-bit or 8-bit training, though I have activated bfloat16.
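For reference, here is a minimal sketch of the kind of setup I mean (not the actual contents of run_sft.py): LoRA via PEFT, bfloat16, no 4-bit/8-bit quantization. The dataset file, target modules, and hyperparameters below are placeholders, and the SFTTrainer arguments assume a TRL version from around the time of this thread.

```python
# Minimal LoRA + bf16 SFT sketch; values below are illustrative placeholders.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

model_id = "01-ai/Yi-34B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# LoRA adapters on the attention projections (target modules are an assumption).
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

args = TrainingArguments(
    output_dir="yi-34b-sft-lora",
    num_train_epochs=4,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    bf16=True,            # bfloat16 mixed precision, no 4-bit/8-bit quantization
    logging_steps=10,
)

# Placeholder dataset: a JSONL file with a plain "text" column.
train_dataset = load_dataset("json", data_files="train.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    peft_config=peft_config,
    tokenizer=tokenizer,
)
trainer.train()
```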

fusesid avatar Jan 04 '24 11:01 fusesid

Maybe you could try wrapping the training call in an autocast context, i.e. `with torch.autocast("cuda"): trainer.train()` (see the sketch below).
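A minimal sketch of that suggestion, assuming `trainer` is an already-constructed SFTTrainer/Trainer instance (the bfloat16 dtype here is my assumption, matching the bf16 setting mentioned above):

```python
import torch

# Run training inside an autocast context so forward passes use
# bfloat16 mixed precision on CUDA (dtype choice is an assumption).
with torch.autocast("cuda", dtype=torch.bfloat16):
    trainer.train()  # trainer: existing SFTTrainer/Trainer instance
```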

If this does not work, you could try the script from the Mixtral blog post on Hugging Face: https://huggingface.co/blog/mixtral#fine-tuning-with-%F0%9F%A4%97-trl
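That approach fine-tunes with QLoRA, i.e. the frozen base model is loaded in 4-bit and LoRA adapters are trained on top, which also helps with the out-of-memory issue on Mixtral 8x7B. Roughly, the loading step looks like the sketch below; the quantization settings shown are illustrative, not copied from the post.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Quantize the frozen base weights to 4-bit (NF4) so Mixtral-8x7B fits in
# GPU memory; LoRA adapters are then trained on top of the quantized model.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)
# ...then pass this model plus a LoraConfig to SFTTrainer as in the sketch above.
```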

DRXD1000 avatar Jan 04 '24 15:01 DRXD1000