qlora
Model finished training, but adapter_model.bin is empty?
I started the training using:
python qlora.py \
--model_name_or_path /home/nap/llm_models/llamaOG-65B-hf/ \
--output_dir ./output \
--dataset alpaca \
--do_train True \
--do_eval True \
--do_mmlu_eval False \
--source_max_len 384 \
--target_max_len 128 \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 4 \
--logging_steps 10 \
--max_steps 10000 \
--save_strategy steps \
--data_seed 42 \
--save_steps 1000 \
--save_total_limit 40 \
--evaluation_strategy steps \
--eval_dataset_size 1024 \
--max_eval_samples 1000 \
--eval_steps 1000 \
--optim paged_adamw_32bit \
--learning_rate 0.0001
It took 2.5 days but completed successfully. I checked the ./output folder and found all of the checkpoint folders, but I don't think I have the final output (an adapter_model.bin of around ~3 GB).
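In case it helps, this is roughly how I'm checking what the run actually wrote. It's just a quick script over the --output_dir used above, nothing qlora-specific:

```python
import os

# Walk the training output directory and report the size of every
# adapter_model.bin (and pytorch_model.bin) found in the checkpoint folders.
output_dir = "./output"  # matches --output_dir from the command above

for root, _dirs, files in os.walk(output_dir):
    for name in files:
        if name in ("adapter_model.bin", "pytorch_model.bin"):
            path = os.path.join(root, name)
            size_mb = os.path.getsize(path) / (1024 ** 2)
            print(f"{path}: {size_mb:.1f} MB")
```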
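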
No, it's a bug.
See https://github.com/artidoro/qlora/pull/44
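For context, the adapter file is supposed to come from PEFT's save_pretrained on the wrapped model. Here is a minimal sketch of what a correct save looks like; the function and variable names are placeholders, not the exact code in qlora.py, and the linked PR fixes where/how this gets invoked during checkpointing:

```python
# Minimal sketch (illustrative only, assuming the trained model is a PeftModel).
from peft import PeftModel

def save_adapter(model: PeftModel, save_dir: str) -> None:
    # Writes adapter_config.json and adapter_model.bin (only the LoRA
    # weights, not the frozen base model) into save_dir.
    model.save_pretrained(save_dir)
```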
@KKcorps I see, thank you...
Since the adapter files weren't written properly during checkpointing, I'm guessing that would require retraining after the fix? =x
If you have pytorch_model.bin files in the checkpoint dirs then it won't, but otherwise it might.
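If a checkpoint folder does contain a full pytorch_model.bin, something along these lines may let you rebuild the adapter file without retraining. This is an untested sketch: the exact key names depend on your PEFT version, the checkpoint path is just an example, and you still need a matching adapter_config.json (either from a checkpoint that has one, or recreated with the same LoRA settings used for training):

```python
import torch

# Pull only the LoRA tensors out of a full checkpoint state dict and
# write them to adapter_model.bin, which is what PEFT loads at inference
# time (alongside a matching adapter_config.json).
checkpoint_bin = "./output/checkpoint-10000/pytorch_model.bin"  # example path
adapter_out = "./output/checkpoint-10000/adapter_model.bin"

state_dict = torch.load(checkpoint_bin, map_location="cpu")
lora_state_dict = {k: v for k, v in state_dict.items() if "lora_" in k}

print(f"found {len(lora_state_dict)} LoRA tensors")
torch.save(lora_state_dict, adapter_out)
```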
I had some luck with my port of alpaca-lora to QLoRA. You can try it from https://github.com/vihangd/alpaca-qlora, though I have only tested it on the Open LLaMA 3B model.