Model finished training, but adapter_model.bin is empty?

Open disarmyouwitha opened this issue 1 year ago • 4 comments

I started the training using:

python qlora.py \
    --model_name_or_path /home/nap/llm_models/llamaOG-65B-hf/ \
    --output_dir ./output \
    --dataset alpaca \
    --do_train True \
    --do_eval True \
    --do_mmlu_eval False \
    --source_max_len 384 \
    --target_max_len 128 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --logging_steps 10 \
    --max_steps 10000 \
    --save_strategy steps \
    --data_seed 42 \
    --save_steps 1000 \
    --save_total_limit 40 \
    --evaluation_strategy steps \
    --eval_dataset_size 1024 \
    --max_eval_samples 1000 \
    --eval_steps 1000 \
    --optim paged_adamw_32bit \
    --learning_rate 0.0001

It took 2.5 days but completed successfully. I checked the ./output folder and found all of the checkpoint folders, but I don't think I have the final output (an adapter_model.bin of around 3 GB).
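One quick way to confirm the symptom is to list every adapter_model.bin under the output directory along with its size. The following is only a minimal sketch: the 1 KiB "effectively empty" threshold and the function names are assumptions made for illustration, and the recursive search is used so that no particular checkpoint folder layout has to be assumed.

```python
from pathlib import Path

# Assumption: anything smaller than 1 KiB is treated as "effectively empty".
EMPTY_THRESHOLD_BYTES = 1024


def adapter_file_sizes(output_dir):
    """Return {relative path: size in bytes} for each adapter_model.bin under output_dir."""
    root = Path(output_dir)
    return {
        p.relative_to(root).as_posix(): p.stat().st_size
        for p in sorted(root.rglob("adapter_model.bin"))
        if p.is_file()
    }


def report(output_dir):
    """Print one line per adapter file, flagging the suspiciously small ones."""
    for rel_path, size in adapter_file_sizes(output_dir).items():
        flag = "EMPTY?" if size < EMPTY_THRESHOLD_BYTES else "ok"
        print(f"{size:>14,d} bytes  {flag:6s}  {rel_path}")


if __name__ == "__main__":
    report("./output")  # same directory as --output_dir in the command above
```

If every file is flagged, the adapter weights were most likely never written at any checkpoint, not just the last one.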

[Screenshots: 2023-05-28 at 9:22:44 AM and 2023-05-28 at 9:42:24 AM]

Am I just being dumb? Thanks!

disarmyouwitha, May 28 '23 14:05

No, it's a bug.

See https://github.com/artidoro/qlora/pull/44

KKcorps, May 28 '23 14:05

@KKcorps I see, thank you...

Since the adapter files weren't written properly at the checkpoints, I'm guessing I would need to retrain after the fix? =x

disarmyouwitha, May 28 '23 15:05

If you have pytorch.bin files in the checkpoint dir, then you won't need to retrain, but otherwise you might.
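To see which checkpoints are candidates for recovery, something like the sketch below could be used. It assumes the checkpoint folders follow the checkpoint-<step> naming convention, and the pattern pytorch*.bin is a guess at the weight file name mentioned above; adjust it to whatever is actually in your folders. The function name is hypothetical.

```python
import re
from pathlib import Path


def checkpoints_with_weight_files(output_dir, pattern="pytorch*.bin"):
    """Return [(step, checkpoint_dir)] sorted by step for checkpoints that
    contain at least one file matching `pattern`.

    Assumptions: folders are named checkpoint-<step>, and `pattern` is only
    a guess at the weight file name.
    """
    found = []
    for ckpt in Path(output_dir).glob("checkpoint-*"):
        match = re.fullmatch(r"checkpoint-(\d+)", ckpt.name)
        if match and ckpt.is_dir() and any(ckpt.glob(pattern)):
            found.append((int(match.group(1)), ckpt))
    return sorted(found, key=lambda item: item[0])


if __name__ == "__main__":
    usable = checkpoints_with_weight_files("./output")
    if usable:
        step, path = usable[-1]
        print(f"Latest checkpoint with weight files: step {step} at {path}")
    else:
        print("No checkpoint contains matching weight files; retraining may be needed.")
```

This only tells you whether the files exist. Whether the adapter can actually be reconstructed from them depends on what was saved into them.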

KKcorps, May 28 '23 15:05

I had some luck with my port of alpaca-lora to QLoRA. You can try it at https://github.com/vihangd/alpaca-qlora. I have only tested it on the OpenLLaMA 3B model, though.

vihangd, May 29 '23 07:05