MosaicML compatibility?
Would qlora be compatible with MosaicML's MPT models?
I don't think it is compatible.
Here is what I tried:
- Install qlora and all of its dependencies
- pip install einops
- Run training:
python qlora.py \
--model_name_or_path mosaicml/mpt-7b \
--trust_remote_code True \
--output_dir output \
--dataset alpaca \
--do_train True \
--do_eval True \
--do_mmlu_eval True \
--source_max_len 384 \
--target_max_len 128 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 4 \
--logging_steps 10 \
--max_steps 10000 \
--save_strategy steps \
--data_seed 42 \
--save_steps 1000 \
--save_total_limit 40 \
--evaluation_strategy steps \
--eval_dataset_size 1024 \
--max_eval_samples 1000 \
--eval_steps 1000 \
--optim paged_adamw_32bit
These two lines differ from the default command:
--model_name_or_path mosaicml/mpt-7b \
--trust_remote_code True
And then I get this:
/root/.cache/huggingface/modules/transformers_modules/mosaicml/mpt-7b/4ff95c4aec5c04ba509ddf517c56720541a7a487/attention.py:157: UserWarning: Using `attn_impl: torch`. If your model does not use `alibi` or `prefix_lm` we recommend using `attn_impl: flash` otherwise we recommend using `attn_impl: triton`.
warnings.warn('Using `attn_impl: torch`. If your model does not use `alibi` or ' + '`prefix_lm` we recommend using `attn_impl: flash` otherwise ' + 'we recommend using `attn_impl: triton`.')
Traceback (most recent call last):
File "/code/qlora/qlora.py", line 758, in <module>
train()
File "/code/qlora/qlora.py", line 590, in train
model = get_accelerate_model(args, checkpoint_dir)
File "/code/qlora/qlora.py", line 263, in get_accelerate_model
model = AutoModelForCausalLM.from_pretrained(
File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 485, in from_pretrained
return model_class.from_pretrained(
File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2737, in from_pretrained
raise ValueError(f"{model.__class__.__name__} does not support `device_map='{device_map}'` yet.")
Looks like they are working on it: https://huggingface.co/mosaicml/mpt-7b/discussions/23
You can work around the above error using the solution from that discussion.
First, check out the model manually:
git lfs install
git clone https://huggingface.co/mosaicml/mpt-7b
And then you manually make this one-line change: https://huggingface.co/mosaicml/mpt-7b/commit/d8a52ba8f9fb1e8127c88717d6a80792ef991774
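For context, that ValueError comes from transformers refusing to build a device map for a model class that doesn't declare _no_split_modules, and as far as I can tell the linked one-line change simply declares it. Roughly, in the checked-out modeling_mpt.py (a sketch from memory, not a copy of the commit; PreTrainedModel and MPTConfig are the ones already imported/defined in the model repo's own code):

class MPTPreTrainedModel(PreTrainedModel):
    config_class = MPTConfig
    base_model_prefix = 'model'
    # Telling accelerate which blocks must never be split across devices is what
    # enables device_map='auto'; without this attribute transformers raises the
    # "does not support device_map" ValueError shown above.
    _no_split_modules = ['MPTBlock']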
And then you set --model_name_or_path "./mpt-7b"
If you then run training again, you run into this: https://huggingface.co/mosaicml/mpt-7b/discussions/41
You can work around that by setting --gradient_checkpointing False
(which will, of course, increase the memory required for training).
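For reference, the gradient checkpointing failure happens because the remote MPT code does not opt into the transformers gradient checkpointing machinery. A model normally opts in roughly like this (a sketch with assumed names, not the actual MPT source); on top of this, the transformer body's forward also has to wrap each block in torch.utils.checkpoint.checkpoint, which is the part that actually saves memory:

from transformers import PreTrainedModel

class MPTPreTrainedModel(PreTrainedModel):
    # Lets Trainer call model.gradient_checkpointing_enable() without erroring out.
    supports_gradient_checkpointing = True

    def _set_gradient_checkpointing(self, module, value=False):
        # The Trainer toggles this flag on the transformer body; its forward then
        # has to checkpoint each block via torch.utils.checkpoint.checkpoint.
        if hasattr(module, 'gradient_checkpointing'):
            module.gradient_checkpointing = value

Since the upstream code has none of this, --gradient_checkpointing False is the simpler workaround.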
Then you will run into this:
Traceback (most recent call last):
File "/code/qlora/qlora.py", line 758, in <module>
train()
File "/code/qlora/qlora.py", line 720, in train
train_result = trainer.train(resume_from_checkpoint=checkpoint_dir)
File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1696, in train
return inner_training_loop(
File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1973, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2787, in training_step
loss = self.compute_loss(model, inputs)
File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2819, in compute_loss
outputs = model(**inputs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/peft/peft_model.py", line 686, in forward
return self.base_model(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
TypeError: MPTForCausalLM.forward() got an unexpected keyword argument 'inputs_embeds'
Not sure if that's an MPT, a transformers, or a qlora problem.
Turns out it's peft that is adding the inputs_embeds parameter to the call.
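Concretely, peft's PeftModelForCausalLM.forward always forwards an inputs_embeds keyword (normally None) to the base model, and the stock MPTForCausalLM.forward signature doesn't accept it. The patched forks below essentially just accept and ignore it; a sketch of the signature change (argument names besides inputs_embeds are illustrative, not copied from modeling_mpt.py):

def forward(self, input_ids=None, attention_mask=None, labels=None,
            inputs_embeds=None, **kwargs):
    # peft passes inputs_embeds=None for ordinary token-id inputs; accepting the
    # keyword and rejecting only a non-None value keeps the call compatible.
    if inputs_embeds is not None:
        raise NotImplementedError('inputs_embeds is not supported by this model')
    # ... rest of the original forward unchanged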
I accidentally stumbled upon this: https://huggingface.co/cekal/mpt-7b-peft-compatible
It fixes the inputs_embeds problem as well as the gradient checkpointing issue.
There is just a small problem when it's used with parallelization, which I fixed here:
https://huggingface.co/cekal/mpt-7b-peft-compatible/discussions/2
So far this trains, though I can't yet confirm whether it's actually doing anything useful (a quick sanity check is sketched after the commands below):
git clone https://huggingface.co/cekal/mpt-7b-peft-compatible
pushd mpt-7b-peft-compatible
git fetch origin refs/pr/2:pr/2
git checkout pr/2
popd
python qlora.py \
--model_name_or_path ./mpt-7b-peft-compatible \
--trust_remote_code True \
--output_dir output \
--dataset alpaca \
--do_train True \
--do_eval True \
--do_mmlu_eval True \
--source_max_len 384 \
--target_max_len 128 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 4 \
--logging_steps 10 \
--max_steps 10000 \
--save_strategy steps \
--data_seed 42 \
--save_steps 1000 \
--save_total_limit 40 \
--evaluation_strategy steps \
--eval_dataset_size 1024 \
--max_eval_samples 1000 \
--eval_steps 1000 \
--optim paged_adamw_32bit
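As a quick sanity check that the setup is sensible, you can verify that only the LoRA adapter weights are trainable once the model has been wrapped. A hypothetical helper (not part of qlora.py; call it on the model returned by get_accelerate_model):

def summarize_trainable(model):
    # With QLoRA, the 4-bit base weights are frozen and only the LoRA adapters
    # (a tiny fraction of the total) should require gradients.
    trainable = [(name, p.numel()) for name, p in model.named_parameters() if p.requires_grad]
    total = sum(p.numel() for p in model.parameters())
    print(f'{sum(c for _, c in trainable):,} trainable / {total:,} total parameters')
    print('examples:', [name for name, _ in trainable[:5]])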
Trying to make this all a bit more straightforward: https://huggingface.co/mosaicml/mpt-7b/discussions/42
The fix has been added to the main branch of mpt-7b-peft-compatible. So now you can just run this:
python qlora.py \
--model_name_or_path cekal/mpt-7b-peft-compatible \
--trust_remote_code True \
--output_dir output \
--dataset alpaca \
--do_train True \
--do_eval True \
--do_mmlu_eval True \
--source_max_len 384 \
--target_max_len 128 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 4 \
--logging_steps 10 \
--max_steps 10000 \
--save_strategy steps \
--data_seed 42 \
--save_steps 1000 \
--save_total_limit 40 \
--evaluation_strategy steps \
--eval_dataset_size 1024 \
--max_eval_samples 1000 \
--eval_steps 1000 \
--optim paged_adamw_32bit
I forked the chat model on Hugging Face, adding support for device_map auto, gradient checkpointing, and ignoring the inputs_embeds=None argument:
https://huggingface.co/Birchlabs/mosaicml-mpt-7b-chat-qlora
Have you tried training with triton flash attention with this yet?
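For reference, the mosaicml/mpt-7b model card switches attention implementations via the config, so something along these lines should select triton before loading (a sketch, untested with qlora; it needs the triton kernels installed, and qlora.py itself would have to be adapted to pass a modified config into from_pretrained):

import torch
from transformers import AutoConfig, AutoModelForCausalLM

name = 'cekal/mpt-7b-peft-compatible'  # or mosaicml/mpt-7b
config = AutoConfig.from_pretrained(name, trust_remote_code=True)
# Per the MPT model card: the attention implementation is chosen via the config.
config.attn_config['attn_impl'] = 'triton'

model = AutoModelForCausalLM.from_pretrained(
    name,
    config=config,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)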
How do you run inference with the stored model?
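For reference, the usual peft pattern is to load the (quantized) base model and then apply the saved adapter on top. A minimal sketch, assuming the adapter ended up under output/checkpoint-1000/adapter_model (the exact checkpoint path and the prompt format depend on your run):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

base_name = 'cekal/mpt-7b-peft-compatible'
adapter_path = 'output/checkpoint-1000/adapter_model'  # adjust to your checkpoint

tokenizer = AutoTokenizer.from_pretrained(base_name, trust_remote_code=True)
base = AutoModelForCausalLM.from_pretrained(
    base_name,
    trust_remote_code=True,
    device_map='auto',
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
)
model = PeftModel.from_pretrained(base, adapter_path)
model.eval()

inputs = tokenizer('Write a haiku about finetuning.', return_tensors='pt').to(base.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))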
Does this work with the 30b model?