
MosaicML compatibility?

Open • TashaSkyUp opened this issue • 10 comments

Would this be compatible?

TashaSkyUp · May 25, 2023

I don't think it is compatible.

Here is what I tried:

  1. Install qlora and all deps
  2. pip install einops
  3. Run training
python qlora.py \
    --model_name_or_path mosaicml/mpt-7b \
    --trust_remote_code True \
    --output_dir output \
    --dataset alpaca \
    --do_train True \
    --do_eval True \
    --do_mmlu_eval True \
    --source_max_len 384 \
    --target_max_len 128 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --logging_steps 10 \
    --max_steps 10000 \
    --save_strategy steps \
    --data_seed 42 \
    --save_steps 1000 \
    --save_total_limit 40 \
    --evaluation_strategy steps \
    --eval_dataset_size 1024 \
    --max_eval_samples 1000 \
    --eval_steps 1000 \
    --optim paged_adamw_32bit

Only these two lines differ from the default command:

    --model_name_or_path mosaicml/mpt-7b \
    --trust_remote_code True

And then I get this:

/root/.cache/huggingface/modules/transformers_modules/mosaicml/mpt-7b/4ff95c4aec5c04ba509ddf517c56720541a7a487/attention.py:157: UserWarning: Using `attn_impl: torch`. If your model does not use `alibi` or `prefix_lm` we recommend using `attn_impl: flash` otherwise we recommend using `attn_impl: triton`.
  warnings.warn('Using `attn_impl: torch`. If your model does not use `alibi` or ' + '`prefix_lm` we recommend using `attn_impl: flash` otherwise ' + 'we recommend using `attn_impl: triton`.')
Traceback (most recent call last):
  File "/code/qlora/qlora.py", line 758, in <module>
    train()
  File "/code/qlora/qlora.py", line 590, in train
    model = get_accelerate_model(args, checkpoint_dir)
  File "/code/qlora/qlora.py", line 263, in get_accelerate_model
    model = AutoModelForCausalLM.from_pretrained(
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 485, in from_pretrained
    return model_class.from_pretrained(
  File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2737, in from_pretrained
    raise ValueError(f"{model.__class__.__name__} does not support `device_map='{device_map}'` yet.")
ValueError: MPTForCausalLM does not support `device_map='auto'` yet.
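
For context, qlora's get_accelerate_model is essentially doing a 4-bit quantized load with an automatic device map, so the failure can be reproduced with a few lines. A simplified sketch (not the actual qlora code):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Simplified reproduction: MPT's custom modeling code (at this point in time) does not
# tell accelerate how the model may be split across devices, so device_map='auto' is rejected.
model = AutoModelForCausalLM.from_pretrained(
    'mosaicml/mpt-7b',
    trust_remote_code=True,
    device_map='auto',
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type='nf4',
    ),
)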

muelletm · May 27, 2023

Looks like they are working on it: https://huggingface.co/mosaicml/mpt-7b/discussions/23

muelletm · May 27, 2023

You can work around the above error using the solution from the discussion above.

So you check out the model manually:

git lfs install
git clone https://huggingface.co/mosaicml/mpt-7b

Then manually make this one-line change: https://huggingface.co/mosaicml/mpt-7b/commit/d8a52ba8f9fb1e8127c88717d6a80792ef991774

And then you set --model_name_or_path "./mpt-7b"
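
For context, and purely as an assumption on my part (the linked diff is authoritative), the kind of one-line fix that unblocks device_map='auto' is declaring which submodules accelerate must keep together, roughly:

# Hypothetical sketch only, inside the repo's own modeling code (where PreTrainedModel
# and MPTConfig are already imported). transformers refuses device_map when a model
# class leaves _no_split_modules unset; filling it in lets accelerate shard the model.
class MPTPreTrainedModel(PreTrainedModel):
    config_class = MPTConfig
    base_model_prefix = 'model'
    _no_split_modules = ['MPTBlock']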

If you then run training again you run into this: https://huggingface.co/mosaicml/mpt-7b/discussions/41

You can work around that by setting --gradient_checkpointing False (which will, of course, increase the memory required for training).

Then you will run into this:

Traceback (most recent call last):
  File "/code/qlora/qlora.py", line 758, in <module>
    train()
  File "/code/qlora/qlora.py", line 720, in train
    train_result = trainer.train(resume_from_checkpoint=checkpoint_dir)
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1696, in train
    return inner_training_loop(
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1973, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2787, in training_step
    loss = self.compute_loss(model, inputs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2819, in compute_loss
    outputs = model(**inputs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/peft/peft_model.py", line 686, in forward
    return self.base_model(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
TypeError: MPTForCausalLM.forward() got an unexpected keyword argument 'inputs_embeds'

Not sure if that's an MPT, a transformers, or a qlora problem.

muelletm · May 27, 2023

Turns out it's peft that is adding the inputs_embeds parameter to the call.

I accidentally stumbled upon this: https://huggingface.co/cekal/mpt-7b-peft-compatible

It fixes the inputs_embeds problem as well as the gradient-checkpointing issue.
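
For anyone curious what the inputs_embeds fix amounts to, the model's forward just has to tolerate the extra keyword that peft passes through. A rough sketch (not the fork's actual code), inside the MPT modeling file:

# Sketch: accept inputs_embeds so peft's wrapper can forward it, but reject anything
# other than None, since MPT only consumes input_ids.
def forward(self, input_ids=None, attention_mask=None, labels=None,
            inputs_embeds=None, **kwargs):
    if inputs_embeds is not None:
        raise NotImplementedError('inputs_embeds is not supported for MPT; pass input_ids')
    # ... original MPTForCausalLM.forward body continues here ...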

There is just a small problem when it's used with parallelization, which I fixed here:

https://huggingface.co/cekal/mpt-7b-peft-compatible/discussions/2

So far this trains (I can't yet confirm whether it's actually learning anything useful):

# Clone the compatible fork and check out the pull request with the parallelization fix
git clone https://huggingface.co/cekal/mpt-7b-peft-compatible
pushd mpt-7b-peft-compatible
git fetch origin refs/pr/2:pr/2
git checkout pr/2
popd

python qlora.py \
    --model_name_or_path ./mpt-7b-peft-compatible \
    --trust_remote_code True \
    --output_dir output \
    --dataset alpaca \
    --do_train True \
    --do_eval True \
    --do_mmlu_eval True \
    --source_max_len 384 \
    --target_max_len 128 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --logging_steps 10 \
    --max_steps 10000 \
    --save_strategy steps \
    --data_seed 42 \
    --save_steps 1000 \
    --save_total_limit 40 \
    --evaluation_strategy steps \
    --eval_dataset_size 1024 \
    --max_eval_samples 1000 \
    --eval_steps 1000 \
    --optim paged_adamw_32bit

muelletm · May 27, 2023

Trying to make this all a bit more straightforward: https://huggingface.co/mosaicml/mpt-7b/discussions/42

muelletm · May 27, 2023

The fix has been added to the main branch of mpt-7b-peft-compatible. So now you can just run this:

python qlora.py \
    --model_name_or_path cekal/mpt-7b-peft-compatible \
    --trust_remote_code True \
    --output_dir output \
    --dataset alpaca \
    --do_train True \
    --do_eval True \
    --do_mmlu_eval True \
    --source_max_len 384 \
    --target_max_len 128 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --logging_steps 10 \
    --max_steps 10000 \
    --save_strategy steps \
    --data_seed 42 \
    --save_steps 1000 \
    --save_total_limit 40 \
    --evaluation_strategy steps \
    --eval_dataset_size 1024 \
    --max_eval_samples 1000 \
    --eval_steps 1000 \
    --optim paged_adamw_32bit

muelletm · May 27, 2023

I forked the chat model on Hugging Face, adding support for device_map="auto", gradient checkpointing, and ignoring the inputs_embeds=None argument:
https://huggingface.co/Birchlabs/mosaicml-mpt-7b-chat-qlora

Birch-san · May 28, 2023
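
For reference, "gradient checkpointing support" in a custom transformers model of that era usually came down to the standard hooks below. A sketch of the convention (an assumption, not necessarily what the fork actually does):

from torch.utils.checkpoint import checkpoint

# Inside the modeling file: advertise support and let the Trainer toggle it per module.
class MPTPreTrainedModel(PreTrainedModel):
    supports_gradient_checkpointing = True

    def _set_gradient_checkpointing(self, module, value=False):
        if isinstance(module, MPTModel):
            module.gradient_checkpointing = value

# ...and inside MPTModel.forward, each block call is wrapped when enabled:
#     if self.gradient_checkpointing and self.training:
#         hidden_states = checkpoint(block, hidden_states, attention_mask)
#     else:
#         hidden_states = block(hidden_states, attention_mask)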

Have you tried training with this using Triton flash attention yet?

mikeybellissimo · May 29, 2023
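
For reference, MPT selects its attention implementation through the config's attn_config dict, so switching to Triton flash attention should look roughly like this (a sketch; it assumes the Triton kernels are installed, and whether it interacts well with 4-bit training here is untested):

import torch
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained('cekal/mpt-7b-peft-compatible', trust_remote_code=True)
config.attn_config['attn_impl'] = 'triton'  # default is 'torch'; 'flash' is the other option

model = AutoModelForCausalLM.from_pretrained(
    'cekal/mpt-7b-peft-compatible',
    config=config,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map='auto',
)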

(Quoting the command above: "The fix has been added to the main branch of mpt-7b-peft-compatible. So now you can just run this: python qlora.py --model_name_or_path cekal/mpt-7b-peft-compatible ...")
How do I run inference from the stored model?

gowthaml15 · May 30, 2023
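
For reference, the usual peft pattern is to load the base model the same way as for training and then attach the saved adapter. A sketch, assuming the adapter was written to output/checkpoint-1000/adapter_model (a hypothetical path; the exact layout depends on qlora.py's save callback and your save_steps):

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

base_id = 'cekal/mpt-7b-peft-compatible'
adapter_dir = 'output/checkpoint-1000/adapter_model'  # adjust to your run

tokenizer = AutoTokenizer.from_pretrained(base_id, trust_remote_code=True)
base = AutoModelForCausalLM.from_pretrained(
    base_id,
    trust_remote_code=True,
    device_map='auto',
    quantization_config=BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16),
)

# Attach the LoRA adapter saved during training, then generate as usual.
model = PeftModel.from_pretrained(base, adapter_dir)
model.eval()

prompt = '### Instruction:\nSummarize what QLoRA does.\n\n### Response:\n'
inputs = tokenizer(prompt, return_tensors='pt').to(base.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))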

Does this work with the 30b model?

SinanAkkoyun · Jun 23, 2023