DeepSpeed fails with frozen weights (e.g. only training the Llama-2 embedding layer)
Describe the bug
This bug is similar to #4055; I provide a repro here.
To Reproduce
Please put these three files in the same directory (remember to rename the first two from .txt to .py, and deepspeed_config.txt to deepspeed_config.yaml), then reproduce the result with:
accelerate launch --config_file "deepspeed_config.yaml" train_test.py --model_name "NousResearch/Llama-2-7b-hf" \
--dataset_name "smangrul/code-chat-assistant-v1" --max_seq_len 512 --max_steps 1000 --logging_steps 25 --eval_steps 100 \
--save_steps 500 --bf16 True --packing True --output_dir "full-finetune-llama-chat-asst" --per_device_train_batch_size 1 \
--gradient_accumulation_steps 1 --dataset_text_field "content" --use_gradient_checkpointing --learning_rate 5e-5 \
--lr_scheduler_type "cosine" --weight_decay 0.01 --warmup_ratio 0.03 --use_flash_attn True
train_test.txt utils.txt deepspeed_config.txt
Currently the code runs fine, but if I uncomment these three lines (147 to 149 in train_test.py), it throws the error below:
# for param in model.parameters():
#     param.requires_grad = False
# model.get_input_embeddings().requires_grad = True
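Side note on the snippet itself: assigning requires_grad = True to the module returned by get_input_embeddings() sets a plain attribute on the nn.Module and does not touch its weight tensor, so after these three lines every parameter is still frozen. A minimal sketch of a freeze that leaves the embedding weight trainable (assuming model is the transformers model built in train_test.py); the underlying DeepSpeed failure with frozen weights is the same either way:

# Freeze everything, then re-enable only the input embedding weight.
for param in model.parameters():
    param.requires_grad = False
# Operate on the weight tensor itself (or call
# model.get_input_embeddings().requires_grad_(True), which recurses
# into the module's parameters).
model.get_input_embeddings().weight.requires_grad = True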
Error:
Traceback (most recent call last):
File "/home/yuzhounie/projects/DHS-LLM-Workshop/chat_assistant/training/train_test.py", line 190, in <module>
main(args)
File "/home/yuzhounie/projects/DHS-LLM-Workshop/chat_assistant/training/train_test.py", line 184, in main
trainer.train()
File "/home/yuzhounie/miniconda3/envs/llm/lib/python3.10/site-packages/transformers/trainer.py", line 1555, in train
return inner_training_loop(
File "/home/yuzhounie/miniconda3/envs/llm/lib/python3.10/site-packages/transformers/trainer.py", line 1689, in _inner_training_loop
model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
File "/home/yuzhounie/miniconda3/envs/llm/lib/python3.10/site-packages/accelerate/accelerator.py", line 1284, in prepare
result = self._prepare_deepspeed(*args)
File "/home/yuzhounie/miniconda3/envs/llm/lib/python3.10/site-packages/accelerate/accelerator.py", line 1666, in _prepare_deepspeed
engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
File "/home/yuzhounie/miniconda3/envs/llm/lib/python3.10/site-packages/deepspeed/__init__.py", line 171, in initialize
engine = DeepSpeedEngine(args=args,
File "/home/yuzhounie/miniconda3/envs/llm/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 304, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/home/yuzhounie/miniconda3/envs/llm/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1225, in _configure_optimizer
self.optimizer = self._configure_zero_optimizer(basic_optimizer)
File "/home/yuzhounie/miniconda3/envs/llm/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1552, in _configure_zero_optimizer
optimizer = DeepSpeedZeroOptimizer_Stage3(
File "/home/yuzhounie/miniconda3/envs/llm/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 146, in __init__
self.dtype = self.optimizer.param_groups[0]['params'][0].dtype
IndexError: list index out of range
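The IndexError points at an empty optimizer param group: ZeRO stage 3 infers its dtype from param_groups[0]['params'][0], and transformers' Trainer only hands parameters with requires_grad=True to the optimizer, so freezing everything leaves that first group empty. A minimal standalone sketch of the failing condition (toy model, no DeepSpeed involved):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Embedding(10, 4), nn.Linear(4, 2))
for p in model.parameters():
    p.requires_grad = False  # everything frozen, as in the repro

# Mirrors the Trainer behaviour of optimizing only trainable params;
# a group with an empty 'params' list constructs without error.
trainable = [p for p in model.parameters() if p.requires_grad]
opt = torch.optim.AdamW([{"params": trainable}], lr=5e-5)

try:
    dtype = opt.param_groups[0]["params"][0].dtype  # what stage3.py line 146 does
except IndexError:
    print("IndexError: list index out of range, as in the traceback above")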
System info:
- OS: Ubuntu 22.04
- GPU count and types: 1 machine with 8x A100s
- Python version: 3.10.13
Launcher context: accelerate launch
I was freezing my input embeddings the same way as you, using DeepSpeed ZeRO stage 2, and the resulting weights can't be read back in; maybe related?
for param in emb.parameters():
    param.requires_grad = False
And I'm getting the same problem, where the weights can't be reloaded because of a missing emb.weight.
I've dropped a breakpoint() here: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/utils/zero_to_fp32.py#L105
and observed this:
(Pdb) [x for x in state_dict['module'] if 'emb' in x]
['_forward_module.emb.weight']
(Pdb) [x for x in state_dict[PARAM_SHAPES] if 'emb' in x]
[]
(Pdb) state_dict[FROZEN_PARAM_SHAPES]
None
So the embedding weights are in the state_dict (under 'module'), but not in state_dict[FROZEN_PARAM_SHAPES], which is None, nor in state_dict[PARAM_SHAPES].
This is as far as I've been able to debug; hopefully this helps with further debugging.
Edit: I've also confirmed that the only place in the entire state_dict where my emb shows up is under 'module':
{'module': OrderedDict([('_forward_module.emb.weight', tensor([[ ... ]]))]), ...}
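If anyone needs the frozen weights back in the meantime, the tensors under 'module' can be pulled out of the checkpoint directly instead of going through zero_to_fp32. A rough sketch, assuming the default ZeRO stage 2 checkpoint layout (the mp_rank_00_model_states.pt filename and the '_forward_module.' prefix match what I see above, but may differ per setup):

import torch

# Model-states file inside the DeepSpeed checkpoint tag directory.
ckpt = torch.load("checkpoint/mp_rank_00_model_states.pt", map_location="cpu")

# The frozen tensors only survive under 'module'; strip the wrapper
# prefix so the keys line up with the model's own state_dict.
frozen = {
    key.replace("_forward_module.", ""): tensor
    for key, tensor in ckpt["module"].items()
}
print([k for k in frozen if "emb" in k])  # e.g. ['emb.weight']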