Error when resuming from a saved checkpoint
I applied LoRA to my model and saved the model and optimizer states with Hugging Face accelerate as follows:
from accelerate import Accelerator
from peft import LoraConfig, get_peft_model
from torch.optim import AdamW

accelerator = Accelerator()

# Wrap the base model with LoRA adapters, then prepare model and optimizer for distributed training.
peft_config = LoraConfig(...)
model.enable_input_require_grads()
model = get_peft_model(model, peft_config)
model = accelerator.prepare_model(model)
optimizer = AdamW(model.parameters(), ...)
optimizer = accelerator.prepare_optimizer(optimizer)
...
accelerator.save_state(model_weight_path)
Then, I tried to resume using the saved states. However, I got the following error:
Traceback (most recent call last):
File "scripts/sft/run_train_lora.py", line 508, in <module>
main()
File "scripts/sft/run_train_lora.py", line 502, in main
run(artifact_config, train_config, experiment_config, execution_config)
File "scripts/sft/run_train_lora.py", line 335, in run
accelerator.load_state(experiment_config.resume_checkpoint_path)
File "/home/kyeongpil/venv/lib/python3.8/site-packages/accelerate/accelerator.py", line 2376, in load_state
self.state.fsdp_plugin.load_optimizer(self, opt, self._models[i], input_dir, i)
File "/home/kyeongpil/venv/lib/python3.8/site-packages/accelerate/utils/dataclasses.py", line 951, in load_optimizer
optimizer.load_state_dict(sharded_osd)
File "/home/kyeongpil/venv/lib/python3.8/site-packages/accelerate/optimizer.py", line 102, in load_state_dict
self.optimizer.load_state_dict(state_dict)
File "/home/kyeongpil/venv/lib/python3.8/site-packages/torch/optim/optimizer.py", line 244, in load_state_dict
self.__setstate__({'state': state, 'param_groups': param_groups})
File "/home/kyeongpil/venv/lib/python3.8/site-packages/torch/optim/adamw.py", line 102, in __setstate__
step_is_tensor = (len(state_values) != 0) and torch.is_tensor(state_values[0]['step'])
KeyError: 'step'
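For reference, the resume path that triggers this is just the accelerate load call after the model and optimizer have been rebuilt and prepared exactly as during training. This is a rough sketch using the same names as the save snippet above; resume_checkpoint_path stands for the directory that was passed to save_state:

# Rebuild and prepare the model and optimizer as before, then restore
# model weights, optimizer state, and RNG states from the checkpoint.
model = get_peft_model(model, peft_config)
model = accelerator.prepare_model(model)
optimizer = AdamW(model.parameters(), ...)
optimizer = accelerator.prepare_optimizer(optimizer)
accelerator.load_state(experiment_config.resume_checkpoint_path)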
I think the optimizer states for the LoRA model were not saved correctly. How can I solve this problem?
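One way to narrow this down is to load the optimizer checkpoint file that save_state wrote and check whether each parameter's state actually contains a 'step' entry, which is what AdamW's __setstate__ expects. A diagnostic sketch; the file name checkpoint/optimizer.bin is an assumption and may differ in your checkpoint directory:

import torch

# Hypothetical path: point this at the optimizer file inside the save_state output directory.
osd = torch.load("checkpoint/optimizer.bin", map_location="cpu")

print(osd.keys())  # expected: dict_keys(['state', 'param_groups'])
for param_id, param_state in osd["state"].items():
    # The traceback fails because 'step' is missing from the first parameter's state.
    print(param_id, sorted(param_state.keys()), "step" in param_state)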
Hello, what are the accelerate config and launch command? Why is it going through the FSDP-related code path (fsdp_plugin.load_optimizer)?
I use accelerate with FSDP. The following is my accelerate config:
{
  "compute_environment": "LOCAL_MACHINE",
  "deepspeed_config": {},
  "distributed_type": "FSDP",
  "downcast_bf16": "no",
  "dynamo_config": {},
  "fsdp_config": {
    "fsdp_auto_wrap_policy": "TRANSFORMER_BASED_WRAP",
    "fsdp_backward_prefetch_policy": "BACKWARD_PRE",
    "fsdp_offload_params": true,
    "fsdp_sharding_strategy": 1,
    "fsdp_state_dict_type": "FULL_STATE_DICT",
    "fsdp_transformer_layer_cls_to_wrap": "GPT2Block",
    "limit_all_gathers": true
  },
  "machine_rank": 0,
  "main_training_function": "main",
  "megatron_lm_config": {},
  "mixed_precision": "bf16",
  "num_machines": 1,
  "num_processes": 4,
  "rdzv_backend": "static",
  "same_network": true,
  "tpu_env": [],
  "tpu_use_cluster": false,
  "tpu_use_sudo": false,
  "use_cpu": false
}
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.