
Error when resuming the saved checkpoint

Open Kyeongpil opened this issue 1 year ago • 2 comments

I applied LoRA to my model and saved the model and optimizer states using Hugging Face Accelerate as follows:

from accelerate import Accelerator
from peft import LoraConfig, get_peft_model
from torch.optim import AdamW

accelerator = Accelerator()  # picks up the FSDP settings from the accelerate config

# Wrap the base model with LoRA adapters
peft_config = LoraConfig(...)
model.enable_input_require_grads()
model = get_peft_model(model, peft_config)

model = accelerator.prepare_model(model)

optimizer = AdamW(model.parameters(), ...)
optimizer = accelerator.prepare_optimizer(optimizer)

...

# Save model, optimizer and RNG states to the checkpoint directory
accelerator.save_state(model_weight_path)
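
The corresponding resume step (visible in the traceback below) is essentially:

# Load model, optimizer and RNG states back from the checkpoint directory
accelerator.load_state(experiment_config.resume_checkpoint_path)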

Then, when I try to resume from the saved state, I get the following error:

Traceback (most recent call last):
  File "scripts/sft/run_train_lora.py", line 508, in <module>
    main()
  File "scripts/sft/run_train_lora.py", line 502, in main
    run(artifact_config, train_config, experiment_config, execution_config)
  File "scripts/sft/run_train_lora.py", line 335, in run
    accelerator.load_state(experiment_config.resume_checkpoint_path)
  File "/home/kyeongpil/venv/lib/python3.8/site-packages/accelerate/accelerator.py", line 2376, in load_state
    self.state.fsdp_plugin.load_optimizer(self, opt, self._models[i], input_dir, i)
  File "/home/kyeongpil/venv/lib/python3.8/site-packages/accelerate/utils/dataclasses.py", line 951, in load_optimizer
    optimizer.load_state_dict(sharded_osd)
  File "/home/kyeongpil/venv/lib/python3.8/site-packages/accelerate/optimizer.py", line 102, in load_state_dict
    self.optimizer.load_state_dict(state_dict)
  File "/home/kyeongpil/venv/lib/python3.8/site-packages/torch/optim/optimizer.py", line 244, in load_state_dict
    self.__setstate__({'state': state, 'param_groups': param_groups})
  File "/home/kyeongpil/venv/lib/python3.8/site-packages/torch/optim/adamw.py", line 102, in __setstate__
    step_is_tensor = (len(state_values) != 0) and torch.is_tensor(state_values[0]['step'])
KeyError: 'step'

I suspect the optimizer state for the LoRA model was not saved correctly. How can I solve this problem?
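
One possible mitigation (not confirmed in this thread, just a common pattern with LoRA): pass only the trainable parameters to AdamW, so that every parameter the optimizer tracks actually accumulates state ('step', 'exp_avg', ...) that FSDP can collect and reload. A minimal sketch:

# Sketch only: restrict the optimizer to the trainable LoRA parameters.
trainable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = AdamW(trainable_params, lr=1e-4)  # lr is a placeholder value
optimizer = accelerator.prepare_optimizer(optimizer)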

Kyeongpil avatar Apr 19 '23 09:04 Kyeongpil

Hello, what are the accelerate config and launch command? Why is the FSDP code path (.fsdp_plugin.load_optimizer) being hit?

pacman100 avatar Apr 20 '23 09:04 pacman100

I use accelerate with FSDP. The following is my accelerate config:

{
    "compute_environment": "LOCAL_MACHINE",
    "deepspeed_config": {},
    "distributed_type": "FSDP",
    "downcast_bf16": "no",
    "dynamo_config": {},
    "fsdp_config": {
        "fsdp_auto_wrap_policy": "TRANSFORMER_BASED_WRAP",
        "fsdp_backward_prefetch_policy": "BACKWARD_PRE",
        "fsdp_offload_params": true,
        "fsdp_sharding_strategy": 1,
        "fsdp_state_dict_type": "FULL_STATE_DICT",
        "fsdp_transformer_layer_cls_to_wrap": "GPT2Block",
        "limit_all_gathers": true
    },
    "machine_rank": 0,
    "main_training_function": "main",
    "megatron_lm_config": {},
    "mixed_precision": "bf16",
    "num_machines": 1,
    "num_processes": 4,
    "rdzv_backend": "static",
    "same_network": true,
    "tpu_env": [],
    "tpu_use_cluster": false,
    "tpu_use_sudo": false,
    "use_cpu": false
}
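
The launch command is not shown in the thread; with a config file like the one above it would typically look something like the following (the config path and trailing arguments are placeholders, the script path is taken from the traceback):

accelerate launch --config_file fsdp_config.json scripts/sft/run_train_lora.py ...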

Kyeongpil avatar Apr 21 '23 01:04 Kyeongpil

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

github-actions[bot] avatar May 19 '23 15:05 github-actions[bot]