
Error when resuming the saved checkpoint

Open Kyeongpil opened this issue 1 year ago • 2 comments

I applied LoRA to my model and saved the model and optimizer states using Hugging Face Accelerate as follows:

from accelerate import Accelerator
from peft import LoraConfig, get_peft_model
from torch.optim import AdamW

accelerator = Accelerator()  # picks up the FSDP settings from the accelerate config

# Wrap the base model with LoRA adapters
peft_config = LoraConfig(...)
model.enable_input_require_grads()
model = get_peft_model(model, peft_config)

model = accelerator.prepare_model(model)

optimizer = AdamW(model.parameters(), ...)
optimizer = accelerator.prepare_optimizer(optimizer)

...

# Save model, optimizer and RNG states to the checkpoint directory
accelerator.save_state(model_weight_path)
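
The corresponding resume step (visible in the traceback below) is essentially:

# Load model, optimizer and RNG states back from the checkpoint directory
accelerator.load_state(experiment_config.resume_checkpoint_path)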

Then, when I try to resume from the saved state, I get the following error:

Traceback (most recent call last):
  File "scripts/sft/run_train_lora.py", line 508, in <module>
    main()
  File "scripts/sft/run_train_lora.py", line 502, in main
    run(artifact_config, train_config, experiment_config, execution_config)
  File "scripts/sft/run_train_lora.py", line 335, in run
    accelerator.load_state(experiment_config.resume_checkpoint_path)
  File "/home/kyeongpil/venv/lib/python3.8/site-packages/accelerate/accelerator.py", line 2376, in load_state
    self.state.fsdp_plugin.load_optimizer(self, opt, self._models[i], input_dir, i)
  File "/home/kyeongpil/venv/lib/python3.8/site-packages/accelerate/utils/dataclasses.py", line 951, in load_optimizer
    optimizer.load_state_dict(sharded_osd)
  File "/home/kyeongpil/venv/lib/python3.8/site-packages/accelerate/optimizer.py", line 102, in load_state_dict
    self.optimizer.load_state_dict(state_dict)
  File "/home/kyeongpil/venv/lib/python3.8/site-packages/torch/optim/optimizer.py", line 244, in load_state_dict
    self.__setstate__({'state': state, 'param_groups': param_groups})
  File "/home/kyeongpil/venv/lib/python3.8/site-packages/torch/optim/adamw.py", line 102, in __setstate__
    step_is_tensor = (len(state_values) != 0) and torch.is_tensor(state_values[0]['step'])
KeyError: 'step'

I suspect the optimizer state for the LoRA model was not saved correctly. How can I solve this problem?
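
One possible mitigation (not confirmed in this thread, just a common pattern with LoRA): pass only the trainable parameters to AdamW, so that every parameter the optimizer tracks actually accumulates state ('step', 'exp_avg', ...) that FSDP can collect and reload. A minimal sketch:

# Sketch only: restrict the optimizer to the trainable LoRA parameters.
trainable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = AdamW(trainable_params, lr=1e-4)  # lr is a placeholder value
optimizer = accelerator.prepare_optimizer(optimizer)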

Kyeongpil avatar Apr 19 '23 09:04 Kyeongpil

Hello, what are the accelerate config and launch command? Why is the FSDP code path (.fsdp_plugin.load_optimizer) being hit?

pacman100 avatar Apr 20 '23 09:04 pacman100

I use accelerate with FSDP. The following is my accelerate config:

{
    "compute_environment": "LOCAL_MACHINE",
    "deepspeed_config": {},
    "distributed_type": "FSDP",
    "downcast_bf16": "no",
    "dynamo_config": {},
    "fsdp_config": {
        "fsdp_auto_wrap_policy": "TRANSFORMER_BASED_WRAP",
        "fsdp_backward_prefetch_policy": "BACKWARD_PRE",
        "fsdp_offload_params": true,
        "fsdp_sharding_strategy": 1,
        "fsdp_state_dict_type": "FULL_STATE_DICT",
        "fsdp_transformer_layer_cls_to_wrap": "GPT2Block",
        "limit_all_gathers": true
    },
    "machine_rank": 0,
    "main_training_function": "main",
    "megatron_lm_config": {},
    "mixed_precision": "bf16",
    "num_machines": 1,
    "num_processes": 4,
    "rdzv_backend": "static",
    "same_network": true,
    "tpu_env": [],
    "tpu_use_cluster": false,
    "tpu_use_sudo": false,
    "use_cpu": false
}
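
The launch command is not shown in the thread; with a config file like the one above it would typically look something like the following (the config path and trailing arguments are placeholders, the script path is taken from the traceback):

accelerate launch --config_file fsdp_config.json scripts/sft/run_train_lora.py ...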

Kyeongpil avatar Apr 21 '23 01:04 Kyeongpil

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

github-actions[bot] avatar May 19 '23 15:05 github-actions[bot]