
Unable to resume Deepspeed Zero 1 training

Open · chimezie opened this issue 6 months ago • 8 comments

Please check that this issue hasn't been reported before.

  • [x] I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

A previous training run ended prematurely (#2633), and I'm trying to resume it using auto_resume_from_checkpoints, but I'm unable to do so and instead get a _pickle.UnpicklingError: Weights only load failed error. The expected behavior is that training resumes from the most recent checkpoint.

Current behaviour

After relaunching the training with the same hardware configuration and the same image, exactly as they were at the time of the failure that interrupted the original run, I get the following:

[..snip..]
[2025-05-05 01:37:30,809] [DEBUG] [axolotl.train.setup_model_and_tokenizer:80] [PID:25] [RANK:0] loading model and peft_config...
You have set `use_cache` to `False`, but cache_implementation is set to hybrid. cache_implementation will have no effect.
[..snip..]
[2025-05-05 01:38:50,662] [INFO] [axolotl.train.save_initial_configs:361] [PID:25] [RANK:0] Pre-saving adapter config to ./lora-out...
[2025-05-05 01:38:50,663] [INFO] [axolotl.train.save_initial_configs:365] [PID:25] [RANK:0] Pre-saving tokenizer to ./lora-out...
[2025-05-05 01:38:51,270] [INFO] [axolotl.train.save_initial_configs:368] [PID:25] [RANK:0] Pre-saving model config to ./lora-out...
[2025-05-05 01:38:51,291] [INFO] [axolotl.train.save_initial_configs:372] [PID:25] [RANK:0] Pre-saving processor to ./lora-out...
[2025-05-05 01:38:55,524] [INFO] [axolotl.train.determine_resume_checkpoint:143] [PID:25] [RANK:0] Using Auto-resume functionality to start with checkpoint at lora-out/checkpoint-9318
[2025-05-05 01:38:55,525] [INFO] [axolotl.train.execute_training:213] [PID:25] [RANK:0] Starting trainer...
[2025-05-05 01:39:12,729] [WARNING] [engine.py:1232:_do_optimizer_sanity_check] **** You are using ZeRO with an untested optimizer, proceed with caution
[..snip..]
[rank2]:   File "/workspace/axolotl/src/axolotl/cli/train.py", line 121, in <module>
[rank2]:     fire.Fire(do_cli)
[..snip..]
[rank2]:   File "/workspace/axolotl/src/axolotl/cli/train.py", line 51, in do_train
[rank2]:     model, tokenizer, trainer = train(cfg=cfg, dataset_meta=dataset_meta)
[rank2]:                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/workspace/axolotl/src/axolotl/train.py", line 529, in train
[rank2]:     execute_training(cfg, trainer, resume_from_checkpoint)
[rank2]:   File "/workspace/axolotl/src/axolotl/train.py", line 215, in execute_training
[rank2]:     trainer.train(resume_from_checkpoint=resume_from_checkpoint)
[rank2]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/transformers/trainer.py", line 2245, in train
[rank2]:     return inner_training_loop(
[rank2]:            ^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/transformers/trainer.py", line 2398, in _inner_training_loop
[rank2]:     deepspeed_load_checkpoint(
[rank2]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/transformers/integrations/deepspeed.py", line 489, in deepspeed_load_checkpoint
[rank2]:     load_path, _ = deepspeed_engine.load_checkpoint(
[rank2]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 2826, in load_checkpoint
[rank2]:     success = self._load_zero_checkpoint(load_dir, tag, load_optimizer_states=load_optimizer_states)
[rank2]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 3011, in _load_zero_checkpoint
[rank2]:     zero_sd_list = self._get_all_zero_checkpoints(load_dir, tag)
[rank2]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 3088, in _get_all_zero_checkpoints
[rank2]:     return self._get_all_zero_checkpoint_state_dicts(zero_ckpt_names)
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 3067, in _get_all_zero_checkpoint_state_dicts
[rank2]:     _state = self.checkpoint_engine.load(
[rank2]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/deepspeed/runtime/checkpoint_engine/torch_checkpoint_engine.py", line 28, in load
[rank2]:     partition = torch.load(path, map_location=map_location)
[rank2]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/serialization.py", line 1470, in load
[rank2]:     raise pickle.UnpicklingError(_get_wo_message(str(e))) from None
[rank2]: _pickle.UnpicklingError: Weights only load failed. This file can still be loaded, to do so you have two options, do those steps only if you trust the source of the checkpoint. 
[rank2]:        (1) In PyTorch 2.6, we changed the default value of the `weights_only` argument in `torch.load` from `False` to `True`. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
[rank2]:        (2) Alternatively, to load with `weights_only=True` please check the recommended steps in the following error message.
[rank2]:        WeightsUnpickler error: Unsupported global: GLOBAL deepspeed.runtime.fp16.loss_scaler.LossScaler was not an allowed global by default. Please use `torch.serialization.add_safe_globals([LossScaler])` or the `torch.serialization.safe_globals([LossScaler])` context manager to allowlist this global if you trust this class/function.
[rank2]: Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.
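
For reference, option (2) in the error message points at an allowlisting route. A minimal, untested sketch of that idea, only appropriate if the checkpoint files are trusted, and assuming it can be run before the trainer loads the checkpoint (for example via a small wrapper script around axolotl.cli.train; that wrapper is assumed here, not something axolotl documents):

# Sketch of the allowlisting route suggested by the error message above.
# Only do this for checkpoints you trust: it tells torch.load's weights_only
# unpickler that DeepSpeed's LossScaler class is safe to reconstruct.
import torch.serialization
from deepspeed.runtime.fp16.loss_scaler import LossScaler

torch.serialization.add_safe_globals([LossScaler])

# Other DeepSpeed classes can trigger the same UnpicklingError; each one would
# have to be added to the allowlist as it shows up in the error message.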

Steps to reproduce

It was executed via:

accelerate launch -m axolotl.cli.train ./config.yml

Config yaml

base_model: google/gemma-3-27b-it
#model_type: AutoModelForCausalLM
#tokenizer_type: AutoTokenizer

deepspeed: /path/to/zero1.json

load_in_8bit: false
load_in_4bit: true
strict: false

#rl: orpo
#orpo_alpha: 0.1

datasets:
  - path: ../path/to/data.jsonl
    type: #alpaca
        system_prompt: ""
        field_system: system
        field_instruction: instruction
        field_output: output
        format: "<start_of_turn>user\n{input}<end_of_turn>\n<start_of_turn>model"
        no_input_format: "<start_of_turn>user\n{instruction}<end_of_turn>\n<start_of_turn>model"

#        no_input_format: "<|im_start|>system\n.<|im_end|>\n<|im_start|>user\n{instruction}<|im_end|>\n<|im_start|>assistant\n"

#  - path: argilla/ultrafeedback-binarized-preferences-cleaned
#    type: chat_template.argilla
#    chat_template: chatml

dataset_prepared_path: last_run_prepared # -- XX Not in their configs
val_set_size: 0.01
output_dir: ./lora-out

adapter: qlora

sequence_len: 2048 #2048 w/out orpo and 4096 w/
sample_packing: true   # - only if not doing ORPO/DPO
pad_to_sequence_len: true

save_safetensors: true

lora_r: 64
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
lora_target_modules:
  - gate_proj
  - down_proj
  - up_proj
  - q_proj
  - v_proj
  - k_proj
  - o_proj

wandb_project: dibia-axolotl
wandb_entity:
wandb_watch:
wandb_name: dibia-gemma-3
wandb_log_model:

gradient_accumulation_steps: 1
micro_batch_size: 2
num_epochs: 2
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.000005

train_on_inputs: false
group_by_length: false
bf16: true
fp16:
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 100
#save_strategy: "no"
save_steps: .25
xformers_attention:
flash_attention: true

#loss_watchdog_threshold: 5.0
#loss_watchdog_patience: 3

warmup_steps: 10
evals_per_epoch: 4
eval_table_size:
eval_max_new_tokens: 128
#saves_per_epoch: 1
debug:
weight_decay: 0.0
fsdp:
fsdp_config:

special_tokens:
  bos_token: "<bos>"
  eos_token: "<eos>"
  unk_token: "<unk>"
#  pad_token: "[PAD]"

Possible solution

No response

Which Operating Systems are you using?

  • [x] Linux
  • [ ] macOS
  • [ ] Windows

Python Version

3.11

axolotl branch-commit

winglian/axolotl:0.9.0 Docker Image

Acknowledgements

  • [x] My issue title is concise, descriptive, and in title casing.
  • [x] I have searched the existing issues to make sure this bug has not been reported yet.
  • [x] I am using the latest version of axolotl.
  • [x] I have provided enough information for the maintainers to reproduce and diagnose the issue.

chimezie avatar May 05 '25 14:05 chimezie

This is a known recent upstream change made to prevent a CVE, since torch.load is unsafe by default. We don't have a workaround for this yet, unfortunately. I'll keep you posted as we figure something out.

winglian avatar May 07 '25 04:05 winglian

This is a related PR; I believe there was a similar change in transformers prior to this as well: https://github.com/huggingface/transformers/pull/37785

winglian avatar May 07 '25 04:05 winglian

Also related: https://github.com/huggingface/transformers/pull/36991

winglian avatar May 07 '25 04:05 winglian

I think this is a duplicate of #2610. Can you try upgrading your deepspeed version?
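
For example, a quick way to confirm what is actually installed in the training environment (a small sketch; it just assumes the packages import cleanly):

# Print the versions of the packages involved in this failure path.
import deepspeed
import torch
import transformers

print("deepspeed:", deepspeed.__version__)
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)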

NanoCode012 avatar May 07 '25 05:05 NanoCode012

Hi @NanoCode012, I have updated deepspeed to the newest version and I still hit the bug:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/inspire/ssd/project/robot3d/mazipei-253107140027/WorldActionModel/main.py", line 69, in <module>
[rank0]:     main()
[rank0]:   File "/inspire/ssd/project/robot3d/mazipei-253107140027/WorldActionModel/main.py", line 45, in main
[rank0]:     runner.train()
[rank0]:   File "/inspire/ssd/project/robot3d/mazipei-253107140027/WorldActionModel/runner/ltx_video_trainer.py", line 489, in train
[rank0]:     self.state.accelerator.load_state(resume_dir, weights_only=False)
[rank0]:   File "/opt/miniconda3/envs/genie_envisioner/lib/python3.10/site-packages/accelerate/accelerator.py", line 3690, in load_state
[rank0]:     model.load_checkpoint(input_dir, ckpt_id, **load_model_func_kwargs)
[rank0]: TypeError: DeepSpeedEngine.load_checkpoint() got an unexpected keyword argument 'weights_only'
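
For what it's worth, the trace shows Accelerator.load_state forwarding its extra keyword arguments to DeepSpeedEngine.load_checkpoint, which has no weights_only parameter. A simplified sketch of that call pattern (resume_dir and the prepare step are placeholders, not the actual trainer code):

# Simplified sketch of the failing call pattern above; an active DeepSpeed
# plugin is assumed and resume_dir is a placeholder path.
from accelerate import Accelerator

accelerator = Accelerator()
# model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

resume_dir = "outputs/checkpoint-100"  # placeholder

# Fails when DeepSpeed is active: the extra kwarg is forwarded to
# DeepSpeedEngine.load_checkpoint(), which does not accept weights_only.
# accelerator.load_state(resume_dir, weights_only=False)

# Avoids the TypeError by letting DeepSpeed handle its own checkpoint loading;
# whether torch.load then succeeds still depends on the torch/DeepSpeed versions.
accelerator.load_state(resume_dir)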

My deepspeed:

deepspeed-0.18.0

Any advice would be greatly appreciated.

xiao10ma avatar Oct 13 '25 15:10 xiao10ma

@xiao10ma, I'm not familiar with that stack trace. I believe you may be using custom code, and the error is outside of Axolotl?

NanoCode012 avatar Oct 14 '25 09:10 NanoCode012

@NanoCode012 Sorry, I think it's a bug in accelerate.

xiao10ma avatar Oct 19 '25 08:10 xiao10ma

Hey @xiao10ma, try to keep package versions within those listed in the requirements.txt; these are the versions we check against. Can you see if upgrading your versions to match the current ones works for you?

NanoCode012 avatar Oct 20 '25 07:10 NanoCode012