Unable to resume Deepspeed Zero 1 training
Please check that this issue hasn't been reported before.
- [x] I searched previous Bug Reports and didn't find any similar reports.
Expected Behavior
I had a previous training run end prematurely (#2633), and I'm trying to resume it using `auto_resume_from_checkpoints`, but I'm unable to do so: the run fails with a `_pickle.UnpicklingError: Weights only load failed` error. The expected behavior is that training resumes from the most recent checkpoint.
Current behaviour
After relaunching the training with the same hardware configuration, inside the same image, and with everything exactly as it was at the time of the original failure that interrupted the run, I get the following:
[..snip..]
[2025-05-05 01:37:30,809] [DEBUG] [axolotl.train.setup_model_and_tokenizer:80] [PID:25] [RANK:0] loading model and peft_config...
You have set `use_cache` to `False`, but cache_implementation is set to hybrid. cache_implementation will have no effect.
[..snip..]
[2025-05-05 01:38:50,662] [INFO] [axolotl.train.save_initial_configs:361] [PID:25] [RANK:0] Pre-saving adapter config to ./lora-out...
[2025-05-05 01:38:50,663] [INFO] [axolotl.train.save_initial_configs:365] [PID:25] [RANK:0] Pre-saving tokenizer to ./lora-out...
[2025-05-05 01:38:51,270] [INFO] [axolotl.train.save_initial_configs:368] [PID:25] [RANK:0] Pre-saving model config to ./lora-out...
[2025-05-05 01:38:51,291] [INFO] [axolotl.train.save_initial_configs:372] [PID:25] [RANK:0] Pre-saving processor to ./lora-out...
[2025-05-05 01:38:55,524] [INFO] [axolotl.train.determine_resume_checkpoint:143] [PID:25] [RANK:0] Using Auto-resume functionality to start with checkpoint at lora-out/checkpoint-9318
[2025-05-05 01:38:55,525] [INFO] [axolotl.train.execute_training:213] [PID:25] [RANK:0] Starting trainer...
[2025-05-05 01:39:12,729] [WARNING] [engine.py:1232:_do_optimizer_sanity_check] **** You are using ZeRO with an untested optimizer, proceed with caution
[..snip..]
[rank2]: File "/workspace/axolotl/src/axolotl/cli/train.py", line 121, in <module>
[rank2]: fire.Fire(do_cli)
[..snip..]
[rank2]: File "/workspace/axolotl/src/axolotl/cli/train.py", line 51, in do_train
[rank2]: model, tokenizer, trainer = train(cfg=cfg, dataset_meta=dataset_meta)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/workspace/axolotl/src/axolotl/train.py", line 529, in train
[rank2]: execute_training(cfg, trainer, resume_from_checkpoint)
[rank2]: File "/workspace/axolotl/src/axolotl/train.py", line 215, in execute_training
[rank2]: trainer.train(resume_from_checkpoint=resume_from_checkpoint)
[rank2]: File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/transformers/trainer.py", line 2245, in train
[rank2]: return inner_training_loop(
[rank2]: ^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/transformers/trainer.py", line 2398, in _inner_training_loop
[rank2]: deepspeed_load_checkpoint(
[rank2]: File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/transformers/integrations/deepspeed.py", line 489, in deepspeed_load_checkpoint
[rank2]: load_path, _ = deepspeed_engine.load_checkpoint(
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 2826, in load_checkpoint
[rank2]: success = self._load_zero_checkpoint(load_dir, tag, load_optimizer_states=load_optimizer_states)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 3011, in _load_zero_checkpoint
[rank2]: zero_sd_list = self._get_all_zero_checkpoints(load_dir, tag)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 3088, in _get_all_zero_checkpoints
[rank2]: return self._get_all_zero_checkpoint_state_dicts(zero_ckpt_names)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 3067, in _get_all_zero_checkpoint_state_dicts
[rank2]: _state = self.checkpoint_engine.load(
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/deepspeed/runtime/checkpoint_engine/torch_checkpoint_engine.py", line 28, in load
[rank2]: partition = torch.load(path, map_location=map_location)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/serialization.py", line 1470, in load
[rank2]: raise pickle.UnpicklingError(_get_wo_message(str(e))) from None
[rank2]: _pickle.UnpicklingError: Weights only load failed. This file can still be loaded, to do so you have two options, do those steps only if you trust the source of the checkpoint.
[rank2]: (1) In PyTorch 2.6, we changed the default value of the `weights_only` argument in `torch.load` from `False` to `True`. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
[rank2]: (2) Alternatively, to load with `weights_only=True` please check the recommended steps in the following error message.
[rank2]: WeightsUnpickler error: Unsupported global: GLOBAL deepspeed.runtime.fp16.loss_scaler.LossScaler was not an allowed global by default. Please use `torch.serialization.add_safe_globals([LossScaler])` or the `torch.serialization.safe_globals([LossScaler])` context manager to allowlist this global if you trust this class/function.
[rank2]: Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.
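For reference, the error text above points at torch.serialization.add_safe_globals as an escape hatch. A minimal, untested sketch of that route, assuming the checkpoint files are trusted and that LossScaler is the only blocked global (ZeRO optimizer states may reference others), would be to run something like this in the training process before the checkpoint is loaded:

# Untested sketch: allowlist the DeepSpeed class referenced by the ZeRO optimizer
# state files so that torch.load with weights_only=True (the PyTorch 2.6 default)
# will accept it. Only do this for checkpoints you produced yourself.
import torch.serialization
from deepspeed.runtime.fp16.loss_scaler import LossScaler

torch.serialization.add_safe_globals([LossScaler])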
Steps to reproduce
It was executed via:
accelerate launch -m axolotl.cli.train ./config.yml
Config yaml
base_model: google/gemma-3-27b-it
#model_type: AutoModelForCausalLM
#tokenizer_type: AutoTokenizer
deepspeed: /path/to/zero1.json
load_in_8bit: false
load_in_4bit: true
strict: false
#rl: orpo
#orpo_alpha: 0.1
datasets:
- path: ../path/to/data.jsonl
type: #alpaca
system_prompt: ""
field_system: system
field_instruction: instruction
field_output: output
format: "<start_of_turn>user\n{input}<end_of_turn>\n<start_of_turn>model"
no_input_format: "<start_of_turn>user\n{instruction}<end_of_turn>\n<start_of_turn>model"
# no_input_format: "<|im_start|>system\n.<|im_end|>\n<|im_start|>user\n{instruction}<|im_end|>\n<|im_start|>assistant\n"
# - path: argilla/ultrafeedback-binarized-preferences-cleaned
# type: chat_template.argilla
# chat_template: chatml
dataset_prepared_path: last_run_prepared # -- XX Not in their configs
val_set_size: 0.01
output_dir: ./lora-out
adapter: qlora
sequence_len: 2048 #2048 w/out orpo and 4096 w/
sample_packing: true # - only if not doing ORPO/DPO
pad_to_sequence_len: true
save_safetensors: true
lora_r: 64
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
lora_target_modules:
- gate_proj
- down_proj
- up_proj
- q_proj
- v_proj
- k_proj
- o_proj
wandb_project: dibia-axolotl
wandb_entity:
wandb_watch:
wandb_name: dibia-gemma-3
wandb_log_model:
gradient_accumulation_steps: 1
micro_batch_size: 2
num_epochs: 2
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.000005
train_on_inputs: false
group_by_length: false
bf16: true
fp16:
tf32: false
gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 100
#save_strategy: "no"
save_steps: .25
xformers_attention:
flash_attention: true
#loss_watchdog_threshold: 5.0
#loss_watchdog_patience: 3
warmup_steps: 10
evals_per_epoch: 4
eval_table_size:
eval_max_new_tokens: 128
#saves_per_epoch: 1
debug:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
bos_token: "<bos>"
eos_token: "<eos>"
unk_token: "<unk>"
# pad_token: "[PAD]"
Possible solution
No response
Which Operating Systems are you using?
- [x] Linux
- [ ] macOS
- [ ] Windows
Python Version
3.11
axolotl branch-commit
winglian/axolotl:0.9.0 Docker Image
Acknowledgements
- [x] My issue title is concise, descriptive, and in title casing.
- [x] I have searched the existing issues to make sure this bug has not been reported yet.
- [x] I am using the latest version of axolotl.
- [x] I have provided enough information for the maintainers to reproduce and diagnose the issue.
This is a known recent upstream change made to prevent a CVE, since `torch.load` is unsafe. Unfortunately, we don't have a workaround for this yet. I'll keep you posted as we figure something out.
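For context, the breakage comes from the `torch.load` default flip described in the error message. A rough illustration (the path below is a placeholder, not a real checkpoint file):

import torch

# PyTorch >= 2.6: weights_only defaults to True and rejects pickled classes
# such as deepspeed's LossScaler found inside ZeRO optimizer state files.
state = torch.load("optim_states.pt")

# Pre-2.6 behavior: deserializes arbitrary pickles (hence the CVE concern);
# only safe for checkpoints from a trusted source.
state = torch.load("optim_states.pt", weights_only=False)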
This is a related PR; I believe there was a similar change in transformers prior to this as well: https://github.com/huggingface/transformers/pull/37785
Also related: https://github.com/huggingface/transformers/pull/36991
I think this is a duplicate of #2610. Can you give upgrading your deepspeed version a try?
Hi @NanoCode012, I have updated deepspeed to the newest version and I still hit the bug:
[rank0]: Traceback (most recent call last):
[rank0]: File "/inspire/ssd/project/robot3d/mazipei-253107140027/WorldActionModel/main.py", line 69, in <module>
[rank0]: main()
[rank0]: File "/inspire/ssd/project/robot3d/mazipei-253107140027/WorldActionModel/main.py", line 45, in main
[rank0]: runner.train()
[rank0]: File "/inspire/ssd/project/robot3d/mazipei-253107140027/WorldActionModel/runner/ltx_video_trainer.py", line 489, in train
[rank0]: self.state.accelerator.load_state(resume_dir, weights_only=False)
[rank0]: File "/opt/miniconda3/envs/genie_envisioner/lib/python3.10/site-packages/accelerate/accelerator.py", line 3690, in load_state
[rank0]: model.load_checkpoint(input_dir, ckpt_id, **load_model_func_kwargs)
[rank0]: TypeError: DeepSpeedEngine.load_checkpoint() got an unexpected keyword argument 'weights_only'
My deepspeed:
deepspeed-0.18.0
Any advice would be greatly appreciated.
@xiao10ma, I'm not familiar with that stack trace. I believe you may be using custom code, and the error is outside of Axolotl?
@NanoCode012 Sorry, I think it's a bug in accelerate.
Hey @xiao10ma, try to keep package versions within the versions listed in the requirements.txt; these are the versions we check against. Can you see if upgrading your versions to match works for you?