Training finishes but errors out while trying to write out final LoRA weights (unable to resume)
Please check that this issue hasn't been reported before.
- [x] I searched previous Bug Reports and didn't find any similar reports.
Expected Behavior
I ran a QLoRA fine-tune of Gemma 3 27B IT on a private dataset across 4 A100 40GB cards using DeepSpeed ZeRO-1. The run trains for two epochs and completes, but appears to die while trying to save the final model. The expected behavior is for the training run to complete without error, or for me to be able to continue from where it errored out to obtain the final LoRA weights and merged model.
Current behaviour
The run is launched in a Docker container built from the winglian/axolotl:0.9.0 Docker image. The Docker log at the point of failure is below:
[..snip..]
{'eval_loss': 0.47610339522361755, 'eval_runtime': 284.6172, 'eval_samples_per_second': 36.751, 'eval_steps_per_second': 4.596, 'epoch': 1.5}
[..snip..]
{'loss': 0.3956, 'grad_norm': 0.7203356027603149, 'learning_rate': 3.101717419102701e-08, 'epoch': 1.9}
{'loss': 0.4055, 'grad_norm': 0.7273358702659607, 'learning_rate': 2.18723433556009e-08, 'epoch': 1.92}
{'loss': 0.4046, 'grad_norm': 0.6099262833595276, 'learning_rate': 1.4315023363467296e-08, 'epoch': 1.93}
{'loss': 0.4023, 'grad_norm': 0.7447728514671326, 'learning_rate': 8.350055501717136e-09, 'epoch': 1.95}
{'loss': 0.4019, 'grad_norm': 0.786586344242096, 'learning_rate': 3.9812609823594584e-09, 'epoch': 1.96}
{'loss': 0.4063, 'grad_norm': 0.734952986240387, 'learning_rate': 1.2114384944172942e-09, 'epoch': 1.98}
{'loss': 0.4107, 'grad_norm': 0.6151812672615051, 'learning_rate': 4.2362411064034157e-11, 'epoch': 2.0}
{'train_runtime': 70287.9947, 'train_samples_per_second': 29.464, 'train_steps_per_second': 0.177, 'train_loss': 0.4729765019273244, 'epoch': 2.0}
[2025-05-04 21:02:24,370] [INFO] [axolotl.train.save_trained_model:233] [PID:25] [RANK:0] Training completed! Saving pre-trained model to ./lora-out.
[rankN]: Traceback (most recent call last):
[..snip..]
[rankN]: File "/workspace/axolotl/src/axolotl/cli/train.py", line 51, in do_train
[rankN]: model, tokenizer, trainer = train(cfg=cfg, dataset_meta=dataset_meta)
[rankN]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rankN]: File "/workspace/axolotl/src/axolotl/train.py", line 535, in train
[rankN]: cleanup_distributed()
[rankN]: File "/workspace/axolotl/src/axolotl/utils/distributed.py", line 99, in cleanup_distributed
[rankN]: torch.distributed.destroy_process_group()
[rankN]: File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2146, in destroy_process_group
[rankN]: _shutdown_backend(pg_to_shutdown)
[rankN]: File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1815, in _shutdown_backend
[rankN]: backend._shutdown()
[rankN]: torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.cpp:133, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
[rankN]: ncclUnhandledCudaError: Call to CUDA function failed.
[rankN]: Last error:
[rankN]: Cuda failure 'out of memory'
Oddly, there was no memory issue until after training finished. Below is the container's GPU memory use during the run:
What steps can I take to continue where it left off, saving the final LoRA weights and merging them into a model?
The process that was interrupted by the memory error was about to invoke axolotl.cli.merge_lora with the configuration file. I don't think auto_resume_from_checkpoints: true would be relevant, since the last checkpoint before completion was only 1.5 epochs in, but I don't know whether the final LoRA weights were ever saved. Do I have to resume training from the 1.5-epoch checkpoint (with more memory to avoid the error) to complete the run properly?
Any help would be greatly appreciated.
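For reference, if the final save really did fail and resuming turns out to be necessary, one hedged sketch (using the resume_from_checkpoint key that is already present but empty in the config below, and the last full checkpoint from the output_dir listing further down) would be:

```yaml
# sketch only: point the existing, currently empty key at the last full checkpoint
resume_from_checkpoint: ./lora-out/checkpoint-9318
```

and then relaunch with the same command: $ accelerate launch -m axolotl.cli.train ./config.yml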
Steps to reproduce
The dataset is private and can't be shared, but the training was launched this way: $ accelerate launch -m axolotl.cli.train ./config.yml
Config yaml
base_model: google/gemma-3-27b-it
#model_type: AutoModelForCausalLM
#tokenizer_type: AutoTokenizer
deepspeed: /path/to/zero1.json
load_in_8bit: false
load_in_4bit: true
strict: false
#rl: orpo
#orpo_alpha: 0.1
datasets:
- path: ../path/to/data.jsonl
type: #alpaca
system_prompt: ""
field_system: system
field_instruction: instruction
field_output: output
format: "<start_of_turn>user\n{input}<end_of_turn>\n<start_of_turn>model"
no_input_format: "<start_of_turn>user\n{instruction}<end_of_turn>\n<start_of_turn>model"
# no_input_format: "<|im_start|>system\n.<|im_end|>\n<|im_start|>user\n{instruction}<|im_end|>\n<|im_start|>assistant\n"
# - path: argilla/ultrafeedback-binarized-preferences-cleaned
# type: chat_template.argilla
# chat_template: chatml
dataset_prepared_path: last_run_prepared # -- XX Not in their configs
val_set_size: 0.01
output_dir: ./lora-out
adapter: qlora
sequence_len: 2048 #2048 w/out orpo and 4096 w/
sample_packing: true # - only if not doing ORPO/DPO
pad_to_sequence_len: true
save_safetensors: true
lora_r: 64
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
lora_target_modules:
- gate_proj
- down_proj
- up_proj
- q_proj
- v_proj
- k_proj
- o_proj
wandb_project: dibia-axolotl
wandb_entity:
wandb_watch:
wandb_name: dibia-gemma-3
wandb_log_model:
gradient_accumulation_steps: 1
micro_batch_size: 2
num_epochs: 2
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.000005
train_on_inputs: false
group_by_length: false
bf16: true
fp16:
tf32: false
gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 100
#save_strategy: "no"
save_steps: .25
xformers_attention:
flash_attention: true
#loss_watchdog_threshold: 5.0
#loss_watchdog_patience: 3
warmup_steps: 10
evals_per_epoch: 4
eval_table_size:
eval_max_new_tokens: 128
#saves_per_epoch: 1
debug:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
bos_token: "<bos>"
eos_token: "<eos>"
unk_token: "<unk>"
Possible solution
No response
Which Operating Systems are you using?
- [ ] Linux
- [ ] macOS
- [ ] Windows
Python Version
3.11
axolotl branch-commit
winglian/axolotl:0.9.0
Acknowledgements
- [x] My issue title is concise, descriptive, and in title casing.
- [x] I have searched the existing issues to make sure this bug has not been reported yet.
- [x] I am using the latest version of axolotl.
- [x] I have provided enough information for the maintainers to reproduce and diagnose the issue.
Below is the directory of output_dir:
total 660300
drwxr-xr-x 1 root root 262 May 5 00:43 .
drwxr-xr-x 1 root root 91 May 4 01:30 ..
-rw-r--r-- 1 root root 4394 May 5 00:12 README.md
-rw-r--r-- 1 root root 892 May 5 01:38 adapter_config.json
-rw-r--r-- 1 root root 636899136 May 5 00:12 adapter_model.safetensors
-rw-r--r-- 1 root root 35 May 5 01:38 added_tokens.json
-rw-r--r-- 1 root root 1615 May 5 01:38 chat_template.json
drwxr-xr-x 1 root root 286 May 4 06:28 checkpoint-3106
drwxr-xr-x 1 root root 286 May 4 11:21 checkpoint-6212
drwxr-xr-x 1 root root 286 May 4 16:14 checkpoint-9318
-rw-r--r-- 1 root root 2117 May 5 01:38 config.json
-rw-r--r-- 1 root root 570 May 5 01:38 preprocessor_config.json
-rw-r--r-- 1 root root 70 May 5 01:38 processor_config.json
-rw-r--r-- 1 root root 662 May 5 01:38 special_tokens_map.json
-rw-r--r-- 1 root root 33384568 May 5 01:38 tokenizer.json
-rw-r--r-- 1 root root 4689074 May 5 01:38 tokenizer.model
-rw-r--r-- 1 root root 1156999 May 5 01:38 tokenizer_config.json
When I try to run accelerate launch -m axolotl.cli.merge_lora ./config.yml (after making sure auto_resume_from_checkpoints is set to true), I get the following error:
[..snip..]
File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/peft/peft_model.py", line 1272, in load_adapter
adapters_weights = load_peft_weights(model_id, device=torch_device, **hf_hub_download_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/peft/utils/save_and_load.py", line 567, in load_peft_weights
adapters_weights = safe_load_file(filename, device=device)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/safetensors/torch.py", line 313, in load_file
with safe_open(filename, framework="pt", device=device) as f:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
safetensors_rust.SafetensorError: Error while deserializing header: MetadataIncompleteBuffer
Does that mean I have no choice but to merge the earlier checkpoint-9318?
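One quick way to narrow that down is to check which adapter files are actually readable. A minimal sketch (paths taken from the directory listing above; it assumes checkpoint-9318 contains its own adapter_model.safetensors, as PEFT trainer checkpoints normally do):

```python
# Sketch: probe the safetensors headers without loading the full tensors.
# A MetadataIncompleteBuffer error here means the file was truncated mid-save.
from safetensors import safe_open

candidates = [
    "./lora-out/adapter_model.safetensors",
    "./lora-out/checkpoint-9318/adapter_model.safetensors",
]
for path in candidates:
    try:
        with safe_open(path, framework="pt", device="cpu") as f:
            print(f"{path}: OK, {len(f.keys())} tensors")
    except Exception as exc:
        print(f"{path}: unreadable ({exc})")
```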
My hypothesis for the last error is that the last checkpoint on disk that it is trying to resume/merge was saved incompletely (the save failed), so it is reading a truncated file. You can also try axolotl merge_lora ./config.yml --lora-model-dir=./output_dir/checkpoint-9318/ to explicitly point to the directory of the LoRA you want to merge. See https://docs.axolotl.ai/docs/cli.html#merge-lora
Related: #2321
It could be that cleanup_distributed() is called after model saving, and NCCL's cleanup process requires additional GPU memory that wasn't available? If that is the case, we could just clear the cache before cleanup, roughly as sketched below.
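A rough illustration of that idea (my own sketch, not the actual axolotl code in utils/distributed.py):

```python
# Sketch: free cached allocator blocks and finish pending work before tearing
# down the process group, so NCCL's shutdown path has some GPU memory headroom.
import gc

import torch
import torch.distributed as dist


def cleanup_distributed():
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.synchronize()   # make sure all pending kernels have completed
        torch.cuda.empty_cache()   # release cached, unused CUDA memory
    if dist.is_available() and dist.is_initialized():
        dist.destroy_process_group()
```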