Training finishes but errors out while trying to write out final LoRA weights (unable to resume)
Please check that this issue hasn't been reported before.
- [x] I searched previous Bug Reports and didn't find any similar reports.
Expected Behavior
I ran a QLoRA fine-tune of Gemma 3 27B IT on a private dataset across 4 A100 40GB cards using DeepSpeed ZeRO-1. The run trains for two epochs and completes, but appears to die while trying to save the final model. The expected behavior is for the training run to complete without error, or for me to be able to continue from where it errored out to obtain the final LoRA weights and merged model.
Current behaviour
The run is launched in a Docker container built from the winglian/axolotl:0.9.0 Docker image. The Docker log at the point of failure is below:
[..snip..]
{'eval_loss': 0.47610339522361755, 'eval_runtime': 284.6172, 'eval_samples_per_second': 36.751, 'eval_steps_per_second': 4.596, 'epoch': 1.5}
[..snip..]
{'loss': 0.3956, 'grad_norm': 0.7203356027603149, 'learning_rate': 3.101717419102701e-08, 'epoch': 1.9}
{'loss': 0.4055, 'grad_norm': 0.7273358702659607, 'learning_rate': 2.18723433556009e-08, 'epoch': 1.92}
{'loss': 0.4046, 'grad_norm': 0.6099262833595276, 'learning_rate': 1.4315023363467296e-08, 'epoch': 1.93}
{'loss': 0.4023, 'grad_norm': 0.7447728514671326, 'learning_rate': 8.350055501717136e-09, 'epoch': 1.95}
{'loss': 0.4019, 'grad_norm': 0.786586344242096, 'learning_rate': 3.9812609823594584e-09, 'epoch': 1.96}
{'loss': 0.4063, 'grad_norm': 0.734952986240387, 'learning_rate': 1.2114384944172942e-09, 'epoch': 1.98}
{'loss': 0.4107, 'grad_norm': 0.6151812672615051, 'learning_rate': 4.2362411064034157e-11, 'epoch': 2.0}
{'train_runtime': 70287.9947, 'train_samples_per_second': 29.464, 'train_steps_per_second': 0.177, 'train_loss': 0.4729765019273244, 'epoch': 2.0}
[2025-05-04 21:02:24,370] [INFO] [axolotl.train.save_trained_model:233] [PID:25] [RANK:0] Training completed! Saving pre-trained model to ./lora-out.
[rankN]: Traceback (most recent call last):
[..snip..]
[rankN]: File "/workspace/axolotl/src/axolotl/cli/train.py", line 51, in do_train
[rankN]: model, tokenizer, trainer = train(cfg=cfg, dataset_meta=dataset_meta)
[rankN]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rankN]: File "/workspace/axolotl/src/axolotl/train.py", line 535, in train
[rankN]: cleanup_distributed()
[rankN]: File "/workspace/axolotl/src/axolotl/utils/distributed.py", line 99, in cleanup_distributed
[rankN]: torch.distributed.destroy_process_group()
[rankN]: File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2146, in destroy_process_group
[rankN]: _shutdown_backend(pg_to_shutdown)
[rankN]: File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1815, in _shutdown_backend
[rankN]: backend._shutdown()
[rankN]: torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.cpp:133, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
[rankN]: ncclUnhandledCudaError: Call to CUDA function failed.
[rankN]: Last error:
[rankN]: Cuda failure 'out of memory'
Oddly, there was no memory issue until after training finished. Below is the container's GPU memory use during the run:
What steps can I take to continue where it left off, saving the final LoRA weights and merging them into a model?
The process that was interrupted by the memory error was about to invoke axolotl.cli.merge_lora with the configuration file. I don't think auto_resume_from_checkpoints: true would be relevant, since the last checkpoint before completion was only 1.5 epochs in, but I don't know whether the final LoRA weights were ever saved. Do I have to resume training from the 1.5-epoch checkpoint (with more memory to avoid the error) to complete the run properly?
Any help would be greatly appreciated.
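For reference, if the final save really did fail and resuming turns out to be necessary, one hedged sketch (using the resume_from_checkpoint key that is already present but empty in the config below, and the last full checkpoint from the output_dir listing further down) would be:

```yaml
# sketch only: point the existing, currently empty key at the last full checkpoint
resume_from_checkpoint: ./lora-out/checkpoint-9318
```

and then relaunch with the same command: $ accelerate launch -m axolotl.cli.train ./config.yml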
Steps to reproduce
The dataset is private and can't be shared, but the training was launched this way: $ accelerate launch -m axolotl.cli.train ./config.yml
Config yaml
base_model: google/gemma-3-27b-it
#model_type: AutoModelForCausalLM
#tokenizer_type: AutoTokenizer
deepspeed: /path/to/zero1.json
load_in_8bit: false
load_in_4bit: true
strict: false
#rl: orpo
#orpo_alpha: 0.1
datasets:
- path: ../path/to/data.jsonl
type: #alpaca
system_prompt: ""
field_system: system
field_instruction: instruction
field_output: output
format: "<start_of_turn>user\n{input}<end_of_turn>\n<start_of_turn>model"
no_input_format: "<start_of_turn>user\n{instruction}<end_of_turn>\n<start_of_turn>model"
# no_input_format: "<|im_start|>system\n.<|im_end|>\n<|im_start|>user\n{instruction}<|im_end|>\n<|im_start|>assistant\n"
# - path: argilla/ultrafeedback-binarized-preferences-cleaned
# type: chat_template.argilla
# chat_template: chatml
dataset_prepared_path: last_run_prepared # -- XX Not in their configs
val_set_size: 0.01
output_dir: ./lora-out
adapter: qlora
sequence_len: 2048 #2048 w/out orpo and 4096 w/
sample_packing: true # - only if not doing ORPO/DPO
pad_to_sequence_len: true
save_safetensors: true
lora_r: 64
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
lora_target_modules:
- gate_proj
- down_proj
- up_proj
- q_proj
- v_proj
- k_proj
- o_proj
wandb_project: dibia-axolotl
wandb_entity:
wandb_watch:
wandb_name: dibia-gemma-3
wandb_log_model:
gradient_accumulation_steps: 1
micro_batch_size: 2
num_epochs: 2
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.000005
train_on_inputs: false
group_by_length: false
bf16: true
fp16:
tf32: false
gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 100
#save_strategy: "no"
save_steps: .25
xformers_attention:
flash_attention: true
#loss_watchdog_threshold: 5.0
#loss_watchdog_patience: 3
warmup_steps: 10
evals_per_epoch: 4
eval_table_size:
eval_max_new_tokens: 128
#saves_per_epoch: 1
debug:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
bos_token: "<bos>"
eos_token: "<eos>"
unk_token: "<unk>"
Possible solution
No response
Which Operating Systems are you using?
- [ ] Linux
- [ ] macOS
- [ ] Windows
Python Version
3.11
axolotl branch-commit
winglian/axolotl:0.9.0
Acknowledgements
- [x] My issue title is concise, descriptive, and in title casing.
- [x] I have searched the existing issues to make sure this bug has not been reported yet.
- [x] I am using the latest version of axolotl.
- [x] I have provided enough information for the maintainers to reproduce and diagnose the issue.
Below is the directory of output_dir:
total 660300
drwxr-xr-x 1 root root 262 May 5 00:43 .
drwxr-xr-x 1 root root 91 May 4 01:30 ..
-rw-r--r-- 1 root root 4394 May 5 00:12 README.md
-rw-r--r-- 1 root root 892 May 5 01:38 adapter_config.json
-rw-r--r-- 1 root root 636899136 May 5 00:12 adapter_model.safetensors
-rw-r--r-- 1 root root 35 May 5 01:38 added_tokens.json
-rw-r--r-- 1 root root 1615 May 5 01:38 chat_template.json
drwxr-xr-x 1 root root 286 May 4 06:28 checkpoint-3106
drwxr-xr-x 1 root root 286 May 4 11:21 checkpoint-6212
drwxr-xr-x 1 root root 286 May 4 16:14 checkpoint-9318
-rw-r--r-- 1 root root 2117 May 5 01:38 config.json
-rw-r--r-- 1 root root 570 May 5 01:38 preprocessor_config.json
-rw-r--r-- 1 root root 70 May 5 01:38 processor_config.json
-rw-r--r-- 1 root root 662 May 5 01:38 special_tokens_map.json
-rw-r--r-- 1 root root 33384568 May 5 01:38 tokenizer.json
-rw-r--r-- 1 root root 4689074 May 5 01:38 tokenizer.model
-rw-r--r-- 1 root root 1156999 May 5 01:38 tokenizer_config.json
When I try to run accelerate launch -m axolotl.cli.merge_lora ./config.yml (after making sure auto_resume_from_checkpoints is set to true), I get the following error:
[..snip..]
File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/peft/peft_model.py", line 1272, in load_adapter
adapters_weights = load_peft_weights(model_id, device=torch_device, **hf_hub_download_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/peft/utils/save_and_load.py", line 567, in load_peft_weights
adapters_weights = safe_load_file(filename, device=device)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/safetensors/torch.py", line 313, in load_file
with safe_open(filename, framework="pt", device=device) as f:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
safetensors_rust.SafetensorError: Error while deserializing header: MetadataIncompleteBuffer
Does that mean I have no choice but to merge the earlier checkpoint-9318?
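One quick way to narrow that down is to check which adapter files are actually readable. A minimal sketch (paths taken from the directory listing above; it assumes checkpoint-9318 contains its own adapter_model.safetensors, as PEFT trainer checkpoints normally do):

```python
# Sketch: probe the safetensors headers without loading the full tensors.
# A MetadataIncompleteBuffer error here means the file was truncated mid-save.
from safetensors import safe_open

candidates = [
    "./lora-out/adapter_model.safetensors",
    "./lora-out/checkpoint-9318/adapter_model.safetensors",
]
for path in candidates:
    try:
        with safe_open(path, framework="pt", device="cpu") as f:
            print(f"{path}: OK, {len(f.keys())} tensors")
    except Exception as exc:
        print(f"{path}: unreadable ({exc})")
```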
My hypothesis for the last error is that the last checkpoint on disk that it is trying to resume/merge was saved incompletely (the save failed), so it is reading a truncated file. You can also try axolotl merge_lora ./config.yml --lora-model-dir=./output_dir/checkpoint-9318/ to explicitly point to the directory of the LoRA you want to merge. See https://docs.axolotl.ai/docs/cli.html#merge-lora
Related: #2321
It could be that cleanup_distributed() is called after model saving, and NCCL's cleanup process requires additional GPU memory that wasn't available? If that is the case, we could just clear the cache before cleanup, roughly as sketched below.
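A rough illustration of that idea (my own sketch, not the actual axolotl code in utils/distributed.py):

```python
# Sketch: free cached allocator blocks and finish pending work before tearing
# down the process group, so NCCL's shutdown path has some GPU memory headroom.
import gc

import torch
import torch.distributed as dist


def cleanup_distributed():
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.synchronize()   # make sure all pending kernels have completed
        torch.cuda.empty_cache()   # release cached, unused CUDA memory
    if dist.is_available() and dist.is_initialized():
        dist.destroy_process_group()
```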