Process hangs when using CPU offloading

Open mohit-217 opened this issue 9 months ago • 2 comments

Please check that this issue hasn't been reported before.

  • [x] I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

Training should start after the first step

Current behaviour

It just gets stuck when using zero3_cpu_offloading.json
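One way to narrow down where a run like this stalls is to have Python dump every thread's stack once a step takes longer than expected. The sketch below uses only the standard-library `faulthandler` module; the `hang_watchdog` helper and the 600-second default are illustrative assumptions, not part of axolotl or DeepSpeed.

```python
import faulthandler
import sys
from contextlib import contextmanager


@contextmanager
def hang_watchdog(timeout_s=600.0, file=sys.stderr):
    """Dump all Python thread stacks if the wrapped block runs past timeout_s.

    hang_watchdog is a hypothetical helper name. `file` must be a real file
    object with a file descriptor, because faulthandler writes to the fd directly.
    """
    # repeat=True re-dumps every timeout_s seconds for as long as the block is stuck.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=file)
    try:
        yield
    finally:
        # Always disarm the timer so a finished block does not dump later.
        faulthandler.cancel_dump_traceback_later()
```

Wrapping the call that starts training in `with hang_watchdog(timeout_s=600):` would print the stack of each rank's Python threads while the process is stuck, which shows whether it is waiting in a collective, in data loading, or in parameter offload.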

Steps to reproduce

base_model: Qwen/Qwen2.5-1.5B
is_qwen_derived_model: true
is_llama_derived_model: false
resume_from_checkpoint:
seed: 42
load_in_8bit: false
load_in_4bit: false
strict: false
shuffle_merged_datasets: true
trust_remote_code:
bf16: auto
fp16:
tf32: false
model_init_kwargs:
  init_device: "meta"
  resize_token_embeddings_strategy: "mean"
chat_template: qwen_25
datasets:
  - path: json
    data_files: 1k.jsonl
    type: chat_template
    field_messages: messages
trust_remote_code: true
dataset_prepared_path: data/last_run_prepared
hub_model_id:
sequence_len: 8192
pad_to_sequence_len: true
sample_packing: false
output_dir: data/output
gradient_accumulation_steps: 1
micro_batch_size: 1
num_epochs: 3
warmup_steps: 100
learning_rate: 1e-5
eval_steps: 100
save_steps: 100
save_total_limit:
#auto_find_batch_size:
load_best_model_at_end: true
metric_for_best_model: "eval_loss"
greater_is_better: false
eval_table_size: 1
group_by_length: false
train_on_inputs: false
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: true
early_stopping_patience: 3
lr_scheduler: cosine
weight_decay: 0.01
max_grad_norm: 1.0
dropout: 0.01
xformers_attention:
deepspeed: deepspeed_configs/zero3_bf16_cpuoffload_params.json
flash_attention: true
optimization:
  use_liger: true
  flash_attention_implementation: "variant-3-liger"
optimizer: adamw_torch
lr_scheduler: cosine
plugins:
  - axolotl.integrations.liger.LigerPlugin
liger_rope: true
liger_rms_norm: true
liger_glu_activation: true
liger_layer_norm: true
liger_fused_linear_cross_entropy: true
val_set_size: 0.01
do_eval: true
ddp_timeout: 300000
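The config above sets `trust_remote_code` and `lr_scheduler` twice. Depending on the YAML loader, a repeated key may be silently overridden, so it can help to check for duplicates before a run. The following is a minimal line-based sketch, not a YAML parser: it assumes the file is already laid out one key per line, and the function name `duplicate_top_level_keys` is made up for illustration.

```python
import re
from collections import Counter

# A top-level key starts at column 0, so indented keys, list items ("- ...")
# and commented-out lines ("#key:") are not matched.
TOP_LEVEL_KEY = re.compile(r"^([A-Za-z_][\w-]*)\s*:")


def duplicate_top_level_keys(yaml_text):
    """Return top-level keys that occur more than once, in first-seen order."""
    counts = Counter()
    for line in yaml_text.splitlines():
        match = TOP_LEVEL_KEY.match(line)
        if match:
            counts[match.group(1)] += 1
    return [key for key, n in counts.items() if n > 1]
```

Calling it on the text of the config file returns the keys that need to be merged into a single entry.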

Config yaml


Possible solution

No response

Which Operating Systems are you using?

  • [ ] Linux
  • [ ] macOS
  • [ ] Windows

Python Version

3.10.12

axolotl branch-commit

main

Acknowledgements

  • [x] My issue title is concise, descriptive, and in title casing.
  • [x] I have searched the existing issues to make sure this bug has not been reported yet.
  • [x] I am using the latest version of axolotl.
  • [x] I have provided enough information for the maintainers to reproduce and diagnose the issue.

mohit-217 · Mar 11 '25 21:03

Image

mohit-217 · Mar 11 '25 21:03

Hey, what kind of system is this run on? Would you be able to fix the config formatting?

NanoCode012 · Mar 12 '25 09:03
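The system details asked about above can be gathered with a short script. Host RAM is worth including because CPU offloading moves parameters into CPU memory. This is a sketch that assumes a Unix-like host for the RAM figure; the `collect_system_info` name and the choice of fields are illustrative, and the `torch` fields are filled in only if PyTorch is importable.

```python
import os
import platform
import shutil


def collect_system_info():
    """Gather basic host details that are useful in a bug report."""
    info = {
        "os": platform.platform(),
        "python": platform.python_version(),
        "cpu_count": os.cpu_count(),
        # Only checks that the nvidia-smi binary is on PATH, not that a GPU works.
        "nvidia_smi_found": shutil.which("nvidia-smi") is not None,
    }
    try:
        # os.sysconf is Unix-only; other platforms fall through to None.
        page_size = os.sysconf("SC_PAGE_SIZE")
        phys_pages = os.sysconf("SC_PHYS_PAGES")
        info["host_ram_gib"] = round(page_size * phys_pages / 2**30, 1)
    except (AttributeError, ValueError, OSError):
        info["host_ram_gib"] = None
    try:
        import torch  # optional dependency

        info["torch"] = torch.__version__
        info["cuda_available"] = torch.cuda.is_available()
        info["gpu_count"] = torch.cuda.device_count()
    except ImportError:
        info["torch"] = None
    return info


if __name__ == "__main__":
    for key, value in collect_system_info().items():
        print(f"{key}: {value}")
```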