Process Hangs When Using CPU Offloading
Please check that this issue hasn't been reported before.
- [x] I searched previous Bug Reports and didn't find any similar reports.
Expected Behavior
Training should start and proceed past the first step.
Current behaviour
The process just gets stuck (hangs) when using zero3_cpu_offloading.json.
Steps to reproduce
```yaml
base_model: Qwen/Qwen2.5-1.5B
is_qwen_derived_model: true
is_llama_derived_model: false
resume_from_checkpoint:
seed: 42
load_in_8bit: false
load_in_4bit: false
strict: false
shuffle_merged_datasets: true
trust_remote_code:
bf16: auto
fp16:
tf32: false
model_init_kwargs:
  init_device: "meta"
resize_token_embeddings_strategy: "mean"
chat_template: qwen_25
datasets:
  - path: json
    data_files: 1k.jsonl
    type: chat_template
    field_messages: messages
    trust_remote_code: true
dataset_prepared_path: data/last_run_prepared
hub_model_id:
sequence_len: 8192
pad_to_sequence_len: true
sample_packing: false
output_dir: data/output
gradient_accumulation_steps: 1
micro_batch_size: 1
num_epochs: 3
warmup_steps: 100
learning_rate: 1e-5
eval_steps: 100
save_steps: 100
save_total_limit:
#auto_find_batch_size:
load_best_model_at_end: true
metric_for_best_model: "eval_loss"
greater_is_better: false
eval_table_size: 1
group_by_length: false
train_on_inputs: false
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: true
early_stopping_patience: 3
lr_scheduler: cosine
weight_decay: 0.01
max_grad_norm: 1.0
dropout: 0.01
xformers_attention:
deepspeed: deepspeed_configs/zero3_bf16_cpuoffload_params.json
flash_attention: true
optimization:
  use_liger: true
  flash_attention_implementation: "variant-3-liger"
optimizer: adamw_torch
lr_scheduler: cosine
plugins:
  - axolotl.integrations.liger.LigerPlugin
liger_rope: true
liger_rms_norm: true
liger_glu_activation: true
liger_layer_norm: true
liger_fused_linear_cross_entropy: true
val_set_size: 0.01
do_eval: true
ddp_timeout: 300000
```
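For context on the `deepspeed:` line above, a typical ZeRO stage-3 config with parameter offload to CPU looks roughly like the sketch below. This is an assumption of the general shape of such a file (the actual contents of `deepspeed_configs/zero3_bf16_cpuoffload_params.json` in the axolotl repo may differ), shown here only to clarify what "zero3 cpu offloading" refers to:

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "bf16": {
    "enabled": true
  },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto"
}
```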
Config yaml
Possible solution
No response
Which Operating Systems are you using?
- [ ] Linux
- [ ] macOS
- [ ] Windows
Python Version
3.10.12
axolotl branch-commit
mqin
Acknowledgements
- [x] My issue title is concise, descriptive, and in title casing.
- [x] I have searched the existing issues to make sure this bug has not been reported yet.
- [x] I am using the latest version of axolotl.
- [x] I have provided enough information for the maintainers to reproduce and diagnose the issue.
Hey, what kind of system is this run on? Would you be able to fix the config formatting?