LLaMA-Factory
128 × A800 80G, Qwen2 7B, cutoff_len 8192: CUDA OOM error
Reminder
- [X] I have read the README and searched the existing issues.
System Info
model
model_name_or_path: /mnt/nas/shanzhi/eval_models/Qwen2-72B
method
stage: sft
do_train: true
finetuning_type: full
ddp
ddp_timeout: 180000000
deepspeed: examples/deepspeed/ds_z3_offload_config.json
dataset
dataset: model_filing_toxicity,tagengo_train_formatted,google_ift_data_v1,google_ift_data_v2,google_ift_data_v3,self_cognition_aib_multilingual,ultra_chat_200k_train_sft,glaive-code,MetaMathQA,MathInstruct,mh_org_CoT_collection_fr_remove_keywords,mh_org_CoT_collection_ja_remove_keywords,mh_org_CoT_collection_ko_remove_keywords,mh_org_CoT_collection_ru_remove_keywords,mh_org_CoT_collection_zh2_remove_keywords,mh_org_ar_remove_keywords,mh_org_bn_remove_keywords,mh_org_de_remove_keywords,mh_org_en_remove_keywords,mh_org_es_remove_keywords,mh_org_fr_remove_keywords,mh_org_he_remove_keywords,mh_org_id_remove_keywords,mh_org_ja_remove_keywords,mh_org_ko_remove_keywords,mh_org_my_remove_keywords,mh_org_nl_remove_keywords,mh_org_pl_remove_keywords,mh_org_pt_remove_keywords,mh_org_ru_remove_keywords,mh_org_ta_remove_keywords,mh_org_te_remove_keywords,mh_org_th_remove_keywords,mh_org_tr_remove_keywords,mh_org_ur_remove_keywords,mh_org_vi_remove_keywords,mh_org_zh_remove_keywords,mh_org_orca_remove_keywords_1,mh_org_orca_remove_keywords_2,openqa_dedup
template: qwen
cutoff_len: 8192
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 128
output
output_dir: /mnt/nas/liyadong/sft_models/Qwen2-7B-alldata-packing-bs1024-lr4e-6-5epoch-32k
logging_steps: 10
save_steps: 500
save_strategy: epoch
plot_loss: true
overwrite_output_dir: true
train
flash_attn: fa2
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 0.000004
num_train_epochs: 5.0
lr_scheduler_type: cosine
warmup_steps: 0.1
bf16: true
neftune_noise_alpha: 5
packing: true
eval
val_size: 0.1
per_device_eval_batch_size: 1
evaluation_strategy: steps
eval_steps: 500
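For context, a rough sketch of the effective batch implied by the config above, assuming the 128 GPUs mentioned in the title (the GPU count is not part of the config itself):

```python
# Back-of-the-envelope arithmetic from the config above.
# Assumption: 128 GPUs, taken from the issue title (128 x A800 80G).
per_device_train_batch_size = 1
gradient_accumulation_steps = 8
num_gpus = 128
cutoff_len = 8192  # packing: true, so each sample is packed up to this length

global_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
tokens_per_step = global_batch_size * cutoff_len

print(global_batch_size)  # 1024 packed sequences per optimizer step (matches "bs1024" in output_dir)
print(tokens_per_step)    # 8,388,608 tokens per optimizer step
```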
Reproduction
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 128.00 GiB. GPU 0 has a total capacty of 79.35 GiB of which 64.00 GiB is free. Process 1844 has 15.28 GiB memory in use. Of the allocated memory 9.63 GiB is allocated by PyTorch, and 4.61 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
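The error message itself suggests experimenting with `max_split_size_mb` via `PYTORCH_CUDA_ALLOC_CONF` to reduce fragmentation; a minimal sketch of setting it from Python is below (the 512 MiB value is an illustrative choice, not something prescribed by the log):

```python
import os

# Must be set before the first CUDA allocation, i.e. before the training
# process initializes CUDA. The 512 MiB split size is only an example value.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"

import torch  # imported after setting the allocator config
```

Equivalently, the variable can be exported in the launch environment before starting training.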
Expected behavior
No response
Others
No response