
OOM error with 128× A800 80G, Qwen2 7B, cutoff_len 8192

BobTsang1995 opened this issue 7 months ago · 11 comments

Reminder

  • [X] I have read the README and searched the existing issues.

System Info

```yaml
### model
model_name_or_path: /mnt/nas/shanzhi/eval_models/Qwen2-72B

### method
stage: sft
do_train: true
finetuning_type: full

### ddp
ddp_timeout: 180000000
deepspeed: examples/deepspeed/ds_z3_offload_config.json

### dataset
dataset: model_filing_toxicity,tagengo_train_formatted,google_ift_data_v1,google_ift_data_v2,google_ift_data_v3,self_cognition_aib_multilingual,ultra_chat_200k_train_sft,glaive-code,MetaMathQA,MathInstruct,mh_org_CoT_collection_fr_remove_keywords,mh_org_CoT_collection_ja_remove_keywords,mh_org_CoT_collection_ko_remove_keywords,mh_org_CoT_collection_ru_remove_keywords,mh_org_CoT_collection_zh2_remove_keywords,mh_org_ar_remove_keywords,mh_org_bn_remove_keywords,mh_org_de_remove_keywords,mh_org_en_remove_keywords,mh_org_es_remove_keywords,mh_org_fr_remove_keywords,mh_org_he_remove_keywords,mh_org_id_remove_keywords,mh_org_ja_remove_keywords,mh_org_ko_remove_keywords,mh_org_my_remove_keywords,mh_org_nl_remove_keywords,mh_org_pl_remove_keywords,mh_org_pt_remove_keywords,mh_org_ru_remove_keywords,mh_org_ta_remove_keywords,mh_org_te_remove_keywords,mh_org_th_remove_keywords,mh_org_tr_remove_keywords,mh_org_ur_remove_keywords,mh_org_vi_remove_keywords,mh_org_zh_remove_keywords,mh_org_orca_remove_keywords_1,mh_org_orca_remove_keywords_2,openqa_dedup
template: qwen
cutoff_len: 8192
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 128

### output
output_dir: /mnt/nas/liyadong/sft_models/Qwen2-7B-alldata-packing-bs1024-lr4e-6-5epoch-32k
logging_steps: 10
save_steps: 500
save_strategy: epoch
plot_loss: true
overwrite_output_dir: true

### train
flash_attn: fa2
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 0.000004
num_train_epochs: 5.0
lr_scheduler_type: cosine
warmup_steps: 0.1
bf16: true
neftune_noise_alpha: 5
packing: true

### eval
val_size: 0.1
per_device_eval_batch_size: 1
evaluation_strategy: steps
eval_steps: 500
```
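A back-of-the-envelope sketch of the per-GPU memory that ZeRO-3 sharding leaves for model state, assuming the 72B checkpoint named in `model_name_or_path`, bf16 weights and gradients, and fp32 Adam optimizer states (~16 bytes/param total). This deliberately ignores activations, which grow with `cutoff_len` and are not sharded by ZeRO-3, so real usage is higher:

```python
# Rough ZeRO-3 model-state estimate (assumptions: 72e9 params, 128 GPUs,
# 2 B bf16 weights + 2 B bf16 grads + 12 B fp32 Adam states per param).
params = 72e9
gpus = 128
bytes_per_param = 2 + 2 + 12
per_gpu_gib = params * bytes_per_param / gpus / 2**30
print(f"model state per GPU: {per_gpu_gib:.1f} GiB")
```

Under these assumptions the sharded model state alone is under 10 GiB per GPU, so a 128 GiB single allocation (as in the traceback below) points at something unsharded, e.g. an activation or logits tensor at `cutoff_len: 8192`, rather than the parameters themselves.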

Reproduction

```
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 128.00 GiB. GPU 0 has a total capacty of 79.35 GiB of which 64.00 GiB is free. Process 1844 has 15.28 GiB memory in use. Of the allocated memory 9.63 GiB is allocated by PyTorch, and 4.61 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```

Expected behavior

No response

Others

No response

BobTsang1995 · Jul 13 '24 07:07