LLaMA-Factory
Abnormal GPU memory distribution when pretraining codeqwen1.5-7b; OOM after training for a while
Reminder
- [X] I have read the README and searched the existing issues.
Reproduction
The training framework is LLaMA-Factory-0.7.0.
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=0
export NCCL_SOCKET_IFNAME=eth10
model_path=codeqwen1.5-7B
dataset=codeqwen_0305
outputdir=codeqwen-pt-0527-new0305dataset
gradient_accumulation_steps=2
per_device_batchsize=2
epoch_num=2
learning_rate=1.5e-05
deepspeed --hostfile hostfile.txt --master_addr=2.0.0.1 src/train.py --model_name_or_path $model_path --stage pt \
--dataset $dataset \
--finetuning_type full \
--overwrite_cache true \
--flash_attn fa2 \
--preprocessing_num_workers 64 \
--template default \
--output_dir $outputdir \
--bf16 true \
--lr_scheduler_type cosine \
--do_train true \
--do_eval true \
--packing false \
--gradient_accumulation_steps $gradient_accumulation_steps \
--gradient_checkpointing true \
--learning_rate $learning_rate \
--log_level passive \
--logging_steps 10 \
--logging_strategy steps \
--max_steps -1 \
--num_train_epochs $epoch_num \
--report_to tensorboard \
--weight_decay 0.01 \
--cutoff_len 8192 \
--warmup_ratio 0.02 \
--eval_steps 200 \
--val_size 0.01 \
--evaluation_strategy steps \
--overwrite_output_dir true \
--per_device_train_batch_size $per_device_batchsize \
--remove_unused_columns true \
--save_strategy epoch \
--plot_loss \
--save_total_limit 3 \
--save_safetensors true \
--deepspeed=ds_z3_lr_schedule.json
Expected behavior
During continued pretraining, codeqwen1.5-7B uses an abnormally large amount of GPU memory, and the run hits OOM after training for a while.
ib125: return F.cross_entropy(input, target, weight=self.weight,
ib125: File "/home/chatgpt/.local/lib/python3.10/site-packages/torch/nn/functional.py", line 3053, in cross_entropy
ib125: loss = loss_fct(shift_logits, shift_labels)
ib125: File "/home/chatgpt/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
ib125: return self._call_impl(*args, **kwargs)
ib125: File "/home/chatgpt/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
ib125: return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
ib125: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 22.50 GiB. GPU 3 has a total capacty of 79.15 GiB of which 7.69 GiB is free. Including non-PyTorch memory, this process 4 GiB memory in use. Of the allocated memory 47.39 GiB is allocated by PyTorch, and 23.23 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ib125: return forward_call(*args, **kwargs)
ib125: File "/home/chatgpt/.local/lib/python3.10/site-packages/torch/nn/modules/loss.py", line 1179, in forward
ib125: return F.cross_entropy(input, target, weight=self.weight,
ib125: File "/home/chatgpt/.local/lib/python3.10/site-packages/torch/nn/functional.py", line 3053, in cross_entropy
ib125: return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
ib125: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 22.50 GiB. GPU 1 has a total capacty of 79.15 GiB of which 7.39 GiB is free. Including non-PyTorch memory, this process 4 GiB memory in use. Of the allocated memory 47.40 GiB is allocated by PyTorch, and 23.52 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ib125: Traceback (most recent call last):
ib125: File "/var/mntpkg/LLaMA-Factory-0.7.0/src/train.py", line 14, in <module>
ib125: main()
ib125: File "/var/mntpkg/LLaMA-Factory-0.7.0/src/train.py", line 5, in main
ib125: run_exp()
ib125: File "/var/mntpkg/LLaMA-Factory-0.7.0/src/llmtuner/train/tuner.py", line 31, in run_exp
ib125: run_pt(model_args, data_args, training_args, finetuning_args, callbacks)
ib125: File "/var/mntpkg/LLaMA-Factory-0.7.0/src/llmtuner/train/pt/workflow.py", line 47, in run_pt
ib125: train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
ib125: File "/home/chatgpt/.local/lib/python3.10/site-packages/transformers/trainer.py", line 1780, in train
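As a rough sanity check on the 22.50 GiB allocation inside `cross_entropy`, the size of a single logits tensor can be estimated from batch size, sequence length, and vocabulary size. The sketch below is only back-of-the-envelope arithmetic. It assumes a vocabulary size of 92416 for codeqwen1.5-7B (please verify against the model's `config.json`) and assumes the logits are held in float32 (4 bytes per element) when the loss is computed. The helper names are made up for illustration.

```python
def logits_bytes(batch_size: int, seq_len: int, vocab_size: int,
                 bytes_per_elem: int = 4) -> int:
    """Bytes needed for one (batch, seq, vocab) logits tensor."""
    return batch_size * seq_len * vocab_size * bytes_per_elem


def to_gib(num_bytes: int) -> float:
    """Convert bytes to GiB (2**30 bytes)."""
    return num_bytes / 2**30


VOCAB_SIZE = 92416  # assumption: check config.json of the actual checkpoint
CUTOFF_LEN = 8192   # from --cutoff_len in the command above

# per_device_train_batch_size=2 from the command above
train_gib = to_gib(logits_bytes(2, CUTOFF_LEN, VOCAB_SIZE))
# batch size 8 is the transformers default for per_device_eval_batch_size,
# which the command above does not override
eval_gib = to_gib(logits_bytes(8, CUTOFF_LEN, VOCAB_SIZE))

print(f"batch 2: {train_gib:.2f} GiB per logits tensor")
print(f"batch 8: {eval_gib:.2f} GiB per logits tensor")
```

Under these assumptions the batch-size-8 estimate lands close to the 22.50 GiB in the error, so it may be worth checking whether the OOM coincides with an evaluation step (`--eval_steps 200`). This is a guess to narrow things down, not a confirmed diagnosis.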
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM4-80GB Off | 00000000:1F:00.0 Off | 0 |
| N/A 50C P0 116W / 400W | 74831MiB / 81920MiB | 97% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM4-80GB Off | 00000000:25:00.0 Off | 0 |
| N/A 65C P0 148W / 400W | 69291MiB / 81920MiB | 97% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM4-80GB Off | 00000000:50:00.0 Off | 0 |
| N/A 66C P0 125W / 400W | 60269MiB / 81920MiB | 98% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM4-80GB Off | 00000000:55:00.0 Off | 0 |
| N/A 52C P0 125W / 400W | 36859MiB / 81920MiB | 97% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 4 NVIDIA A100-SXM4-80GB Off | 00000000:90:00.0 Off | 0 |
| N/A 52C P0 147W / 400W | 36783MiB / 81920MiB | 98% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 5 NVIDIA A100-SXM4-80GB Off | 00000000:95:00.0 Off | 0 |
| N/A 66C P0 163W / 400W | 36961MiB / 81920MiB | 97% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 6 NVIDIA A100-SXM4-80GB Off | 00000000:CB:00.0 Off | 0 |
| N/A 64C P0 123W / 400W | 60133MiB / 81920MiB | 98% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 7 NVIDIA A100-SXM4-80GB Off | 00000000:D1:00.0 Off | 0 |
| N/A 50C P0 140W / 400W | 36889MiB / 81920MiB | 97% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
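To put a number on how uneven the allocation is, the per-GPU usage can be read in machine-friendly form and summarized. The sketch below assumes the CSV query mode of `nvidia-smi` (`--query-gpu=index,memory.used --format=csv,noheader,nounits`). The function names are hypothetical, and the sample values are copied from the table above so the example runs without a GPU.

```python
import subprocess


def query_memory_used() -> str:
    """Return 'index, used_mib' lines from nvidia-smi (needs an NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,memory.used",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_memory_used(csv_text: str) -> dict[int, int]:
    """Parse 'index, used_mib' lines into {gpu_index: used_mib}."""
    usage = {}
    for line in csv_text.strip().splitlines():
        index, used = line.split(",")
        usage[int(index)] = int(used)
    return usage


def spread_mib(usage: dict[int, int]) -> int:
    """Difference between the most and least loaded GPU, in MiB."""
    return max(usage.values()) - min(usage.values())


# Sample values taken from the nvidia-smi table above (MiB used per GPU).
SAMPLE = """0, 74831
1, 69291
2, 60269
3, 36859
4, 36783
5, 36961
6, 60133
7, 36889"""

usage = parse_memory_used(SAMPLE)
print(f"min={min(usage.values())} MiB, max={max(usage.values())} MiB, "
      f"spread={spread_mib(usage)} MiB")
```

On a live node, `parse_memory_used(query_memory_used())` would give the same summary. Logging it every few steps could show whether the imbalance grows over time or is present from the start.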
System Info
When the OOM first occurred, I was using 2 nodes with 16 GPUs in total.
- A100-SXM4-80GB X 16
- `transformers` version: 4.41.1
- Platform: Linux-5.15.0-86-generic-x86_64-with-glibc2.35
- Python version: 3.10.12
- Huggingface_hub version: 0.23.1
- Safetensors version: 0.4.2
- Accelerate version: 0.27.2
- Accelerate config: - compute_environment: LOCAL_MACHINE
- distributed_type: DEEPSPEED
- use_cpu: False
- debug: True
- num_processes: 16
- machine_rank: 0
- num_machines: 2
- main_process_ip: 2.0.0.1
- main_process_port: 9995
- rdzv_backend: static
- same_network: True
- main_training_function: main
- deepspeed_config: {'deepspeed_config_file': 'deepspeed_z2_config_bf16.json', 'deepspeed_multinode_launcher': 'standard', 'zero3_init_flag': True}
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
- PyTorch version (GPU?): 2.1.1+cu121 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>
Others
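I have trained models many times before. Normally, training a 7B model with this batch size and cutoff_len does not run into OOM. In this run, nvidia-smi also shows that GPU memory is allocated very unevenly across devices.
I am not yet sure whether the cause lies in the training framework or in the model architecture. I would appreciate it if someone could help explain.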