
Abnormal GPU memory distribution when pretraining codeqwen1.5-7b; OOM after training for a while

Open Cucunnber opened this issue 1 month ago • 4 comments

Reminder

  • [X] I have read the README and searched the existing issues.

Reproduction

The training framework is LLaMA-Factory-0.7.0.

export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=0
export NCCL_SOCKET_IFNAME=eth10

model_path=codeqwen1.5-7B
dataset=codeqwen_0305
outputdir=codeqwen-pt-0527-new0305dataset
gradient_accumulation_steps=2
per_device_batchsize=2
epoch_num=2
learning_rate=1.5e-05

deepspeed --hostfile hostfile.txt --master_addr=2.0.0.1 src/train.py \
--model_name_or_path $model_path \
--stage pt \
--dataset $dataset \
--finetuning_type  full \
--overwrite_cache  true \
--flash_attn fa2 \
--preprocessing_num_workers 64 \
--template default \
--output_dir $outputdir \
--bf16  true  \
--lr_scheduler_type  cosine \
--do_train  true  \
--do_eval true \
--packing false \
--gradient_accumulation_steps  $gradient_accumulation_steps \
--gradient_checkpointing  true \
--learning_rate  $learning_rate \
--log_level  passive \
--logging_steps  10 \
--logging_strategy  steps \
--max_steps  -1 \
--num_train_epochs $epoch_num \
--report_to tensorboard \
--weight_decay 0.01 \
--cutoff_len 8192 \
--warmup_ratio 0.02 \
--eval_steps 200 \
--val_size 0.01 \
--evaluation_strategy steps \
--overwrite_output_dir  true  \
--per_device_train_batch_size  $per_device_batchsize \
--remove_unused_columns  true \
--save_strategy epoch \
--plot_loss \
--save_total_limit 3 \
--save_safetensors  true  \
--deepspeed=ds_z3_lr_schedule.json
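
The ds_z3_lr_schedule.json passed above is not pasted here. For reference, a ZeRO-3 config of the kind typically used with LLaMA-Factory looks roughly like the sketch below; the values are generic placeholders, not the exact file used in this run.

# Generic ZeRO-3 sketch (assumed contents; the actual ds_z3_lr_schedule.json may differ)
cat > ds_z3_lr_schedule.json <<'EOF'
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
EOF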

Expected behavior

When continuing pretraining of codeqwen1.5-7B, GPU memory usage is abnormally high, and OOM occurs after training for a while.

ib125:     return F.cross_entropy(input, target, weight=self.weight,
ib125:   File "/home/chatgpt/.local/lib/python3.10/site-packages/torch/nn/functional.py", line 3053, in cross_entropy
ib125:     loss = loss_fct(shift_logits, shift_labels)
ib125:   File "/home/chatgpt/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
ib125:     return self._call_impl(*args, **kwargs)
ib125:   File "/home/chatgpt/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
ib125:     return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
ib125: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 22.50 GiB. GPU 3 has a total capacty of 79.15 GiB of which 7.69 GiB is free. Including non-PyTorch memory, this process 4 GiB memory in use. Of the allocated memory 47.39 GiB is allocated by PyTorch, and 23.23 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ib125:     return forward_call(*args, **kwargs)
ib125:   File "/home/chatgpt/.local/lib/python3.10/site-packages/torch/nn/modules/loss.py", line 1179, in forward
ib125:     return F.cross_entropy(input, target, weight=self.weight,
ib125:   File "/home/chatgpt/.local/lib/python3.10/site-packages/torch/nn/functional.py", line 3053, in cross_entropy
ib125:     return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
ib125: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 22.50 GiB. GPU 1 has a total capacty of 79.15 GiB of which 7.39 GiB is free. Including non-PyTorch memory, this process 4 GiB memory in use. Of the allocated memory 47.40 GiB is allocated by PyTorch, and 23.52 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ib125: Traceback (most recent call last):
ib125:   File "/var/mntpkg/LLaMA-Factory-0.7.0/src/train.py", line 14, in <module>
ib125:     main()
ib125:   File "/var/mntpkg/LLaMA-Factory-0.7.0/src/train.py", line 5, in main
ib125:     run_exp()
ib125:   File "/var/mntpkg/LLaMA-Factory-0.7.0/src/llmtuner/train/tuner.py", line 31, in run_exp
ib125:     run_pt(model_args, data_args, training_args, finetuning_args, callbacks)
ib125:   File "/var/mntpkg/LLaMA-Factory-0.7.0/src/llmtuner/train/pt/workflow.py", line 47, in run_pt
ib125:     train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
ib125:   File "/home/chatgpt/.local/lib/python3.10/site-packages/transformers/trainer.py", line 1780, in train
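
The OOM message itself suggests setting max_split_size_mb to reduce allocator fragmentation. A quick way to try that (untested here; the value 128 is only an example) is to export the option before launching:

# Allocator option referenced in the OOM message; 128 MB is an illustrative value
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128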


+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-80GB          Off | 00000000:1F:00.0 Off |                    0 |
| N/A   50C    P0             116W / 400W |  74831MiB / 81920MiB |     97%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          Off | 00000000:25:00.0 Off |                    0 |
| N/A   65C    P0             148W / 400W |  69291MiB / 81920MiB |     97%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM4-80GB          Off | 00000000:50:00.0 Off |                    0 |
| N/A   66C    P0             125W / 400W |  60269MiB / 81920MiB |     98%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM4-80GB          Off | 00000000:55:00.0 Off |                    0 |
| N/A   52C    P0             125W / 400W |  36859MiB / 81920MiB |     97%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM4-80GB          Off | 00000000:90:00.0 Off |                    0 |
| N/A   52C    P0             147W / 400W |  36783MiB / 81920MiB |     98%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM4-80GB          Off | 00000000:95:00.0 Off |                    0 |
| N/A   66C    P0             163W / 400W |  36961MiB / 81920MiB |     97%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM4-80GB          Off | 00000000:CB:00.0 Off |                    0 |
| N/A   64C    P0             123W / 400W |  60133MiB / 81920MiB |     98%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM4-80GB          Off | 00000000:D1:00.0 Off |                    0 |
| N/A   50C    P0             140W / 400W |  36889MiB / 81920MiB |     97%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+

System Info

When the OOM first occurred I was using 2 nodes with 16 GPUs.

  • A100-SXM4-80GB X 16
- `transformers` version: 4.41.1
- Platform: Linux-5.15.0-86-generic-x86_64-with-glibc2.35
- Python version: 3.10.12
- Huggingface_hub version: 0.23.1
- Safetensors version: 0.4.2
- Accelerate version: 0.27.2
- Accelerate config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: DEEPSPEED
        - use_cpu: False
        - debug: True
        - num_processes: 16
        - machine_rank: 0
        - num_machines: 2
        - main_process_ip: 2.0.0.1
        - main_process_port: 9995
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - deepspeed_config: {'deepspeed_config_file': 'deepspeed_z2_config_bf16.json', 'deepspeed_multinode_launcher': 'standard', 'zero3_init_flag': True}
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []
- PyTorch version (GPU?): 2.1.1+cu121 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>

Others

I have run many training jobs before, and normally training a 7B model with this batch size and cutoff_len does not run out of memory. In addition, nvidia-smi shows that memory allocation across the GPUs is very uneven.

For now I am not sure whether this is caused by the training framework or by the model architecture; any insight would be appreciated.
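
To make the unevenness easier to capture over time, per-GPU memory can be logged alongside training with something like the command below (the output file name is just an example):

# Append per-GPU memory usage to a log every 10 seconds
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 10 >> gpu_mem.log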

Cucunnber · May 27 '24 02:05