
OOM when fine-tuning with p-tuning on multiple GPUs

[Open] Rorschaaaach opened this issue 1 year ago • 9 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Current Behavior

Fine-tuning on a single machine with 4×32 GB GPUs goes straight to OOM. I set CUDA_VISIBLE_DEVICES=0,1,2,3 in train.sh.

Traceback (most recent call last):
  File "main.py", line 431, in <module>
    main()
  File "main.py", line 370, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/data/ChatGLM-6B-main/ptuning/trainer.py", line 1635, in train
    return inner_training_loop(
  File "/data/ChatGLM-6B-main/ptuning/trainer.py", line 1904, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/data/ChatGLM-6B-main/ptuning/trainer.py", line 2665, in training_step
    loss.backward()
  File "/root/miniconda3/envs/ChatGLMv2/lib/python3.8/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/root/miniconda3/envs/ChatGLMv2/lib/python3.8/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/root/miniconda3/envs/ChatGLMv2/lib/python3.8/site-packages/torch/autograd/function.py", line 267, in apply
    return user_fn(self, *args)
  File "/root/miniconda3/envs/ChatGLMv2/lib/python3.8/site-packages/torch/nn/parallel/_functions.py", line 34, in backward
    return (None,) + ReduceAddCoalesced.apply(ctx.input_device, ctx.num_inputs, *grad_outputs)
  File "/root/miniconda3/envs/ChatGLMv2/lib/python3.8/site-packages/torch/nn/parallel/_functions.py", line 45, in forward
    return comm.reduce_add_coalesced(grads, destination)
  File "/root/miniconda3/envs/ChatGLMv2/lib/python3.8/site-packages/torch/nn/parallel/comm.py", line 143, in reduce_add_coalesced
    flat_result = reduce_add(flat_tensors, destination)
  File "/root/miniconda3/envs/ChatGLMv2/lib/python3.8/site-packages/torch/nn/parallel/comm.py", line 95, in reduce_add
    result = torch.empty_like(inputs[root_index])
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 31.75 GiB total capacity; 30.53 GiB already allocated; 87.69 MiB free; 30.67 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Expected Behavior

No response

Steps To Reproduce

Environment

- OS: Ubuntu 18.04
- Python: 3.8.15
- Transformers: 4.28.1
- PyTorch: 1.13.0
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`):

Anything else?

No response

Rorschaaaach · May 04 '23 09:05
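
[Editor's note] The allocator hint at the end of the traceback can be tried directly from ptuning/train.sh by exporting PYTORCH_CUDA_ALLOC_CONF before the launch line. A minimal sketch, assuming the stock train.sh; the 128 MiB split size is an illustrative value, not a recommended setting, and fragmentation tuning alone may not be enough when GPU 0 is already holding 30+ GiB:

    # ptuning/train.sh (excerpt, sketch): try the allocator hint from the OOM
    # message by exporting PYTORCH_CUDA_ALLOC_CONF before launching main.py.
    # The 128 MiB split size is illustrative only.
    export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128

    # existing launch line follows unchanged, e.g.:
    # CUDA_VISIBLE_DEVICES=0,1,2,3 python3 main.py ...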

I'm seeing a similar error.

eziohzy · May 12 '23 03:05

Yours is running out of GPU memory.

pc123s · May 12 '23 08:05

I hit a similar error: four GPUs, but still OOM.

Dusangrm · May 14 '23 06:05

Did you solve it? Same for me: with multiple GPUs it reports OOM.

DuBaiSheng · May 19 '23 10:05

Same here. How can this be resolved?

lixingbu-tal · May 22 '23 02:05

If you still get OOM after setting CUDA_VISIBLE_DEVICES=0,1,2,3, reduce the batch size; after that, lower max_source_length as well.

Dusangrm · May 22 '23 02:05
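
[Editor's note] For concreteness, the knobs mentioned above map onto arguments of ptuning/main.py. The sketch below assumes the stock ptuning/train.sh and its AdvertiseGen example data; the values are illustrative (per-GPU batch of 1 compensated by gradient accumulation, short source/target lengths), not settings verified in this thread:

    # Sketch of a lower-memory ptuning/train.sh invocation (illustrative values,
    # assuming the stock ChatGLM-6B ptuning script and its AdvertiseGen example data):
    # - per-GPU batch size of 1, with gradient accumulation to keep the effective batch size
    # - shorter max_source_length / max_target_length to shrink activations
    PRE_SEQ_LEN=128
    LR=2e-2

    CUDA_VISIBLE_DEVICES=0,1,2,3 python3 main.py \
        --do_train \
        --train_file AdvertiseGen/train.json \
        --validation_file AdvertiseGen/dev.json \
        --prompt_column content \
        --response_column summary \
        --model_name_or_path THUDM/chatglm-6b \
        --output_dir output/adgen-chatglm-6b-pt-$PRE_SEQ_LEN-$LR \
        --overwrite_output_dir \
        --max_source_length 64 \
        --max_target_length 64 \
        --per_device_train_batch_size 1 \
        --gradient_accumulation_steps 16 \
        --max_steps 3000 \
        --logging_steps 10 \
        --save_steps 1000 \
        --learning_rate $LR \
        --pre_seq_len $PRE_SEQ_LEN

Note that lowering max_source_length trades input context for memory, so it only suits tasks with short inputs.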

I can train on a single GPU right now, using about 13 GB of GPU memory. With multiple GPUs it OOMs immediately; I don't know what the problem is.

alanlaye617 · May 26 '23 05:05

With multiple GPUs, memory usage is not the same across the cards.

eternalgogi92 · May 26 '23 06:05
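
[Editor's note] A possible explanation for the uneven usage, judging from the traceback in the original report: the frames in torch/nn/parallel/_functions.py (ReduceAddCoalesced) come from torch.nn.DataParallel, which the Transformers Trainer falls back to when several GPUs are visible to a single process. Gradients from every replica are reduced onto GPU 0, so GPU 0 fills up first even though the other cards still have headroom. A commonly suggested alternative, sketched below under that assumption (not a fix verified in this thread), is to launch one process per GPU with torchrun so the Trainer uses DistributedDataParallel instead:

    # Sketch: replace the single-process launch in ptuning/train.sh with a torchrun
    # launch (one process per GPU). With a distributed launch the Trainer wraps the
    # model in DistributedDataParallel, so gradients are all-reduced across ranks
    # instead of being gathered onto GPU 0 by DataParallel.
    # Data/model arguments are omitted for brevity and are the same as in the
    # earlier sketch; the --output_dir value is a hypothetical name.
    torchrun --nproc_per_node=4 main.py \
        --do_train \
        --per_device_train_batch_size 1 \
        --pre_seq_len 128 \
        --output_dir output/adgen-chatglm-6b-pt-ddp

With four processes the effective batch size is four times the per-device value, and each rank still keeps a full copy of the frozen ChatGLM-6B weights.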

@xiangchu95 @DuBaiSheng Could anyone advise: with multiple GPUs, training takes far longer than with a single GPU. How should this be resolved?

niuhuluzhihao · Jun 16 '23 16:06

Duplicate of #890

zhangch9 · Aug 16 '23 12:08

I can train on a single GPU right now, using about 13 GB of GPU memory. With multiple GPUs it OOMs immediately; I don't know what the problem is.

Same problem here. Even with per_device_train_batch_size set to 1 it still happens. Has anyone solved this?

yuhp-zts · Feb 23 '24 01:02
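
[Editor's note] If the per-device batch size is already 1, one remaining memory knob in the ptuning scripts is --quantization_bit, which loads the frozen ChatGLM-6B backbone in INT4 or INT8 and reduces its weight footprint on each GPU. A sketch under the same assumptions as the earlier sketches (illustrative values, data/model arguments omitted):

    # Sketch: add backbone weight quantization to the launch line. The remaining
    # arguments are the same as in the earlier sketches; INT4 (--quantization_bit 4)
    # uses less memory than INT8, at some cost in quality.
    CUDA_VISIBLE_DEVICES=0,1,2,3 python3 main.py \
        --do_train \
        --per_device_train_batch_size 1 \
        --quantization_bit 4 \
        --pre_seq_len 128 \
        --output_dir output/adgen-chatglm-6b-pt-int4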