LLaMA-Factory

Fine-tuning chatglm3-6b on an Alibaba Cloud V100: GPU memory looks mostly unused, but training fails with OutOfMemoryError: CUDA out of memory

Open ysskevin opened this issue 10 months ago • 4 comments

Reminder

  • [X] I have read the README and searched the existing issues.

Reproduction

a. GPU memory usage:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 12.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:08.0 Off |                    0 |
| N/A   36C    P0    56W / 300W |   1290MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|

c. Model being fine-tuned: chatglm3-6b

b. Error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.00 MiB. GPU 0 has a total capacty of 15.78 GiB of which 9.75 MiB is free. Process 4037 has 15.77 GiB memory in use. Of the allocated memory 14.38 GiB is allocated by PyTorch, and 147.60 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
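For readers comparing the two readings above, here is a minimal sketch that pulls the sizes out of the quoted error text and does the arithmetic. The regular expressions and the helper name `parse_oom_sizes` are illustrative assumptions, not part of PyTorch or LLaMA-Factory, and the message format may differ in other PyTorch versions.

```python
import re

# Error text as quoted in the post (PyTorch's wording, typo included).
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 16.00 MiB. "
    "GPU 0 has a total capacty of 15.78 GiB of which 9.75 MiB is free. "
    "Process 4037 has 15.77 GiB memory in use. "
    "Of the allocated memory 14.38 GiB is allocated by PyTorch, "
    "and 147.60 MiB is reserved by PyTorch but unallocated."
)

UNIT_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}

# Hypothetical patterns written for this one message format.
PATTERNS = {
    "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
    "total": r"total capac\w* of ([\d.]+) (MiB|GiB)",
    "free": r"of which ([\d.]+) (MiB|GiB) is free",
    "in_use": r"has ([\d.]+) (MiB|GiB) memory in use",
    "allocated": r"allocated memory ([\d.]+) (MiB|GiB) is allocated by PyTorch",
    "reserved_unallocated": r"and ([\d.]+) (MiB|GiB) is reserved by PyTorch",
}


def parse_oom_sizes(text: str) -> dict:
    """Return every size found in the message, converted to MiB."""
    sizes = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        if match is None:
            raise ValueError(f"could not find '{name}' in the message")
        sizes[name] = float(match.group(1)) * UNIT_TO_MIB[match.group(2)]
    return sizes


sizes = parse_oom_sizes(OOM_MESSAGE)

# Memory held by the process but not tracked by PyTorch's allocator.
outside_pytorch = sizes["in_use"] - sizes["allocated"] - sizes["reserved_unallocated"]

# Share of PyTorch's memory that is cached but unused (a fragmentation hint).
cached_share = sizes["reserved_unallocated"] / sizes["allocated"]

print(f"requested {sizes['requested']:.2f} MiB, free {sizes['free']:.2f} MiB")
print(f"held outside PyTorch's allocator: {outside_pytorch:.2f} MiB")
print(f"reserved-but-unallocated share: {cached_share:.1%}")
```

By the message's own figures, the 16 MiB request exceeds the 9.75 MiB that is free, and the reserved-but-unallocated pool is small next to the 14.38 GiB actually allocated, so the `max_split_size_mb` hint is unlikely to help much here. The 1290 MiB shown by nvidia-smi and the 15.77 GiB in the error cannot both describe the same moment; presumably the snapshot was taken while training was not running, but the thread does not confirm this.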

Expected behavior

How can I resolve this GPU memory problem?

System Info

Alibaba Cloud:

Image configuration

  • Image name: modelscope:1.13.3-pytorch2.1.2tensorflow2.14.0-gpu-py310-cu121-ubuntu22.04
  • Image ID: image-pn44enfzcrn4c7wloz
  • Image URL: dsw-registry-vpc.cn-shanghai.cr.aliyuncs.com/pai/modelscope:1.13.3-pytorch2.1.2tensorflow2.14.0-gpu-py310-cu121-ubuntu22.04

Resource configuration

  • Resource quota: public resource group
  • Instance type: ecs.gn6v-c8g1.2xlarge
  • CPU: 8
  • Memory: 32 GiB
  • GPU: 1
  • GPU model: NVIDIA V100

Others

No response

ysskevin avatar Apr 10 '24 05:04 ysskevin

You haven't said which parameter settings you used for training, so the problem can't be diagnosed.

codemayq avatar Apr 10 '24 05:04 codemayq

You haven't said which parameter settings you used for training, so the problem can't be diagnosed.

Everything is at its default value; I only changed the number of training epochs from the default 3 to 50.

ysskevin avatar Apr 10 '24 07:04 ysskevin

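Hmm, I still don't know what kind of training you are running. At the very least, tell us which script you followed and whether this is pre-training, SFT, or something else.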

codemayq avatar Apr 10 '24 07:04 codemayq

IMG_20240410_125254.jpg

ysskevin avatar Apr 10 '24 07:04 ysskevin

Set the batch size to 1.
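A smaller batch size reduces activation memory but not the memory taken by the model weights themselves. The back-of-the-envelope sketch below estimates the weight footprint at a few precisions. It assumes roughly 6.2 billion parameters for chatglm3-6b (an approximation, not stated in this thread) and ignores gradients, optimizer state, activations, and CUDA overhead, so real usage will be higher.

```python
GIB = 1024 ** 3

# Assumption: approximate parameter count of chatglm3-6b (not from the thread).
ASSUMED_PARAMS = 6.2e9

# Total GPU capacity as reported in the error message above.
GPU_TOTAL_GIB = 15.78


def weights_gib(num_params: float, bytes_per_param: float) -> float:
    """Memory needed to hold the weights alone, in GiB."""
    return num_params * bytes_per_param / GIB


# Bytes per parameter for common storage precisions.
PRECISIONS = [("fp32", 4.0), ("fp16/bf16", 2.0), ("int8", 1.0), ("int4", 0.5)]

for label, nbytes in PRECISIONS:
    need = weights_gib(ASSUMED_PARAMS, nbytes)
    headroom = GPU_TOTAL_GIB - need
    print(f"{label:>9}: weights ~{need:5.2f} GiB, headroom ~{headroom:6.2f} GiB")
```

Under these assumptions, half-precision weights alone take most of a 16 GB V100, which leaves only a few GiB for everything else, while full-precision weights would not fit at all. That is why reducing the batch size may not be enough on its own; a quantized or parameter-efficient setup would leave more headroom, though whether that applies here depends on the training configuration, which the thread does not show.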

hiyouga avatar Apr 10 '24 16:04 hiyouga

It still fails with the same error. image

ysskevin avatar Apr 11 '24 03:04 ysskevin

Did you get this solved? How did you fix it?

xuanxuanxuanxuan avatar Aug 08 '24 08:08 xuanxuanxuanxuan