LLaMA-Factory
Fine-tuning chatglm3-6b on an Alibaba Cloud V100: GPU memory is barely used, yet it fails with OutOfMemoryError: CUDA out of memory
Reminder
- [X] I have read the README and searched the existing issues.
Reproduction
a. GPU memory usage:

    NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 12.1
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  Tesla V100-SXM2...  Off  | 00000000:00:08.0 Off |                    0 |
    | N/A   36C    P0    56W / 300W |   1290MiB / 16160MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |===========================================================================

b. Error:

    torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.00 MiB. GPU 0 has a total capacty of 15.78 GiB of which 9.75 MiB is free. Process 4037 has 15.77 GiB memory in use. Of the allocated memory 14.38 GiB is allocated by PyTorch, and 147.60 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

c. Model being fine-tuned: chatglm3-6b
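The figures in the error message above can be pulled apart to see what they imply. The following is a minimal sketch, not part of the original report: the helper names `to_mib` and `parse_oom` are hypothetical, and the regular expressions assume the message wording shown above (including PyTorch's own spelling "capacty").

```python
import re

# Error text copied from the report (PyTorch's spelling "capacty" is kept).
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 16.00 MiB. "
    "GPU 0 has a total capacty of 15.78 GiB of which 9.75 MiB is free. "
    "Process 4037 has 15.77 GiB memory in use. "
    "Of the allocated memory 14.38 GiB is allocated by PyTorch, "
    "and 147.60 MiB is reserved by PyTorch but unallocated."
)

UNIT_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}


def to_mib(value: str, unit: str) -> float:
    """Convert a (number, unit) pair from the message into MiB."""
    return float(value) * UNIT_TO_MIB[unit]


def parse_oom(message: str) -> dict:
    """Extract the memory figures (in MiB) from an OOM message (hypothetical helper)."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
        "total": r"total capac\w* of ([\d.]+) (MiB|GiB)",
        "free": r"of which ([\d.]+) (MiB|GiB) is free",
        "allocated": r"([\d.]+) (MiB|GiB) is allocated by PyTorch",
        "reserved_unallocated": r"([\d.]+) (MiB|GiB) is reserved by PyTorch",
    }
    result = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, message)
        if match:
            result[name] = to_mib(*match.groups())
    return result


stats = parse_oom(OOM_MESSAGE)
for name, mib in stats.items():
    print(f"{name:>22}: {mib:10.2f} MiB")

# Share of the card that is reserved by PyTorch but not allocated.
fragmentation_share = stats["reserved_unallocated"] / stats["total"]
print(f"reserved-but-unallocated share: {fragmentation_share:.2%}")
```

Read this way, the 16 MiB request is larger than the 9.75 MiB still free, and the reserved-but-unallocated memory is under 1% of the card, so fragmentation (and therefore `max_split_size_mb`) is unlikely to be the cause: the GPU is simply full at the moment of the error. The 1290 MiB shown by `nvidia-smi` was presumably captured at a different moment than the failure, although the report does not say when.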
Expected behavior
How can this GPU memory problem be solved?
System Info
Alibaba Cloud

Image configuration:
- Image name: modelscope:1.13.3-pytorch2.1.2tensorflow2.14.0-gpu-py310-cu121-ubuntu22.04
- Image ID: image-pn44enfzcrn4c7wloz
- Image URL: dsw-registry-vpc.cn-shanghai.cr.aliyuncs.com/pai/modelscope:1.13.3-pytorch2.1.2tensorflow2.14.0-gpu-py310-cu121-ubuntu22.04

Resource configuration:
- Resource quota: public resource group
- Instance type: ecs.gn6v-c8g1.2xlarge
- CPU: 8
- Memory: 32 GiB
- GPU: 1
- GPU model: NVIDIA V100
Others
No response
You have not said which parameter settings you used for training, so the problem cannot be diagnosed.
Everything is at the default settings; I only changed the number of training epochs from the default 3 to 50.
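Hmm, I still don't know what kind of training you are running. At a minimum, say which script you followed and whether this is pre-training, SFT, or something else.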
Set the batch size to 1.
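A smaller batch size reduces activation memory only; the memory taken by weights, gradients, and optimizer states does not depend on the batch size. The sketch below is a back-of-the-envelope estimate, not a measurement. It assumes roughly 6 billion parameters for chatglm3-6b (read off the model name, not stated in this thread) and common rule-of-thumb byte counts per parameter: 2 bytes for fp16 weights, about 16 bytes for full fine-tuning with mixed-precision Adam, and 0.5 bytes for 4-bit quantized weights. The function name `estimate_gib` is hypothetical.

```python
GIB = 1024 ** 3


def estimate_gib(n_params: float, bytes_per_param: float) -> float:
    """Rough memory footprint in GiB for a given per-parameter byte cost."""
    return n_params * bytes_per_param / GIB


N_PARAMS = 6e9    # assumption: about 6 billion parameters
GPU_GIB = 15.78   # total capacity reported in the error message

# Rule-of-thumb byte costs per parameter (assumptions, activations excluded).
scenarios = {
    "fp16 weights only": 2,
    "full fine-tuning, mixed-precision Adam": 16,
    "4-bit quantized weights only": 0.5,
}

estimates = {name: estimate_gib(N_PARAMS, cost) for name, cost in scenarios.items()}

for name, gib in estimates.items():
    verdict = "fits" if gib < GPU_GIB else "does not fit"
    print(f"{name:<40} {gib:7.2f} GiB  ({verdict} in {GPU_GIB} GiB)")
```

Under these assumptions the fp16 weights alone take roughly 11 GiB of the 15.78 GiB card, which is in line with the 14.38 GiB that PyTorch reports as allocated. That would explain why the error persists at batch size 1. Whether the run used full-parameter training or a parameter-efficient method is not stated in the thread, so this is only an indication of where the memory is likely going.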
It still fails with the same error.
Has this been solved? How did you fix it?