LLaMA-Factory: Fine-tuning chatglm3-6b on an Alibaba Cloud V100 raises OutOfMemoryError: CUDA out of memory, even though little GPU memory appears to be in use

Open ysskevin opened this issue 10 months ago • 4 comments

Reminder

[X] I have read the README and searched the existing issues.

Reproduction

nvidia-smi output (truncated; only the header of the Processes table survives): GPU, GI ID, CI ID, PID, Type, Process name, GPU Memory Usage

c. Model being fine-tuned: chatglm3-6b

b. Error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.00 MiB. GPU 0 has a total capacty of 15.78 GiB of which 9.75 MiB is free. Process 4037 has 15.77 GiB memory in use. Of the allocated memory 14.38 GiB is allocated by PyTorch, and 147.60 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
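The numbers in the error message above can be read off programmatically to see where the memory went. The following is a minimal sketch, not part of LLaMA-Factory or PyTorch: `parse_oom` and its regular expressions are hypothetical helpers written for this one message format, and the message text is copied from the report (including PyTorch's own "capacty" spelling).

```python
import re

# Error text copied from the report above; "capacty" is spelled that way
# in the original message.
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 16.00 MiB. "
    "GPU 0 has a total capacty of 15.78 GiB of which 9.75 MiB is free. "
    "Process 4037 has 15.77 GiB memory in use. "
    "Of the allocated memory 14.38 GiB is allocated by PyTorch, "
    "and 147.60 MiB is reserved by PyTorch but unallocated."
)

MIB_PER_UNIT = {"MiB": 1.0, "GiB": 1024.0}

# Hypothetical patterns for this message format only.
PATTERNS = {
    "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
    "total": r"total capac\w* of ([\d.]+) (MiB|GiB)",
    "free": r"of which ([\d.]+) (MiB|GiB) is free",
    "allocated": r"([\d.]+) (MiB|GiB) is allocated by PyTorch",
    "reserved_unallocated": r"([\d.]+) (MiB|GiB) is reserved by PyTorch",
}


def parse_oom(message):
    """Return the sizes mentioned in a CUDA OOM message, converted to MiB."""
    sizes = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, message)
        if match:
            value, unit = match.groups()
            sizes[name] = float(value) * MIB_PER_UNIT[unit]
    return sizes


info = parse_oom(OOM_MESSAGE)
share = info["allocated"] / info["total"]
print(f"Allocated by PyTorch: {info['allocated']:.0f} MiB of {info['total']:.0f} MiB")
print(f"Share of the card held by live tensors: {share:.0%}")
print(f"Reserved but unallocated: {info['reserved_unallocated']:.0f} MiB")
```

Read this way, the message indicates that about 91% of the card is held by live PyTorch tensors and only about 148 MiB is reserved but unallocated, so in this case the card looks genuinely full rather than fragmented, and `max_split_size_mb` is unlikely to help much on its own.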

Expected behavior

How can I resolve this GPU out-of-memory problem?
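As a rough aid for judging what might fit on a 16 GiB card, the sketch below estimates memory for a model of this size under a few settings. Everything here is an assumption rather than something stated in the report: the parameter count of roughly 6.2 billion, the bytes-per-parameter figures (common rules of thumb), and the helper `estimate_gib`. Activations, LoRA adapters, and the CUDA context are not counted, and the report does not say which precision or fine-tuning method was used.

```python
GIB = 2 ** 30

# Assumption: chatglm3-6b has roughly 6.2 billion parameters.
PARAM_COUNT = 6.2e9

# Rule-of-thumb bytes per parameter; real usage also includes
# activations and framework overhead, which are ignored here.
BYTES_PER_PARAM = {
    "fp16 weights only": 2.0,
    "4-bit weights only": 0.5,
    "full fine-tuning, mixed precision + Adam": 16.0,
}


def estimate_gib(param_count, bytes_per_param):
    """Estimate memory in GiB for a given parameter count and storage cost."""
    return param_count * bytes_per_param / GIB


estimates = {
    label: estimate_gib(PARAM_COUNT, cost)
    for label, cost in BYTES_PER_PARAM.items()
}

for label, gib in estimates.items():
    print(f"{label}: {gib:.1f} GiB (card total: 15.78 GiB)")
```

Under these assumptions, fp16 weights alone take about 11.5 GiB, which leaves only a few GiB of the card for activations, gradients, and optimizer state. That would be consistent with the 14.38 GiB PyTorch reports as allocated. Options that usually lower the footprint include loading the base model with quantized weights, reducing the batch size, and shortening the maximum sequence length.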

System Info

Alibaba Cloud:

Image configuration
Image name: modelscope:1.13.3-pytorch2.1.2tensorflow2.14.0-gpu-py310-cu121-ubuntu22.04
Image ID: image-pn44enfzcrn4c7wloz
Image URL: dsw-registry-vpc.cn-shanghai.cr.aliyuncs.com/pai/modelscope:1.13.3-pytorch2.1.2tensorflow2.14.0-gpu-py310-cu121-ubuntu22.04

Resource configuration
Resource quota: public resource group
Instance type: ecs.gn6v-c8g1.2xlarge
CPU: 8
Memory: 32 GiB
GPU: 1
GPU model: NVIDIA V100