
[BUG/Help] Plenty of GPU memory available, but only 5 GB is allocated at application startup, so the model fails to load

Open aleimu opened this issue 2 years ago • 2 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Current Behavior

(screenshots attached) Roughly the situation shown above: after starting the application and checking the GPU memory details, only 5 GB is allocated by default, so the application cannot load the model, even though the host has 22 GB of GPU memory and most of it is free.

Expected Behavior

No response

Steps To Reproduce

This may be an unusual case; I have no idea how to track it down.

Environment

- OS: Ubuntu
- Python: 3.9
- Transformers: transformers==4.27.1
- PyTorch: torch==1.10.1 (same behavior with torch==2.0.0)
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) :True
Python 3.9.16 (main, Dec  7 2022, 01:11:58)
[GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
>>> gpu_id = torch.cuda.current_device()
>>> torch.cuda.set_device(0)
>>> batch_size = 1024
>>> data_shape = (3, 224, 224)
>>> tensor = torch.zeros([batch_size] + list(data_shape)).cuda(device=0)
>>> batch_size = 10240
>>> tensor = torch.zeros([batch_size] + list(data_shape)).cuda(device=0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: CUDA out of memory. Tried to allocate 5.74 GiB (GPU 0; 5.50 GiB total capacity; 588.00 MiB already allocated; 4.13 GiB free; 588.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
>>>
>>> torch.set_default_tensor_type(torch.FloatTensor)
>>> total_memory = torch.cuda.max_memory_allocated()
>>> free_memory = torch.cuda.max_memory_cached()
/home/admin/langchain-ChatGLM/venv/lib/python3.9/site-packages/torch/cuda/memory.py:392: FutureWarning: torch.cuda.max_memory_cached has been renamed to torch.cuda.max_memory_reserved
  warnings.warn(
>>>
>>> print(f"Total memory: {total_memory / (1024 ** 3):.2f} GB")
Total memory: 0.57 GB
>>> print(f"Free memory: {free_memory / (1024 ** 3):.2f} GB")
Free memory: 0.57 GB

Anything else?

https://github.com/pytorch/pytorch/issues/67680
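As a side note, the OOM message itself suggests trying `max_split_size_mb` if reserved memory far exceeds allocated memory. A minimal sketch of how to set it (the variable must be exported before the first CUDA allocation; the 128 value and the `web_demo.py` entry script are placeholders to adapt):

```shell
# Cap the allocator's block split size to reduce fragmentation (value needs tuning)
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
python web_demo.py  # substitute your actual launch script
```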

aleimu avatar May 10 '23 10:05 aleimu

CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 5.50 GiB total capacity; 4.69 GiB already allocated; 19.00 MiB free; 4.69 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

aleimu avatar May 10 '23 10:05 aleimu

I'm hitting the same problem. Could it be related to the CUDA version? Mine is also CUDA 11.0.

luocj avatar Jun 19 '23 08:06 luocj