llama-cpp-python
After offloading all layers to the GPU, the RAM used for model loading is not released
My graphics card is an RTX 3060 12G and the model is Qwen2.5-7B-Instruct-Q4_K_M. Normally this quantized model should only take 4~5 GB of VRAM, so I expected my GPU memory to be sufficient on its own. However, I found that system RAM stays occupied the whole time the model is loaded. The per-application RAM usage shown in Windows Task Manager does not add up to the total RAM usage it reports, and the memory is only released when the Python script exits. Is this RAM usage necessary, or is it a bug?
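
For reference, this is roughly how the model is loaded (a minimal sketch; the model path and context size below are placeholders, not my exact script):

```python
from llama_cpp import Llama

# Load the quantized model and offload every layer to the GPU.
llm = Llama(
    model_path="./Qwen2.5-7B-Instruct-Q4_K_M.gguf",  # hypothetical local path
    n_gpu_layers=-1,  # offload all layers to the GPU (RTX 3060 12G)
    n_ctx=4096,       # assumed context size
    verbose=True,
)

# Simple generation call; system RAM stays occupied while `llm` is alive.
print(llm("Hello", max_tokens=16))
```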