llama-cpp-python
After offloading all layers to the GPU, the RAM used for model loading is not released
My graphics card is an RTX 3060 12G and the model is Qwen2.5-7B-Instruct-Q4_K_M. Normally this quantized model should only take 4~5 GB of VRAM, so I expected my GPU memory to be sufficient on its own. However, I found that system RAM stays occupied the whole time the model is loaded. The per-application RAM usage shown in Windows Task Manager does not add up to the total RAM usage it reports, and the memory is only released when the Python script exits. Is this RAM usage necessary, or is it a bug?
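
For reference, this is roughly how the model is loaded (a minimal sketch; the model path and context size below are placeholders, not my exact script):

```python
from llama_cpp import Llama

# Load the quantized model and offload every layer to the GPU.
llm = Llama(
    model_path="./Qwen2.5-7B-Instruct-Q4_K_M.gguf",  # hypothetical local path
    n_gpu_layers=-1,  # offload all layers to the GPU (RTX 3060 12G)
    n_ctx=4096,       # assumed context size
    verbose=True,
)

# Simple generation call; system RAM stays occupied while `llm` is alive.
print(llm("Hello", max_tokens=16))
```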