llama-cpp-python
GPU memory not released for llava multimodal
When I start the llava 13B model with the llama-cpp-python server, I notice that GPU memory usage increases a little after each inference, which suggests the GPU memory is not being released after each call. How should this be resolved? I hope you can help!
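For reference, a minimal way to observe the growth is to poll nvidia-smi between requests to the server's OpenAI-compatible endpoint. This is only a sketch: it assumes the server is running on the default localhost:8000, and the prompt and image URL are placeholders.

```python
import subprocess
import requests

def gpu_memory_used_mib() -> int:
    # nvidia-smi reports used memory in MiB; take the first GPU's value.
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used",
         "--format=csv,noheader,nounits"]
    )
    return int(out.decode().splitlines()[0])

# Placeholder multimodal request in the OpenAI chat-completions format.
payload = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/test.png"}},
                {"type": "text", "text": "Describe the image."},
            ],
        }
    ],
    "max_tokens": 64,
}

for i in range(5):
    requests.post("http://localhost:8000/v1/chat/completions", json=payload)
    print(f"after request {i + 1}: {gpu_memory_used_mib()} MiB used")
```

If memory were released after each call, the reported value should stay roughly flat after the first request instead of climbing.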
@adogwangwang could you provide more info on which backend (I'm assuming CUDA not Metal) and which version you're running.
Hello, I am using llama-cpp-python 0.2.64. When I run llava 1.5 13B multimodal, here is my command:
When I use the llava 13B model, I notice that GPU memory usage increases after each inference, which suggests the memory is not being released. Additionally, after multiple inferences the model starts to give erratic responses. I tried asking questions without attaching any image, and surprisingly the responses were still related to the previous image. This indicates that the image variables are not being cleared, which leads to both the unreleased memory and the subsequent confused responses. I hope to get some clarification on this issue! @abetlen
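To check whether the image embedding is being carried over between calls, a rough sketch against the Python API directly (no server) could look like the following. The model paths and image URL are placeholders, assuming llava-1.5 13B GGUF weights with the matching mmproj projector file.

```python
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

# Placeholder paths: the llava-1.5 chat handler needs the CLIP projector.
chat_handler = Llava15ChatHandler(clip_model_path="./mmproj-model-f16.gguf")
llm = Llama(
    model_path="./llava-v1.5-13b.Q4_K_M.gguf",
    chat_handler=chat_handler,
    n_ctx=4096,
    n_gpu_layers=-1,
)

image_message = [
    {
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/test.png"}},
            {"type": "text", "text": "What is in this image?"},
        ],
    }
]
text_only_message = [
    {"role": "user", "content": "What did we talk about before?"}
]

# Alternate image and text-only requests; if the image embedding were being
# released, the text-only answers should not reference the earlier picture.
for _ in range(3):
    out = llm.create_chat_completion(messages=image_message)
    print(out["choices"][0]["message"]["content"])
    out = llm.create_chat_completion(messages=text_only_message)
    print(out["choices"][0]["message"]["content"])
```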
This happens for me as well with Eris Prime Punch 9B on a 4090 using CUDA.