llama-cpp-python
GPU memory not released for llava multimodal
When I start the llava 13B model with the llama-cpp-python server, I notice that GPU memory usage increases a little after each inference, which suggests the GPU memory is not being released after each call. How should this be resolved? I hope you can help!
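For reference, a minimal way to observe the growth is to poll nvidia-smi between requests to the server's OpenAI-compatible endpoint. This is only a sketch: it assumes the server is running on the default localhost:8000, and the prompt and image URL are placeholders.

```python
import subprocess
import requests

def gpu_memory_used_mib() -> int:
    # nvidia-smi reports used memory in MiB; take the first GPU's value.
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used",
         "--format=csv,noheader,nounits"]
    )
    return int(out.decode().splitlines()[0])

# Placeholder multimodal request in the OpenAI chat-completions format.
payload = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/test.png"}},
                {"type": "text", "text": "Describe the image."},
            ],
        }
    ],
    "max_tokens": 64,
}

for i in range(5):
    requests.post("http://localhost:8000/v1/chat/completions", json=payload)
    print(f"after request {i + 1}: {gpu_memory_used_mib()} MiB used")
```

If memory were released after each call, the reported value should stay roughly flat after the first request instead of climbing.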
@adogwangwang could you provide more info on which backend (I'm assuming CUDA not Metal) and which version you're running.
Hello, I am using llama-cpp-python 0.2.64. When I run llava 1.5 13B multimodal, here is my command:
When I use the llava 13B model, I notice that GPU memory usage increases after each inference, which suggests the memory is not being released. Additionally, after multiple inferences the model starts to give erratic responses. I tried asking questions without attaching any image, and surprisingly the responses were still related to the previous image. This indicates that the image variables are not being cleared, which leads to both the unreleased memory and the subsequent confused responses. I hope to get some clarification on this issue! @abetlen
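To check whether the image embedding is being carried over between calls, a rough sketch against the Python API directly (no server) could look like the following. The model paths and image URL are placeholders, assuming llava-1.5 13B GGUF weights with the matching mmproj projector file.

```python
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

# Placeholder paths: the llava-1.5 chat handler needs the CLIP projector.
chat_handler = Llava15ChatHandler(clip_model_path="./mmproj-model-f16.gguf")
llm = Llama(
    model_path="./llava-v1.5-13b.Q4_K_M.gguf",
    chat_handler=chat_handler,
    n_ctx=4096,
    n_gpu_layers=-1,
)

image_message = [
    {
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/test.png"}},
            {"type": "text", "text": "What is in this image?"},
        ],
    }
]
text_only_message = [
    {"role": "user", "content": "What did we talk about before?"}
]

# Alternate image and text-only requests; if the image embedding were being
# released, the text-only answers should not reference the earlier picture.
for _ in range(3):
    out = llm.create_chat_completion(messages=image_message)
    print(out["choices"][0]["message"]["content"])
    out = llm.create_chat_completion(messages=text_only_message)
    print(out["choices"][0]["message"]["content"])
```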
This happens for me as well with Eris Prime Punch 9B on a 4090 using CUDA.