llama-cpp-python
When deploying the llama-cpp-python llava server in k8s as a service, it can only answer questions about the first image
@abetlen Hello, when I deploy a llava-13b service with `python -m llama_cpp.server` on the Kubernetes platform, I notice that only the first image is answered correctly. When I switch to another image, the responses become completely confused; the only workaround is to restart the service and resubmit the query. I suspect that the previous image's state is not being released properly, so subsequent images are not parsed on their own but instead seem to be merged with earlier images before being fed into the model, leading to highly inaccurate responses. This problem still seems to be related to VRAM. How should I resolve this issue?
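For context, here is roughly how the service is queried, as a minimal reproduction sketch. The launch flags in the comment, the endpoint URL, and the image paths are placeholders for my actual deployment, not the exact values:

```python
# Assumed server launch (placeholder paths):
#   python -m llama_cpp.server --model llava-13b.gguf \
#       --clip_model_path mmproj.gguf --chat_format llava-1-5
import base64

import requests

SERVER_URL = "http://localhost:8000/v1/chat/completions"  # placeholder endpoint


def image_to_data_url(path: str) -> str:
    """Encode a local image as a base64 data URL for the OpenAI-style API."""
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:image/jpeg;base64,{encoded}"


def ask_about_image(path: str, question: str) -> str:
    """Send one multimodal chat completion request and return the answer."""
    payload = {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url",
                     "image_url": {"url": image_to_data_url(path)}},
                    {"type": "text", "text": question},
                ],
            }
        ],
        "max_tokens": 256,
    }
    response = requests.post(SERVER_URL, json=payload, timeout=300)
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]


# The first request is answered correctly; the second, with a different
# image, comes back confused, as if the old image were still in context.
print(ask_about_image("first.jpg", "What is in this image?"))
print(ask_about_image("second.jpg", "What is in this image?"))
```

Each request is independent (no shared conversation history), so I would expect the second image to be evaluated on its own.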