Speedway1

Results: 35 comments by Speedway1

> We now add a low GPU Memory cost version, it was tested on a machine with 24GB GPU memory (Tesla A10) and 30GB RAM and is expected to work...

While we are waiting for an official fix, you can do this: edit the file lib/python3.11/site-packages/litellm/llms/ollama.py and, at line 322, change this: completion_tokens = response_json["eval_count"] to this: completion_tokens = response_json.get("eval_count",...
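
For reference, a minimal sketch of why the `.get()` form avoids the crash is below. The fallback value of 0 is an assumption on my part, since the comment is truncated before the default argument, and the example payload is made up:

```python
# Ollama does not always include "eval_count" in its response JSON, so indexing
# the key directly raises a KeyError. Using .get() with a fallback avoids that.
response_json = {"response": "...", "done": True}  # example payload without "eval_count"

# old (crashes when the key is missing):
# completion_tokens = response_json["eval_count"]

# patched (0 is an assumed fallback; the original comment truncates the default):
completion_tokens = response_json.get("eval_count", 0)
print(completion_tokens)  # -> 0
```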

Same here, I'm hitting the same issue.

I have solved the bug and raised ticket #867. However, for those in a rush, here is a quick and dirty fix until the patch is applied: 1)...

Same here. GPU memory is not freed between completions. Using SOLAR-10.7B-Instruct-v1.0-AWQ on a 24GB RTX 4090: it starts off at 20444MiB / 24564MiB, but within 4 to 10...

This can be tested simply as follows: 1) start up openllm, 2) use nvidia-smi to monitor GPU usage, 3) use "openllm query" to send about 10 or 20 requests...
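
Roughly, the steps above can be scripted as in the sketch below. It assumes a model is already being served and that both openllm and nvidia-smi are on PATH; the prompt text and request count are just placeholders:

```python
# Reproduction sketch: send repeated queries and sample GPU memory after each one.
import subprocess

def gpu_memory_used_mib() -> str:
    # Ask nvidia-smi for the current GPU memory usage in MiB.
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

print(f"before: {gpu_memory_used_mib()} MiB")
for i in range(20):  # ~10-20 requests is enough to see the growth
    # Placeholder prompt; any query against the running server will do.
    subprocess.run(["openllm", "query", "Write a short poem about GPUs."], check=True)
    print(f"after request {i + 1}: {gpu_memory_used_mib()} MiB")
```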

I've now tested Ollama and it doesn't have this issue: memory consumption remains unchanged between calls. Testing with the llamacpp server extension produces a core dump, so it's worse. This is probably...

Actually, the docs are wrong. You need to git clone the repo, then cd into it, then run pip install -e '.[flash-attn,deepspeed]', and you will almost certainly also need: pip...

To confirm: we're also seeing the exact same issue.

Two GitHub issues were raised that identified the problem in the code, but they were automatically closed due to inactivity: https://github.com/ggerganov/llama.cpp/issues/5112 https://github.com/ggerganov/llama.cpp/issues/4360 It looks like the bug is in the handling of token 354...