Speedway1

Results: 35 comments by Speedway1

> We now add a low GPU Memory cost version, it was tested on a machine with 24GB GPU memory (Tesla A10) and 30GB RAM and is expected to work...

While we are waiting for an official fix, you can do this: edit the file lib/python3.11/site-packages/litellm/llms/ollama.py and, at line 322, change this: completion_tokens = response_json["eval_count"] to this: completion_tokens = response_json.get("eval_count",...
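
For reference, a minimal sketch of why the `.get()` form avoids the crash is below. The fallback value of 0 is an assumption on my part, since the comment is truncated before the default argument, and the example payload is made up:

```python
# Ollama does not always include "eval_count" in its response JSON, so indexing
# the key directly raises a KeyError. Using .get() with a fallback avoids that.
response_json = {"response": "...", "done": True}  # example payload without "eval_count"

# old (crashes when the key is missing):
# completion_tokens = response_json["eval_count"]

# patched (0 is an assumed fallback; the original comment truncates the default):
completion_tokens = response_json.get("eval_count", 0)
print(completion_tokens)  # -> 0
```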

Same here, I'm hitting the same issue.

I have solved the bug and raised ticket #867. However, for those in a rush, here is a quick and dirty fix until the patch is applied: 1)...

Same here. GPU memory is not freed between completions. Using SOLAR-10.7B-Instruct-v1.0-AWQ on a 24GB RTX 4090: it starts off at 20444MiB / 24564MiB, but within 4 to 10...

This can be tested simply as follows: 1) start up openllm, 2) use nvidia-smi to monitor GPU usage, 3) use "openllm query" to send about 10 or 20 requests...
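
Roughly, the steps above can be scripted as in the sketch below. It assumes a model is already being served and that both openllm and nvidia-smi are on PATH; the prompt text and request count are just placeholders:

```python
# Reproduction sketch: send repeated queries and sample GPU memory after each one.
import subprocess

def gpu_memory_used_mib() -> str:
    # Ask nvidia-smi for the current GPU memory usage in MiB.
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

print(f"before: {gpu_memory_used_mib()} MiB")
for i in range(20):  # ~10-20 requests is enough to see the growth
    # Placeholder prompt; any query against the running server will do.
    subprocess.run(["openllm", "query", "Write a short poem about GPUs."], check=True)
    print(f"after request {i + 1}: {gpu_memory_used_mib()} MiB")
```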

I've now tested Ollama and it doesn't have this issue: memory consumption remains unchanged between calls. Testing with the llamacpp server extension produces a core dump, so it's worse. This is probably...

Actually, the docs are wrong. You need to git clone the repo, then cd into it, then run pip install -e '.[flash-attn,deepspeed]', and you will almost certainly also need: pip...

To confirm: we're also seeing the exact same issue.

Two GitHub issues were raised that identified the problem in the code, but they were automatically closed due to inactivity: https://github.com/ggerganov/llama.cpp/issues/5112 https://github.com/ggerganov/llama.cpp/issues/4360 It looks like the bug is in the handling of token 354...