
Python bindings for llama.cpp

Results: 424 llama-cpp-python issues, sorted by recently updated

Add CPU wheels with AVX, AVX2, AVX512 with OpenBLAS & remove unnecessary 32-bit wheels - Without AVX: Ubuntu, Windows => 32 bits, mac => 64 bits - AVX: Ubuntu,...

Replace uvicorn with hypercorn to support IPv6
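The motivation for the swap is that hypercorn can bind dual-stack. A sketch of what the invocation might look like; the module path `llama_cpp.server.app:app` is an assumption here, not the project's confirmed entry point:

```shell
# Bind hypercorn to both IPv4 and IPv6 (the app path below is hypothetical;
# substitute the server's actual ASGI application).
hypercorn --bind '0.0.0.0:8000' --bind '[::]:8000' llama_cpp.server.app:app
```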

# Expected Behavior
The server should cache both the previous prompt and the last generation.
# Current Behavior
The cache misses at the end of the previous prompt, forcing to...
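The caching behavior above comes down to prefix matching. A minimal stdlib sketch (function and variable names are illustrative, not the project's actual cache code) showing why caching prompt *plus* generation lets a follow-up request hit the full cached sequence:

```python
def common_prefix_len(cached, new_tokens):
    """Length of the shared token prefix between cache and new prompt."""
    n = 0
    for a, b in zip(cached, new_tokens):
        if a != b:
            break
        n += 1
    return n

# Toy token ids. Caching prompt + generation (what the issue asks for)
# means a follow-up prompt that continues the conversation matches the
# whole cached sequence instead of missing at the generation boundary.
prev_prompt = [1, 15043, 29892]
generation = [3186, 29991]
cache = prev_prompt + generation

next_prompt = prev_prompt + generation + [1128]
hit = common_prefix_len(cache, next_prompt)  # full cache reused
```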

With the update to v1, OpenAI's API changed significantly. While backwards compatibility was straightforward to preserve on the server, the Python API is lagging. The main difference in the pre...

enhancement

Fixes #599. Thanks for all your work on this project!

Having multiple BOS tokens can ruin generation. This can occur in several ways, usually through the user adding them unnecessarily; in this case, remove the first token if we detect two in a...

# Prerequisites Please answer the following questions for yourself before submitting an issue. - [x] I am running the latest code. Development is very rapid so there are no tagged...

bug

# Expected Behavior
From issue #302, I expected the model to be unloaded with the following function:
```
def unload_model():
    global llm
    llama_free_model(llm)
    # Delete the model object...
```

bug

I upgraded from an older version and experienced a disturbingly long read-ahead time. The load on my machine is about the same (a bit higher with Python, but that's understandable)...

Next step towards #1336: adds a new parameter to pass arbitrary arguments to the template, much like transformers, except through an explicit parameter instead of just plain...
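The idea of threading caller-supplied keyword arguments through to a chat template can be sketched with the stdlib's `string.Template` standing in for the real Jinja machinery; the function and parameter names below are hypothetical:

```python
from string import Template

def render_chat(template: str, messages, **template_kwargs):
    """Toy renderer: the actual feature would forward **template_kwargs
    into the Jinja chat template, much like transformers'
    apply_chat_template accepts extra keyword arguments."""
    lines = [f"{m['role']}: {m['content']}" for m in messages]
    body = "\n".join(lines)
    return Template(template).safe_substitute(body=body, **template_kwargs)

out = render_chat(
    "[$system_hint]\n$body",
    [{"role": "user", "content": "hi"}],
    system_hint="be brief",  # arbitrary, caller-supplied template argument
)
```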