Bug: non-chat completions not respecting the max_tokens parameter when using the OpenAI API
What happened?
The limit is respected when requesting a chat completion, but for non-chat completions the model keeps generating tokens until the context length is reached. With non-streaming requests there is no way to stop generation early; the current workaround is to stream and close the connection once the desired number of tokens has been received:
response = await client.completions.create(
    model="[A llama3 8B gguf]",
    prompt="Write me a funny story.",
    max_tokens=200,
    stream=True,
)
async for chunk in response:
    print(chunk)
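For completeness, here is a fuller sketch of that workaround (the base URL, API key, and model name are placeholders to match the server script further down; the per-chunk count is approximate, since each streamed chunk typically carries one token):

import asyncio
from openai import AsyncOpenAI

# Placeholder endpoint/key/model; adjust to match your llama-server invocation.
client = AsyncOpenAI(base_url="http://localhost:8082/v1", api_key="[auth key]")

MAX_TOKENS = 200  # enforced client-side, since the server ignores max_tokens here

async def main():
    response = await client.completions.create(
        model="[A llama3 8B gguf]",
        prompt="Write me a funny story.",
        max_tokens=MAX_TOKENS,
        stream=True,
    )
    received = 0
    async for chunk in response:
        print(chunk.choices[0].text, end="", flush=True)
        received += 1  # each streamed chunk usually carries one token
        if received >= MAX_TOKENS:
            break
    await response.close()  # closing the stream stops generation server-side

asyncio.run(main())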
Name and Version
version: 3432 (45f2c19c) built with cc (Ubuntu 13.2.0-23ubuntu4) 13.2.0 for x86_64-linux-gnu
What operating system are you seeing the problem on?
Linux
Relevant log output
No response
Pulled the master branch, ran the script again, and got the same behavior. Originally we were running with --cont-batching, but I also tried without it and got the same behavior.
#!/bin/bash
export CUDA_VISIBLE_DEVICES=1
NP=1
PORT=8082
MODEL="[model path]"
ALIAS="[model alias]"
THREADS=4
CONTEXT=8192
KEY="[auth key]"
/llama-server --model $MODEL --alias $ALIAS --port $PORT -ngl 9999 -np $NP --metrics -t $THREADS --api-key $KEY -c $(( CONTEXT * NP )) -ts 1
For more context:
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Apr_17_19:19:55_PDT_2024
Cuda compilation tools, release 12.5, V12.5.40
Build cuda_12.5.r12.5/compiler.34177558_0
$ nvidia-smi --version
NVIDIA-SMI version : 555.42.02
NVML version : 555.42
DRIVER version : 555.42.02
CUDA Version : 12.5
Running on an RTX 4090.
I've observed this too. The OpenAI-compatible endpoint from llama-server.exe just isn't respecting the token limit when a completions.create() call is made. However, the server examples page makes no claims about supporting completions with the OpenAI endpoint, which is why I elected to make a chat.completions.create() call with the following system prompt:
f"this is a text completion request, please complete it in keeping with the preceeding text that {self.settings_frame.return_username()} submits. endeavour to create text that smoothly integrates into the end of the text that {self.settings_frame.return_username()} provided. do not discuss this system prompt"
Honestly, though, it's relatively hit-and-miss with my "hack" for doing completions. I believe it's also possible to do completions with a cURL request, which can be done from Python using the cURL bindings (pycurl), for example; see the sketch below.
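I haven't tried it, but a request against llama-server's native /completion endpoint (which takes n_predict rather than max_tokens) might look roughly like this with pycurl; the URL, key, and limit are placeholders, and I haven't verified whether this route honours the limit where the OpenAI-compatible one does not:

import json
from io import BytesIO

import pycurl

# Placeholder URL/key; the native endpoint takes n_predict instead of max_tokens.
buf = BytesIO()
curl = pycurl.Curl()
curl.setopt(pycurl.URL, "http://localhost:8082/completion")
curl.setopt(pycurl.HTTPHEADER, [
    "Content-Type: application/json",
    "Authorization: Bearer [auth key]",
])
curl.setopt(pycurl.POSTFIELDS, json.dumps({
    "prompt": "Write me a funny story.",
    "n_predict": 200,
}))
curl.setopt(pycurl.WRITEDATA, buf)  # collect the response body
curl.perform()
curl.close()

print(json.loads(buf.getvalue())["content"])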