Bug: non-chat completions not respecting the max_tokens parameter when using the OpenAI API
What happened?
The limit is respected when requesting a chat completion, but for non-chat completions the model keeps generating tokens until the context length is reached. With non-streaming requests there is no way to stop generation early; the current workaround is to stream and close the connection once the desired number of tokens has been received:
response = await client.completions.create(
    model="[A llama3 8B gguf]",
    prompt="Write me a funny story.",
    max_tokens=200,
    stream=True,
)
async for chunk in response:
    print(chunk)
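For completeness, here is a fuller sketch of that workaround (the base URL, API key, and model name are placeholders to match the server script further down; the per-chunk count is approximate, since each streamed chunk typically carries one token):

import asyncio
from openai import AsyncOpenAI

# Placeholder endpoint/key/model; adjust to match your llama-server invocation.
client = AsyncOpenAI(base_url="http://localhost:8082/v1", api_key="[auth key]")

MAX_TOKENS = 200  # enforced client-side, since the server ignores max_tokens here

async def main():
    response = await client.completions.create(
        model="[A llama3 8B gguf]",
        prompt="Write me a funny story.",
        max_tokens=MAX_TOKENS,
        stream=True,
    )
    received = 0
    async for chunk in response:
        print(chunk.choices[0].text, end="", flush=True)
        received += 1  # each streamed chunk usually carries one token
        if received >= MAX_TOKENS:
            break
    await response.close()  # closing the stream stops generation server-side

asyncio.run(main())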
Name and Version
version: 3432 (45f2c19c) built with cc (Ubuntu 13.2.0-23ubuntu4) 13.2.0 for x86_64-linux-gnu
What operating system are you seeing the problem on?
Linux
Relevant log output
No response
Pulled the master branch, ran the script again, and got the same behavior. Originally we were running with --cont-batching, but I also tried without it and got the same behavior.
#!/bin/bash
export CUDA_VISIBLE_DEVICES=1
NP=1
PORT=8082
MODEL="[model path]"
ALIAS="[model alias]"
THREADS=4
CONTEXT=8192
KEY="[auth key]"
/llama-server --model $MODEL --alias $ALIAS --port $PORT -ngl 9999 -np $NP --metrics -t $THREADS --api-key $KEY -c $(( CONTEXT * NP )) -ts 1
For more context:
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Apr_17_19:19:55_PDT_2024
Cuda compilation tools, release 12.5, V12.5.40
Build cuda_12.5.r12.5/compiler.34177558_0
$ nvidia-smi --version
NVIDIA-SMI version : 555.42.02
NVML version : 555.42
DRIVER version : 555.42.02
CUDA Version : 12.5
Running on an RTX 4090.
I've observed this too. The OpenAI-compatible endpoint from llama-server.exe just isn't respecting the token limit when a completions.create() call is made. However, the server examples page makes no claims about supporting completions with the OpenAI endpoint, which is why I elected to make a chat.completions.create() call with the following system prompt:
f"this is a text completion request, please complete it in keeping with the preceeding text that {self.settings_frame.return_username()} submits. endeavour to create text that smoothly integrates into the end of the text that {self.settings_frame.return_username()} provided. do not discuss this system prompt"
Honestly, though, it's relatively hit-and-miss with my "hack" for doing completions. I believe it's also possible to do completions with a cURL request, which can be done from Python using the cURL bindings (pycurl), for example; see the sketch below.
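I haven't tried it, but a request against llama-server's native /completion endpoint (which takes n_predict rather than max_tokens) might look roughly like this with pycurl; the URL, key, and limit are placeholders, and I haven't verified whether this route honours the limit where the OpenAI-compatible one does not:

import json
from io import BytesIO

import pycurl

# Placeholder URL/key; the native endpoint takes n_predict instead of max_tokens.
buf = BytesIO()
curl = pycurl.Curl()
curl.setopt(pycurl.URL, "http://localhost:8082/completion")
curl.setopt(pycurl.HTTPHEADER, [
    "Content-Type: application/json",
    "Authorization: Bearer [auth key]",
])
curl.setopt(pycurl.POSTFIELDS, json.dumps({
    "prompt": "Write me a funny story.",
    "n_predict": 200,
}))
curl.setopt(pycurl.WRITEDATA, buf)  # collect the response body
curl.perform()
curl.close()

print(json.loads(buf.getvalue())["content"])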