
Stop LLM output on user request?

Open woheller69 opened this issue 1 year ago • 11 comments

Is there a way to stop inference manually, e.g. by returning False from the streaming_callback? If the user presses the stop button in a UI, how could that be handled?

woheller69 avatar May 05 '24 19:05 woheller69

I'm not sure how to do it properly in llama_cpp_python, but it should be possible. Will add this ASAP.

Maximilian-Winter avatar May 05 '24 20:05 Maximilian-Winter

It is possible to use the break keyword, or, if you are using a request, you can also have a signal to control when to finish the request.

pabl-o-ce avatar May 06 '24 04:05 pabl-o-ce

It is not about a keyword. If a long text is generated and it goes in the wrong direction, I want to stop it without losing the context by killing the process. The Python bindings of gpt4all, for example, have a callback similar to streaming_callback: if True is returned, it continues; if False is returned, it stops. In this callback I can check whether a button has been pressed and then return True/False.
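
A rough sketch of that callback pattern, for illustration only (the signature below is made up for the example and not taken from any particular library):

import threading

stop_requested = threading.Event()  # set from the UI's stop button

def streaming_callback(token: str) -> bool:
    # Print each token as it arrives; returning False asks the backend to stop.
    print(token, end="", flush=True)
    return not stop_requested.is_set()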

woheller69 avatar May 06 '24 04:05 woheller69

I need this for a local model, just in case this makes a difference

woheller69 avatar May 06 '24 07:05 woheller69

It seems there is a PR for llama-cpp-python regarding this: https://github.com/abetlen/llama-cpp-python/pull/733/files

Add cancel() method to interrupt a stream

But they do not want to merge it

There is also an issue: https://github.com/abetlen/llama-cpp-python/issues/599

woheller69 avatar May 06 '24 08:05 woheller69

Call me a madman, but I just use something like this example to end the inference:

for chunk in llm.stream_chat(chat_template):
    if cancel_flag:
        break

pabl-o-ce avatar May 07 '24 01:05 pabl-o-ce

Doesn't that just break the for loop, while the llm continues to stream?

Currently I have:

    llama_cpp_agent.get_chat_response(
        user_input, 
        temperature=0.7, 
        top_k=40, 
        top_p=0.4,
        repeat_penalty=1.18, 
        repeat_last_n=64, 
        max_tokens=2000,
        stream=True,
        print_output=False,
        streaming_callback=streaming_callback
    )

And in the streaming_callback I am printing the tokens as they come. Ideally, this callback could return True/False to continue/stop generation.
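
For illustration, such a callback could look roughly like this; whether the chunk exposes the streamed text as .text, and whether the return value would be honored, are assumptions about a feature llama-cpp-agent did not have at the time of this thread:

stop_requested = False  # set to True when the user presses the stop button

def streaming_callback(chunk):
    print(chunk.text, end="", flush=True)  # .text is an assumption
    return not stop_requested              # proposed: False would stop generation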

woheller69 avatar May 07 '24 03:05 woheller69

Let me create some tests for this.

pabl-o-ce avatar May 08 '24 15:05 pabl-o-ce

In case there is no "clean" solution via llama_cpp_python, I found a solution using a thread_with_exception as in my code https://github.com/woheller69/LLAMA_TK_CHAT/

It starts inference in a separate thread and stops it by raising an exception. But that way the partial answer is not added to the chat history (I am doing this later using add_message(...) in my code) because I call llama_cpp_agent.get_chat_response(...) inside this thread. It certainly would be better if that was handled INSIDE llama_agent.py, maybe in get_chat_response(...) or get_response_role_and_completion(...), so that the partial answer can still be added to history.
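
A rough sketch of that idea, assuming the thread_with_exception helper from the linked repo (reproduced in the next comment) and assuming llama_cpp_agent exposes the add_message(...) mentioned above; names and signatures are illustrative and may differ from the real API:

partial_answer = []

def streaming_callback(chunk):
    partial_answer.append(chunk.text)      # .text is an assumption
    print(chunk.text, end="", flush=True)

def run_inference():
    llama_cpp_agent.get_chat_response(
        user_input,
        stream=True,
        print_output=False,
        streaming_callback=streaming_callback,
    )

inference_thread = thread_with_exception("InferenceThread", run_inference)
inference_thread.start()

# Later, when the user presses the stop button:
inference_thread.raise_exception()
inference_thread.join()
# Keep whatever was generated so far in the chat history (hypothetical call):
llama_cpp_agent.add_message(role="assistant", message="".join(partial_answer))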

If my code doesn't look great, this is because I have no clue about Python :-)

woheller69 avatar May 09 '24 06:05 woheller69

For those interested, here is a minimal adaptation of @woheller69's workaround:

from llama_cpp import Llama
import ctypes
import sys
import threading
import time

# https://github.com/woheller69/LLAMA_TK_CHAT/blob/main/LLAMA_TK_GUI.py
class thread_with_exception(threading.Thread):
    def __init__(self, name, callback):
        threading.Thread.__init__(self)
        self.name = name
        self.callback = callback

    def run(self):
        self.callback()

    def get_id(self):
        # returns id of the respective thread
        if hasattr(self, '_thread_id'):
            return self._thread_id
        for id, thread in threading._active.items():
            if thread is self:
                return id

    def raise_exception(self):
        thread_id = self.get_id()
        if thread_id is not None:
            res = ctypes.pythonapi.PyThreadState_SetAsyncExc(ctypes.c_long(thread_id), ctypes.py_object(SystemExit))
            if res > 1:
                ctypes.pythonapi.PyThreadState_SetAsyncExc(ctypes.c_long(thread_id), 0)

llm = Llama(
    model_path="../../llama.cpp/models/Meta-Llama-3-8B/ggml-model-f16.gguf",
    n_gpu_layers=-1,
    lora_path="../../llama.cpp/models/test/my_lora_1350.bin",
    n_ctx=1024,
)

def generate(prompt):
    for chunk in llm(
        ''.join(prompt),
        max_tokens=100,
        stop=["."],
        echo=False,
        stream=True,
    ):
        yield chunk["choices"][0]["text"]

def inference_callback():
    prompt = "juicing is the act of "

    print(prompt,end='')
    sys.stdout.flush()
    for chunk in generate([prompt]):
        print(chunk,end='')
        sys.stdout.flush()
    print()

inference_thread = thread_with_exception("InferenceThread", inference_callback)
inference_thread.start()

try:
    for i in range(20):
        time.sleep(0.5)
    print("done normally")
except KeyboardInterrupt:
    inference_thread.raise_exception()
    inference_thread.join()
    print("interrupted")

Here we have an inference thread that may be interrupted by the main thread, which is busy doing something else (presumably serving as a webserver, running a GUI window, or similar), though in this case it just sleeps for 10 seconds.

jewser avatar May 11 '24 05:05 jewser

Using LM Studio to run the models works for me. I often stop the generation, edit the AI's mistakes to steer it in the direction I want, save the changes, and then have it continue generating. This works on all models I have tried in the LM Studio app.

42PAL avatar May 16 '24 00:05 42PAL