Add cancel() method to interrupt a stream
Fixes #599.
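To illustrate the idea behind the patch (names here are illustrative, not the exact diff): the stream is wrapped so that a `cancel()` call sets a flag which is checked between tokens.

```python
import threading

class CancellableStream:
    """Illustrative sketch: wraps a token stream and adds cancel().

    `stream` is any generator of chunks, e.g. llm(prompt, stream=True)
    from llama-cpp-python. Cancellation takes effect at the next token
    boundary, since the flag is only checked between yielded chunks.
    """

    def __init__(self, stream):
        self._stream = stream
        self._cancelled = threading.Event()

    def cancel(self):
        # Safe to call from another thread (e.g. a UI handler).
        self._cancelled.set()

    def __iter__(self):
        for chunk in self._stream:
            if self._cancelled.is_set():
                self._stream.close()  # stop the underlying generator
                break
            yield chunk
```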
Thanks for all your work on this project!
Please accept this PR, @abetlen.
Actually, I found an issue with this method: it only cancels after a token has been generated, so if the LLM is slow or gets stuck processing the prompt, it doesn't cancel at all.
We need a better method.
I'm coming back to this because I need to figure out a better method to interrupt the generation programmatically.
For a console-based scenario it's pretty easy in Python: all I have to do is wrap the code in a try/except KeyboardInterrupt block, and then I can press Ctrl+C at any point to gracefully interrupt the LLM.
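For reference, a minimal sketch of that console pattern, assuming the llama-cpp-python streaming API (model path and prompt are placeholders):

```python
from llama_cpp import Llama

llm = Llama(model_path="./model.bin")  # placeholder path

try:
    for chunk in llm("Write a long story.", stream=True, max_tokens=512):
        print(chunk["choices"][0]["text"], end="", flush=True)
except KeyboardInterrupt:
    # Ctrl+C raises KeyboardInterrupt, which unwinds out of the loop;
    # the partial output printed so far is kept.
    print("\n[generation interrupted]")
```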
But if I'm using a front-end user interface, I haven't managed to make it work properly, say with a "Stop generating" button that calls a Python function, because of the issue I mentioned in the previous post.
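For context, this is roughly the pattern I tried (`stop_event` and `send_to_ui` are hypothetical names for the button's flag and the UI callback); it only takes effect between tokens, which is exactly the problem:

```python
import threading
from llama_cpp import Llama

llm = Llama(model_path="./model.bin")  # placeholder path
stop_event = threading.Event()         # set by the "Stop generating" button

def send_to_ui(text: str):
    print(text, end="", flush=True)    # stand-in for a real UI callback

def generate(prompt: str):
    for chunk in llm(prompt, stream=True, max_tokens=512):
        if stop_event.is_set():
            break  # only reached after a token is produced, not during prompt eval
        send_to_ui(chunk["choices"][0]["text"])

worker = threading.Thread(target=generate, args=("Hello,",))
worker.start()
# Later, from the UI thread:
# stop_event.set()  # no effect while prompt processing is still blocking
```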
@abetlen, sorry to bother you again, but do you have any suggestions or ideas on how to accomplish this?
Why not add it now and improve it later if a better solution turns up? For now this would work in most cases.
Has anyone found a reasonable solution for this? Or am I the only one unwilling to wait for the model to finish, short of killing the job and losing context?
Any chance this gets merged for now?
It indeed blocks until the first token is produced, but cancelling after that is trivial. The other, similar issue is cancelling a model that is still loading.
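For example, once tokens are streaming, simply breaking out of the loop (or closing the generator) stops generation, since tokens are produced lazily; a sketch:

```python
from llama_cpp import Llama

llm = Llama(model_path="./model.bin")  # placeholder path

stream = llm("Explain quicksort.", stream=True, max_tokens=256)
for i, chunk in enumerate(stream):
    print(chunk["choices"][0]["text"], end="", flush=True)
    if i >= 10:          # stand-in for any cancellation condition
        stream.close()   # no further tokens are computed after this
        break
```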
The gpt4all Python bindings offer a similar mechanism, which allows stopping generation at the next token.
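If I remember the gpt4all bindings correctly, `generate()` accepts a per-token callback, and returning False from it stops generation at the next token (model name below is illustrative):

```python
from gpt4all import GPT4All

model = GPT4All("orca-mini-3b.ggmlv3.q4_0.bin")  # illustrative model name

stop_requested = False  # would be set by a "Stop generating" handler

def on_token(token_id: int, response: str) -> bool:
    print(response, end="", flush=True)
    return not stop_requested  # False stops generation at the next token

model.generate("Tell me a story.", max_tokens=200, callback=on_token)
```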