
Add way to abort a streaming response from chat / generate

Open · paulrobello opened this issue 1 year ago

I don't see any way to abort a streaming response from the chat / generate methods.

Is this supported, or will it be?

Given the following snippet, how can you properly abort the stream? Simply breaking out of the for chunk loop is not enough.

stream = ollama.chat(
    model=model_name,
    messages=messages,
    stream=True,
)
msg_content = ""
for chunk in stream:
    msg_content += chunk["message"]["content"]

paulrobello avatar Jul 15 '24 18:07 paulrobello

Technically there is no way of doing it, as you can see in that issue on the llama.cpp repository; the "brute-force" way is to interrupt the stream with a KeyboardInterrupt, for example:

try:
    for chunk in stream:
        print(chunk["message"]["content"], end='')
except KeyboardInterrupt:
    pass

However, that approach may cause problems. The main one is that the Ollama server will probably keep generating and sending tokens; since the client-side stream is closed, those chunks are sent to nothing. To better understand what could happen you need some knowledge of either httpx (which _client.py uses to make requests) or Go (which Ollama is written in).

antoninoLorenzo avatar Jul 26 '24 10:07 antoninoLorenzo

Exiting the generator is sufficient to abort the request and stop generation, e.g.

for chunk in stream:
    msg_content += chunk["message"]["content"]
    if should_break():
        break

mxyng avatar Jul 31 '24 00:07 mxyng
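
If break alone does not seem to release the connection, one option is to close the generator explicitly. This is only a sketch, not a documented API of the library: it assumes ollama.chat(..., stream=True) returns an ordinary Python generator wrapping an httpx streaming response, and model_name, messages, and should_break() are the same placeholders as in the snippets above. Calling close() raises GeneratorExit at the generator's current yield point, which should let the client tear down the HTTP response instead of waiting for garbage collection.

import ollama

stream = ollama.chat(
    model=model_name,       # placeholders as in the snippets above
    messages=messages,
    stream=True,
)

msg_content = ""
try:
    for chunk in stream:
        msg_content += chunk["message"]["content"]
        if should_break():  # hypothetical abort condition
            break
finally:
    # Explicitly close the generator: this raises GeneratorExit at its
    # current yield point so the client can clean up the streaming
    # HTTP response rather than leaving it to garbage collection.
    stream.close()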

Hi mxyng, does this solution actually terminate the response generation, as you would expect when using the API directly and hitting Ctrl+C? If not, we break out of the stream, but the model keeps using resources on the machine until its response is complete. I'm not sure about the Python SDK, but the JS one does offer a proper abort method: https://github.com/ollama/ollama-js/blob/main/examples/abort/abort-single-request.ts

WabaScript2 avatar Apr 05 '25 03:04 WabaScript2
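
The thread doesn't show an abort() equivalent in the Python client, but a rough asyncio analogue of the JS AbortController is to run the stream consumption in a task and cancel it. This is only a sketch under the assumption that AsyncClient().chat(..., stream=True) resolves to a standard async generator; model_name and messages are the same placeholders as in the earlier snippets, and the sleep is just illustrative timing.

import asyncio
import ollama

async def consume(stream):
    msg_content = ""
    try:
        async for chunk in stream:
            msg_content += chunk["message"]["content"]
    finally:
        # Make sure the async generator (and the HTTP response behind it)
        # is closed even if this task is cancelled mid-stream.
        await stream.aclose()
    return msg_content

async def main():
    stream = await ollama.AsyncClient().chat(
        model=model_name,   # placeholders as in the earlier snippets
        messages=messages,
        stream=True,
    )
    task = asyncio.create_task(consume(stream))
    await asyncio.sleep(2)  # let some tokens arrive, then abort
    task.cancel()           # rough analogue of the JS AbortController
    try:
        await task
    except asyncio.CancelledError:
        pass

asyncio.run(main())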

Checking in Wireshark, it seems that ollama-python doesn't actually stop the stream after the iterator is abandoned.

It seems that Python intentionally does not close the generator when you break out of the loop - you can still start another loop to continue reading from the stream.

whs avatar Jun 22 '25 19:06 whs
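
A quick way to see that behaviour (a sketch, assuming the stream is an ordinary Python generator; model_name and messages are the same placeholders as above): a second loop over the same stream picks up where the first one stopped, because break only suspends the generator.

stream = ollama.chat(model=model_name, messages=messages, stream=True)

first_part = ""
for chunk in stream:
    first_part += chunk["message"]["content"]
    if len(first_part) > 20:
        break  # the generator is only suspended here, not closed

# A second loop resumes from where the first stopped, because break
# does not close the generator or the HTTP response behind it.
rest = ""
for chunk in stream:
    rest += chunk["message"]["content"]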

I think I've found a workaround.

Simply raise an exception where you'd use break, then catch it outside the async for:

response = await ollama.AsyncClient().chat(...)
try:
    async for chunk in response:
        raise StopIteration  # raise where you would normally break
except StopIteration:
    pass

I can see that RST packets do get sent this way, and the server responds with context canceled.

whs avatar Jun 23 '25 02:06 whs
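
An alternative to raising an exception (again just a sketch, assuming the awaited chat(..., stream=True) result is a standard async generator, with the same placeholders as the earlier snippets): call aclose() on the stream at the point where you want to stop. aclose() throws GeneratorExit into the generator at its current yield, which should trigger the same connection teardown observed above.

import asyncio
import ollama

async def main():
    stream = await ollama.AsyncClient().chat(
        model=model_name,   # placeholders as in the earlier snippets
        messages=messages,
        stream=True,
    )
    msg_content = ""
    async for chunk in stream:
        msg_content += chunk["message"]["content"]
        if should_break():  # hypothetical abort condition
            # Closing the async generator throws GeneratorExit at its
            # current yield point, which should close the underlying
            # streaming request.
            await stream.aclose()
            break

asyncio.run(main())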

I tried the StopIteration example above; it does not work. Creating a task does not work. break does not work. I need a way to kill a chat when context panics occur.

ryamldess avatar Oct 08 '25 22:10 ryamldess