
Python binding streaming is not realtime

Open ParisNeo opened this issue 2 years ago • 5 comments

System Info

Hi! I have a big problem with the gpt4all Python binding. Your generator is not actually generating the text word by word; it first generates everything in the background and only then streams it out word by word. That's bad: I have to wait too long for long outputs, and I can't use my hallucination-suppression system to stop the model when it starts talking to itself, which results in a very slow UI.

Information

  • [ ] The official example notebooks/scripts
  • [ ] My own modified scripts

Related Components

  • [ ] backend
  • [X] bindings
  • [ ] python-bindings
  • [ ] chat-ui
  • [ ] models
  • [ ] circleci
  • [ ] docker
  • [ ] api

Reproduction

Start a conversation that pushes the model to produce a long output. You'll see that it takes a very long time before generation appears to start, and what follows only looks like streaming: if you stop in the middle, you still receive the full text at the end, which shows the model had already generated more than what was displayed. Here is my code:

            output = ""
            for tok in self.model.generate(prompt, 
                                           n_predict=n_predict,                                           
                                            temp=self.config['temperature'],
                                            top_k=self.config['top_k'],
                                            top_p=self.config['top_p'],
                                            repeat_penalty=self.config['repeat_penalty'],
                                            repeat_last_n = self.config['repeat_last_n'],
                                            # n_threads=self.config['n_threads'],
                                            streaming=True,
                                           ):
                output += tok
                if new_text_callback is not None:
                    if not new_text_callback(tok):
                        return output

Expected behavior

Have the words come out as they are generated, which means generation should start immediately and the delay shouldn't depend on the length of the generated text.

ParisNeo avatar May 24 '23 15:05 ParisNeo

Take a peek at issue #568. The generate() method is not a Python generator. Your code is iterating over the characters in the emitted string, not the emitted tokens.
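To illustrate with plain Python (nothing gpt4all-specific here), looping over a string that has already been returned simply walks it one character at a time, which is what the loop in the report ends up doing:

returned_text = "Hello world"   # stand-in for the string generate() returns
for tok in returned_text:
    print(repr(tok))            # 'H', 'e', 'l', 'l', 'o', ' ', ... one character per iteration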

I was able to get at the underlying model's generator by setting my own function as the token generation callback, and then calling myGptModel.model.generate() with my prompt/response sequence having been passed through the prompt templating routine.

It's a fierce hack, but it will work until a more elegant API gets put into the bindings.

handshape avatar May 26 '23 16:05 handshape

@handshape Would you be able to elaborate? Particularly on how you accessed and set the token generation callback function.

I tried following your suggestions, but I'm almost certain I messed something up along the way, as my call to "*.generate(...)" results in a low-level error (see below):

gpt4all-backend/llama.cpp/ggml.c:8694: nb1 <= nb2

I reckon that if I have to touch the lowest levels, I'm already on the wrong path.

MJakobs97 avatar May 27 '23 10:05 MJakobs97

A little demo:

from gpt4all import GPT4All
import sys

model = GPT4All('ggml-mpt-7b-chat')
message = sys.argv[1]
messages = []
print("Prompt: " + message)
messages.append({"role": "user", "content": message})
full_prompt = model._build_prompt(messages, True, True)
response_tokens = []

def local_callback(token_id, response):
    decoded_token = response.decode('utf-8')
    response_tokens.append(decoded_token)

    # Do whatever you want with decoded_token here.

    return True

model.model._response_callback = local_callback
model.model.generate(full_prompt, streaming=False)
response = ''.join(response_tokens)
print("Response: " + response)
messages.append({"role": "assistant", "content": response})

# At this point, you can get another prompt from the user, re-run "_build_prompt()", and continue the conversation.
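
If you want actual generator semantics on top of this hack, a rough (untested) sketch is to run the low-level generate() call on a worker thread and yield tokens from a queue as the callback receives them. It assumes the same internals as the demo above (model.model._response_callback and model.model.generate); treat it as a sketch, not an official API.

import queue
import threading

def stream_tokens(model, full_prompt):
    """Yield decoded tokens as the backend produces them (sketch, untested)."""
    token_queue = queue.Queue()
    done = object()  # sentinel marking the end of generation

    def on_token(token_id, response):
        token_queue.put(response.decode('utf-8'))
        return True  # keep generating

    def worker():
        model.model._response_callback = on_token
        model.model.generate(full_prompt, streaming=False)
        token_queue.put(done)

    threading.Thread(target=worker, daemon=True).start()

    while True:
        item = token_queue.get()
        if item is done:
            return
        yield item

# Usage, reusing full_prompt from the demo above:
# for tok in stream_tokens(model, full_prompt):
#     print(tok, end='', flush=True)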

handshape avatar May 27 '23 21:05 handshape

That's way more straightforward than my tinkering with the underlying ctypes functions. Returning tokens is way closer to "realtime" now (ignoring the initial startup time).

MJakobs97 avatar May 28 '23 19:05 MJakobs97


Thank you very much. Now it is updated and working fine in my UI: https://github.com/ParisNeo/gpt4all-ui

Just a little question: your callback function returns True. If I return False, would that stop the generation process? I use this a lot as a way to stop generation in other bindings.
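
Something like the following is the idea, assuming a False return is honored as a stop signal (which is exactly what I'm asking, so treat it as an untested sketch); stop_requested is just a placeholder for a flag the UI would set from a Stop button:

# Sketch only: assumes generation stops when the response callback returns False.
stop_requested = False
response_tokens = []

def local_callback(token_id, response):
    response_tokens.append(response.decode('utf-8'))
    return not stop_requested  # False would ask the backend to stop generating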

For my LangChain integration, I need each binding to give me access to tokenize and detokenize functions. Is that available on your side?

If not, can I make a feature request?

ParisNeo avatar May 29 '23 14:05 ParisNeo

Seems like this issue has been solved!

Please always feel free to open more issues as needed.

niansa avatar Aug 14 '23 11:08 niansa