gpt4all
Python binding streaming is not realtime
System Info
Hi! I have a big problem with the gpt4all Python binding. Your generator is not actually yielding the text word by word: it first generates everything in the background and only then streams it word by word. That's bad, because I have to wait too long for long outputs, and I can't use my hallucination-suppression system to stop the model when it starts talking to itself, which results in a very slow UI.
Information
- [ ] The official example notebooks/scripts
- [ ] My own modified scripts
Related Components
- [ ] backend
- [X] bindings
- [ ] python-bindings
- [ ] chat-ui
- [ ] models
- [ ] circleci
- [ ] docker
- [ ] api
Reproduction
Start a conversation that leads the model to produce a long reply. You'll see that it takes a very long time before the first token appears, and then the output arrives in what looks like a stream but isn't: the model has already generated more text than what you are seeing, because if you stop in the middle you still recover the full text at the end. Here is my code:
output = ""
for tok in self.model.generate(prompt,
n_predict=n_predict,
temp=self.config['temperature'],
top_k=self.config['top_k'],
top_p=self.config['top_p'],
repeat_penalty=self.config['repeat_penalty'],
repeat_last_n = self.config['repeat_last_n'],
# n_threads=self.config['n_threads'],
streaming=True,
):
output += tok
if new_text_callback is not None:
if not new_text_callback(tok):
return output
Expected behavior
Have the words come out as they are generated, which means the streaming should start immediately and the time to the first token shouldn't depend on the length of the generated text.
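For reference, here is a quick way to see that the stream isn't real time (a rough sketch; the prompt, n_predict value, and timing print are just for illustration):

import time
from gpt4all import GPT4All

model = GPT4All('ggml-mpt-7b-chat')

start = time.time()
for tok in model.generate("Tell me a very long story.",
                          n_predict=512,
                          streaming=True):
    # If the stream were real time, these timestamps would be spread out
    # over the whole generation; instead the first one only appears after
    # a long delay that grows with the length of the reply.
    print(f"{time.time() - start:6.2f}s {tok!r}")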
Take a peek at issue #568. The generate() method is not a Python generator. Your code is iterating over the characters in the emitted string, not the emitted tokens.
I was able to get at the underlying model's generator by setting my own function as the token generation callback, and then calling myGptModel.model.generate() with my prompt/response sequence having been passed through the prompt templating routine.
It's a fierce hack, but it will work until a more elegant API gets put into the bindings.
@handshape Would you be able to elaborate? Particularly on how you accessed and set the token generation callback function.
I tried following your suggestions, but I'm fairly certain I messed up something along the way, as my call to "*.generate(...)" results in a low-level error (see below):
gpt4all-backend/llama.cpp/ggml.c:8694: nb1 <= nb2
I reckon that if I have to touch the lowest levels, I am already on the wrong path.
A little demo:
from gpt4all import GPT4All
import sys

model = GPT4All('ggml-mpt-7b-chat')

message = sys.argv[1]
messages = []

print("Prompt: " + message)
messages.append({"role": "user", "content": message})

full_prompt = model._build_prompt(messages, True, True)

response_tokens = []

def local_callback(token_id, response):
    decoded_token = response.decode('utf-8')
    response_tokens.append(decoded_token)
    # Do whatever you want with decoded_token here.
    return True

model.model._response_callback = local_callback
model.model.generate(full_prompt, streaming=False)

response = ''.join(response_tokens)
print("Response: " + response)

messages.append({"role": "assistant", "content": response})
# At this point, you can get another prompt from the user, re-run "_build_prompt()", and continue the conversation.
That's way more straightforward than my tinkering with the underlying ctypes functions. Returning tokens is way closer to "realtime" now (ignoring the initial startup time).
Thank you very much. Now it is updated and it is working fine in my UI: https://github.com/ParisNeo/gpt4all-ui
Just a little question: your callback function returns True. If I return False, would that stop the generation process? I use this a lot as a way to stop generation in other bindings.
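Something like the sketch below is what I have in mind (rough code, assuming returning False really does abort the low-level generation loop; the anti-prompt string is only an example of my hallucination-suppression check):

antiprompt = "### Human:"  # example marker of the model starting to talk to itself
collected = []

def stopping_callback(token_id, response):
    decoded_token = response.decode('utf-8')
    collected.append(decoded_token)
    # Abort generation as soon as the anti-prompt appears in the output so far.
    if antiprompt in ''.join(collected):
        return False
    return True

model.model._response_callback = stopping_callback
model.model.generate(full_prompt, streaming=False)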
For my Langchain integration, I need each binding to give me access to tokenize and detokenize functions. Is that available on your side?
If not, can I make a feature request?
Seems like this issue has been solved!
Please always feel free to open more issues as needed.