anything-llm
Incomplete response from LM Studio endpoint
I am getting incomplete responses while using the LM Studio endpoint. The response cuts off midway while streaming, sometimes after the first word or after half a sentence. I am running on Docker.
Hi @Daniel-Dan-Espinoza, which model are you using in LM Studio? This typically happens when models that are less optimized for chatting are being used. Also, are you running LM Studio locally on the same machine as your AnythingLLM docker container?
I was using the Starling model. I am running both LM Studio and the AnythingLLM Docker container on the same machine.
I tried LocalAI but have the same issue.
When interacting with LM Studio, we leave the entire inference run on the LM Studio side. We simply pass along the input and wait for LM Studio to finish producing output.
While inference is running, it's likely that the output sent to AnythingLLM is not being dropped, but rather that LM Studio stops generating output and AnythingLLM assumes the response is done.
Can you confirm whether the model is still generating a response when AnythingLLM says the response is complete? That would help determine whether the issue is with the model/config on LM Studio or with AnythingLLM.
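For reference, here is a minimal sketch of what consuming LM Studio's OpenAI-compatible streaming endpoint looks like from a client's point of view (the URL, model id, and prompt are placeholders, not AnythingLLM's actual code). The only reliable end-of-response markers are the `data: [DONE]` sentinel and the server closing the connection; if the server simply stops emitting chunks, the partial text is indistinguishable from a finished answer.

```ts
// Sketch of a client reading an OpenAI-compatible streaming response,
// which is the protocol LM Studio's local server speaks. This is an
// illustration, not AnythingLLM's actual implementation; the endpoint
// and model id below are placeholders.
async function streamChat(prompt: string): Promise<string> {
  const res = await fetch("http://localhost:1234/v1/chat/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "local-model", // placeholder; LM Studio serves whatever model is loaded
      messages: [{ role: "user", content: prompt }],
      stream: true, // ask for server-sent-event chunks instead of one JSON body
    }),
  });

  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = "";
  let text = "";

  while (true) {
    const { value, done } = await reader.read();
    if (done) break; // server closed the connection
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split("\n");
    buffer = lines.pop() ?? ""; // keep any partial line for the next read
    for (const line of lines) {
      if (!line.startsWith("data: ")) continue; // skip blanks and keep-alives
      const payload = line.slice(6).trim();
      if (payload === "[DONE]") return text; // explicit end-of-stream sentinel
      const delta = JSON.parse(payload).choices?.[0]?.delta?.content;
      if (delta) text += delta;
    }
  }
  // If the server stops sending chunks without [DONE], we fall through to
  // here and the truncated text looks like a complete response to the caller.
  return text;
}
```

So the question above really comes down to which of those two exits the client is taking when the response cuts off.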
@timothycarambat this is the streaming bug fix for LocalAI we added. This is the fix working, but we need to learn why it's dropping the packets.
@lunamidori5 We would need to confirm that the user is running the patched version, and if so, then yes, we definitely need to see why. To be fair, I have yet to replicate this issue with LocalAI (or LM Studio, for that matter).
@timothycarambat at least I'm not the only one with this bug (I am starting to think it may be the way some routers work...).
Closing as stale
I am seeing this problem using the latest version of AnythingLLM (0.2.0?). I saw it when using LM Studio, but then it seemed to clear up on its own, or maybe it was after I reset the chat in AnythingLLM. Then I got the empty-content complaint from LM Studio, and I decided enough is enough and switched to Kobold. Now I am seeing the one-token problem using Kobold, via the LocalAI LLM setting in AnythingLLM (chat model selection, which I can't seem to copy from the form, sigh: koboldcpp/dolphin-2.2.1-mistral-7b.Q5_K_S). Resetting the chat history doesn't help.
I'm fine with disabling the streaming mode, for now. I don't see any way to do that, either in AnythingLLM or Kobold.
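If you just want to rule streaming in or out, you can hit the same OpenAI-compatible endpoint by hand with `stream: false`; the whole reply then comes back as a single JSON body. This is a manual check, not a setting AnythingLLM exposes; the port and model id below are assumptions (LM Studio defaults to 1234, KoboldCpp's OpenAI-compatible API is usually on 5001).

```ts
// Manual check of the non-streaming path against a local
// OpenAI-compatible server. Adjust the port for your setup:
// LM Studio defaults to 1234; KoboldCpp is usually 5001.
async function nonStreamingChat(prompt: string): Promise<string> {
  const res = await fetch("http://localhost:1234/v1/chat/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "local-model", // placeholder model id
      messages: [{ role: "user", content: prompt }],
      stream: false, // return one complete JSON body, no SSE chunks
    }),
  });
  const data = await res.json();
  // finish_reason says why generation ended: "stop", "length", etc.
  console.log("finish_reason:", data.choices[0].finish_reason);
  return data.choices[0].message.content;
}
```

If the full answer comes back this way but gets truncated over streaming, the problem is in the stream handling rather than the model.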
Looks like this might be a relevant issue: https://github.com/LostRuins/koboldcpp/issues/669 So it may be a bug on the Kobold side.
@timothycarambat could you add a "no streaming" checkbox to the LLM screen?
Is it because there is an issue with streaming or because certain models do not support it?
With Kobold, I was seeing the whole stream of tokens being generated, so clearly the model supports streaming and Kobold supports streaming, but on the AnythingLLM side the response was already considered done after the first token came in. So it seems like a timeout issue: not waiting long enough for the next token? It seems like there should be a special signal that the stream is finished, because otherwise how would anyone know?
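There is such a signal in the OpenAI-style streaming protocol these local servers emulate: the final content chunk carries a `finish_reason` ("stop", "length", ...), and the stream is closed with a literal `data: [DONE]` line. Whether a given Kobold build emits those markers correctly is exactly what the linked issue is about. A rough sketch of the distinction, assuming the standard chat-completions chunk format:

```ts
// Classify one OpenAI-style SSE line as either a token delta or an
// end-of-stream signal. Field names follow the standard chat-completions
// streaming format; this is an illustration, not AnythingLLM's code.
type SseEvent =
  | { kind: "delta"; text: string }
  | { kind: "end"; reason: string };

function parseSseLine(line: string): SseEvent | null {
  if (!line.startsWith("data: ")) return null; // ignore blanks and keep-alives
  const payload = line.slice(6).trim();
  if (payload === "[DONE]") return { kind: "end", reason: "done-sentinel" };
  const choice = JSON.parse(payload).choices?.[0];
  if (choice?.finish_reason) {
    return { kind: "end", reason: choice.finish_reason }; // "stop", "length", ...
  }
  return { kind: "delta", text: choice?.delta?.content ?? "" };
}
```

A client that ends the message on anything other than one of those two signals (for example, on a pause between tokens) would show exactly the "done after the first token" behaviour described above.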
LocalAI and Google Gemini still have that streaming bug from before...