langchainjs
Streaming causes LLM to always start answers with a rephrased version of the question
Describe the issue
When I enable streaming on the OpenAI model, it causes all answers to begin with a rephrased version of the question.
Example:
Question from user: "How far is the sun?"
Answer streamed from handleLLMNewToken: "What is the distance from Earth to the sun? I don't know."
It's worth noting that the answer streamed token-by-token by handleLLMNewToken is different from the response returned by await chain.call. The latter returns { text: " I don't know." }, which is the desired behavior. The problem is that this value can't be streamed. As far as I know, the streaming needs to happen from inside handleLLMNewToken, like this:
// Imports assumed for this snippet; exact paths differ between early 0.0.x releases
// (e.g. 'langchain/llms' vs 'langchain/llms/openai'):
import { OpenAI } from 'langchain/llms/openai';
import { CallbackManager } from 'langchain/callbacks';

const sendData = (data: string) => {
  res.write(`data: ${data}\n\n`);
};

const model = new OpenAI({
  openAIApiKey: process.env.OPENAI_API_KEY,
  streaming: true,
  callbackManager: CallbackManager.fromHandlers({
    async handleLLMNewToken(token: string) {
      console.log('handleLLMNewToken', token);
      sendData(JSON.stringify({ data: token })); // stream each token
    },
    async handleLLMStart(llm: any, prompts: string[]) {
      console.log('handleLLMStart');
    },
  }),
});
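For completeness, the sendData helper above assumes the response has already been set up as a Server-Sent Events stream. A minimal sketch of that setup, assuming a Next.js pages API route (i.e. a Node ServerResponse named res); the helper name is illustrative:

import type { NextApiResponse } from 'next';

// Hypothetical helper: sets the headers an SSE stream needs before the
// handleLLMNewToken callback starts writing `data:` frames via sendData.
function initSSE(res: NextApiResponse) {
  res.writeHead(200, {
    'Content-Type': 'text/event-stream',
    'Cache-Control': 'no-cache, no-transform',
    Connection: 'keep-alive',
  });
}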
Environment
"langchain": "^0.0.51", "next": "13.3.0",
I'm getting this as well for the second, third, etc. questions that I ask. I was trying to follow the lex-gpt repo as a guide for edge streaming.
This seems to be due to using ConversationalRetrievalQAChain and chat history. handleLLMNewToken and handleLLMEnd are called twice, breaking streaming. Things seem to be working fine when I use RetrievalQAChain.
Related #842
happens here too.
Having this issue as well after updating to 0.0.61. Looks like handleLLMNewToken, when passed to the OpenAI LLM callbacks, is returning the tokens from the "standalone question", not the final result.
Getting the same issue
Related #603
I assume you're using the conversational QA chain. From the docs:
"It first combines the chat history and the question into a standalone question, then looks up relevant documents from the retriever, and then passes those documents and the question to a question answering chain to return a response."
So I guess it's prompting the LLM to get a summary question that's then fed (with the docs) to the LLM again?
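To make that flow concrete, here's a rough sketch of the two LLM calls a conversational retrieval chain makes, which is why the standalone question shows up in the token stream. This is not the actual library internals; all names below are illustrative:

// Illustrative only: both calls go through the same (streaming) LLM, so
// handleLLMNewToken fires for the rephrased question and for the answer.
async function conversationalRetrievalFlow(
  question: string,
  chatHistory: string[],
  callLLM: (prompt: string) => Promise<string>,   // the streaming LLM
  retrieve: (query: string) => Promise<string[]>  // the retriever
): Promise<string> {
  // Call 1: condense chat history + question into a standalone question.
  // These are the first tokens you see ("What is the distance from Earth to the sun?").
  const standaloneQuestion = await callLLM(
    `Rephrase as a standalone question.\nHistory:\n${chatHistory.join('\n')}\nQuestion: ${question}`
  );

  // Retrieval step: no LLM tokens are produced here.
  const docs = await retrieve(standaloneQuestion);

  // Call 2: answer over the retrieved documents. These are the tokens you actually
  // want to stream to the client ("I don't know." in the example above).
  return callLLM(`Context:\n${docs.join('\n')}\n\nQuestion: ${standaloneQuestion}\nAnswer:`);
}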
Workaround I've been using:
// Imports assumed (paths may vary across 0.0.x releases):
import { BaseCallbackHandler } from 'langchain/callbacks';
import { LLMResult } from 'langchain/schema';

// `chatHistory` is my app's chat-history collection, defined elsewhere.
class EventsHandler extends BaseCallbackHandler {
  name = 'EventsHandler';

  private writer: WritableStreamDefaultWriter<string>;
  private stream: WritableStream<string>;
  private question: string;
  private LLMCount = 0;

  constructor({ stream, question }: { stream: WritableStream<string>; question: string }) {
    super();
    this.writer = stream.getWriter();
    this.stream = stream;
    this.question = question;
  }

  // The first LLM call (when there is chat history) is the standalone-question
  // rephrasing, so its tokens should not be streamed to the client.
  isContextQuestion() {
    return this.LLMCount === 0 && chatHistory.size > 0;
  }

  async handleLLMNewToken(token: string) {
    if (!this.isContextQuestion()) {
      await this.writer.ready;
      return this.writer.write(token);
    }
  }

  async handleLLMEnd(output: LLMResult) {
    const result = output?.generations?.[0]?.[0]?.text; // final text of this LLM call (unused here)
    this.LLMCount = this.LLMCount + 1;
  }

  async handleLLMError(err: unknown) {
    this.writer.releaseLock();
    await this.stream.abort((err as Error).message);
  }
}
Can you elaborate on how to implement this workaround? I haven't been successful yet.
We're looking into the best way of solving this; we should have more info here soon.
The immediate fix for this issue is to do the following:
// construct your chain as before
const chain = ConversationalRetrievalQAChain.fromLLM(new ChatOpenAI({streaming: true, ...}), ...)
// after creating the chain override the LLM in the inner `questionGeneratorChain`
chain.questionGeneratorChain.llm = new ChatOpenAI()
// use the chain
We're working on a better solution
^ To elaborate on the above explanation, the key is to ensure that the LLM behind the questionGeneratorChain has streaming set to false, and that the LLM you pass into fromLLM has a handler that implements handleLLMNewToken (passed in as part of callbacks).
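Putting those two pieces together, a sketch of the full workaround might look like the following. Import paths and option names vary between 0.0.x releases, and vectorStore stands in for whatever store/retriever you already have:

import { ChatOpenAI } from 'langchain/chat_models/openai';
import { ConversationalRetrievalQAChain } from 'langchain/chains';
import { BaseCallbackHandler } from 'langchain/callbacks';

// Assumed to exist already, e.g. an HNSWLib or Pinecone vector store.
declare const vectorStore: { asRetriever(): any };

// Handler that receives tokens only from the streaming (answer) model.
const tokenHandler = BaseCallbackHandler.fromMethods({
  handleLLMNewToken(token: string) {
    process.stdout.write(token); // or write to your SSE/WebSocket stream
  },
});

// Streaming model used for the final answer, with the token handler attached.
const streamingModel = new ChatOpenAI({ streaming: true, callbacks: [tokenHandler] });

const chain = ConversationalRetrievalQAChain.fromLLM(
  streamingModel,
  vectorStore.asRetriever()
);

// Override the inner question generator with a non-streaming model, so the
// rephrased standalone question never reaches handleLLMNewToken.
chain.questionGeneratorChain.llm = new ChatOpenAI({ streaming: false });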
@nfcampos I'm assuming this workaround won't work in the context of an agent executor (without reimplementing a lot), since it seems to use the same LLM for each tool call? I was able to put together a hacky filter to get the final output tokens from the callback, for anyone interested:
import { BaseCallbackHandler } from 'langchain/callbacks';
import { EventController } from './event_controller';

// Buffers streamed tokens and only publishes the ones inside the agent's
// `"action_input"` for the "Final Answer" action, so tokens from intermediate
// tool calls are filtered out.
export class CallbackHandler extends BaseCallbackHandler {
  name = 'CallbackHandler';

  stream = new EventController<string>();

  private _buffer = '';
  private _isInsideActionInput = false;
  private _hasFoundFinalAnswer = false;
  private readonly _finalAnswerString = '"action": "Final Answer"';
  private readonly _outputString = '"action_input": "';

  handleLLMEnd() {
    if (this._hasFoundFinalAnswer) {
      this.stream.end();
    }
  }

  handleAgentEnd() {
    this.stream.end();
  }

  handleLLMNewToken(token: string) {
    this._buffer += token;
    if (this._isInsideActionInput) {
      this._streamBuffer();
    } else if (!this._hasFoundFinalAnswer) {
      this._findFinalAnswer();
    } else {
      this._findOutput();
    }
  }

  // Once the final-answer marker has been seen, look for the start of `"action_input": "`.
  private _findOutput() {
    const startIndex = this._buffer.indexOf(this._outputString);
    if (startIndex !== -1) {
      this._isInsideActionInput = true;
      this._buffer = this._buffer.slice(startIndex + this._outputString.length);
    } else {
      // Keep only a tail long enough to match a marker that is split across tokens.
      this._buffer = this._buffer.slice(-this._outputString.length + 1);
    }
  }

  // Look for the `"action": "Final Answer"` marker in the buffered tokens.
  private _findFinalAnswer() {
    const isFinalIndex = this._buffer.indexOf(this._finalAnswerString);
    if (isFinalIndex !== -1) {
      this._hasFoundFinalAnswer = true;
      this._buffer = '';
    } else {
      this._buffer = this._buffer.slice(-this._finalAnswerString.length + 1);
    }
  }

  // Publish buffered text up to the closing (unescaped) quote of the action input.
  private _streamBuffer() {
    const endIndex = this._buffer.search(/(?<!\\)"/);
    if (endIndex !== -1) {
      const value = this._buffer.slice(0, endIndex);
      this.stream.publish(value);
      this._isInsideActionInput = false;
      this._buffer = this._buffer.slice(endIndex + 1);
    } else {
      this.stream.publish(this._buffer);
      this._buffer = '';
    }
  }
}
Seeing the same issue; I applied the workaround and it seems to work. But since it uses a non-streaming OpenAI client, it triggers the error Refused to set unsafe header "User-Agent" for me (I'm working on an Electron app that calls OpenAI directly). The streaming configuration does not trigger this error.
In the chain's handlers, I differentiate the source of tokens based on the chain's name to determine which tokens should be sent to the client. Below is an example:
// `subscriber` comes from my own streaming setup (e.g. an observable/SSE subscriber).
let isStuffDocumentsChain = false

const handlers = BaseCallbackHandler.fromMethods({
  handleChainStart(chain) {
    // Only the combine-documents chain produces the final answer tokens.
    if (chain.name === 'stuff_documents_chain') {
      isStuffDocumentsChain = true
    }
  },
  handleLLMNewToken(token) {
    if (isStuffDocumentsChain) {
      subscriber.next(token)
    }
  },
  handleChainEnd() {
    isStuffDocumentsChain = false
  },
})
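For context, a handler built this way could then be passed when invoking the chain. This is a sketch that assumes chain is a ConversationalRetrievalQAChain as above and that your version supports passing callbacks as the second argument of call:

const question = 'How far is the sun?'
await chain.call({ question, chat_history: [] }, [handlers])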
Any update on this one? It's still giving me rephrased questions.
If you're looking to stream the final result using the ConversationalRetrievalQAChain() along with Pinecone, I've set up a deployed example that demonstrates this functionality. You can check it out here: Example.
GitHub code: Code.
Furthermore, if you encounter any difficulties while working with the ConversationalRetrievalQAChain(), please provide the relevant repository or complete context that you're working with. This will help quickly identify and address any bugs or issues you might be facing, ensuring a more efficient debugging process.
Hi, @jacob-ruiz! I'm Dosu, and I'm helping the langchainjs team manage their backlog. I wanted to let you know that we are marking this issue as stale.
From what I understand, the issue you reported is about enabling streaming on the OpenAI model, which causes all answers to start with a rephrased version of the question. Users have reported this issue and provided workarounds, such as overriding the LLM in the inner questionGeneratorChain. The maintainers have acknowledged the issue and provided a temporary fix by overriding the LLM. They are also actively working on a better solution.
Now, I'd like to ask if this issue is still relevant to the latest version of the langchainjs repository. If it is, please let the langchainjs team know by commenting on the issue. Otherwise, feel free to close the issue yourself or it will be automatically closed in 7 days.
Thank you for your contribution to langchainjs! Let me know if there's anything else I can assist you with.