Implement a streaming text splitter
Currently, the TextSplitter interface only allows for splitting text into fixed-size chunks and returning the entire list before any queries are run. It would be great if we could split text input dynamically to give each query step as large a context window as possible; this would make tasks like summarization much more efficient.
It should estimate the number of available tokens for each query and ask the TextSplitter to yield the next chunk of at most X tokens.
Implementing such a TextSplitter seems pretty easy (I can contribute it); the harder problem is integrating support for it in various places in the library.
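Here is a rough sketch of the interface I'm imagining; the class and method names are just placeholders, not anything that exists in the library today:

```python
from typing import Callable, Iterator, List


class StreamingTextSplitter:
    """Hypothetical splitter that yields chunks on demand instead of
    returning the whole list up front."""

    def __init__(self, length_function: Callable[[str], int] = len):
        # length_function would normally count tokens, e.g. via tiktoken
        self._length = length_function

    def iter_chunks(self, text: str, max_tokens_fn: Callable[[], int]) -> Iterator[str]:
        # Naive whitespace splitting for illustration; a real version would
        # reuse the separator/sentence logic of the existing splitters.
        words = text.split()
        i = 0
        while i < len(words):
            budget = max_tokens_fn()  # asked fresh for every chunk
            chunk: List[str] = []
            while i < len(words):
                candidate = " ".join(chunk + [words[i]])
                if chunk and self._length(candidate) > budget:
                    break
                chunk.append(words[i])
                i += 1
            yield " ".join(chunk)
```

The key difference from the current interface is that the token budget is requested per chunk, so a summarization chain could pass in whatever room is left after the prompt template and the running summary.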
Is there a consistent streaming iterable, or something similar, for yielding generations that are in progress? I saw there is a manager class, and some methods return generators, but I haven't figured out a pattern yet. I've been trying to stream generated content as it comes in from the new Turbo chat interfaces, and I've made some progress, but I haven't worked out how to do sequential chains using a yield statement yet. I'm new here and might have missed something; it's great to see the new chat interfaces coming out so fast.
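For reference, this is the rough pattern I've been playing with: a callback handler that pushes tokens onto a queue, and a plain generator that drains it while the chain runs in a background thread. The `on_llm_new_token` / `on_llm_end` hooks are the ones on `BaseCallbackHandler`; how the handler gets attached to the model or chain is version-dependent, so treat the wiring as an assumption rather than an official pattern:

```python
import queue
import threading
from typing import Iterator

from langchain.callbacks.base import BaseCallbackHandler


class QueueCallbackHandler(BaseCallbackHandler):
    """Push streamed tokens onto a queue so a generator can yield them."""

    def __init__(self) -> None:
        self.tokens: "queue.Queue" = queue.Queue()

    def on_llm_new_token(self, token: str, **kwargs) -> None:
        self.tokens.put(token)

    def on_llm_end(self, response, **kwargs) -> None:
        self.tokens.put(None)  # sentinel: generation finished


def stream_tokens(run_chain, handler: QueueCallbackHandler) -> Iterator[str]:
    """Run `run_chain` (any zero-argument callable that triggers the LLM call
    with `handler` attached) in a background thread and yield tokens live."""
    thread = threading.Thread(target=run_chain)
    thread.start()
    while True:
        token = handler.tokens.get()
        if token is None:
            break
        yield token
    thread.join()
```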
I think you should open a separate issue for that.
Yes, I thought iterating over the output as chunks might be related to your issue, but I suppose the output interfaces will probably remain different from the input ones. That said, if the input and output shared a common iterator pattern, it might be easier to understand and work with both, if that makes sense.
By the way, you said the "TextSplitter interface only allows for splitting text into fixed-size chunks", but there are others like NLTKTextSplitter that split into sentences of flexible size under a maximum bound, so you could set the bound near the largest size the model can take. Does that help? I realise it might still not be "as large a context window as possible". It would be nice to figure all of that out automatically, since tokeniser limits and character length limits are different and need some logic to predict upper bounds and integrate them. I haven't looked at the existing summariser methods yet; maybe they already do some of this, but if so I haven't seen it while reading about the splitters. I guess we'll be querying the code with language models soon anyway :)
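Something along these lines is what I mean (untested, and I'm assuming the splitters still accept a custom `length_function` so sizes can be measured in model tokens rather than characters):

```python
import tiktoken
from langchain.text_splitter import NLTKTextSplitter

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")


def token_len(text: str) -> int:
    return len(enc.encode(text))


# Split on sentence boundaries, but bound chunk size in model tokens.
# 3000 is a guess at "near the largest size the model can take".
splitter = NLTKTextSplitter(
    chunk_size=3000,
    chunk_overlap=0,
    length_function=token_len,
)
chunks = splitter.split_text(long_document)  # long_document: your input text
```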
You don't know the upper bound in advance. Maybe your prompt template is 200 tokens and you can use a 3.5k-token chunk, or maybe the template + answer + intermediate result takes 2k tokens, so any chunk larger than 2k tokens will produce an error.
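To make that concrete (all numbers illustrative):

```python
# Assume a 4k-context model; none of these numbers are known in advance.
model_context = 4096
prompt_template = 200        # tokens used by the fixed template
reserved_for_answer = 500    # tokens left for the model's reply

chunk_budget = model_context - prompt_template - reserved_for_answer
# -> 3396 tokens available for the chunk

# But if an intermediate result (e.g. a running summary) grows to 2000 tokens:
intermediate_result = 2000
chunk_budget = model_context - prompt_template - reserved_for_answer - intermediate_result
# -> only 1396 tokens, so a fixed 3.5k-token chunk would overflow
```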
Hi, @not-poma! I'm Dosu, and I'm here to help the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.
From what I understand, you requested a streaming text splitter that can dynamically split text input into chunks of a specified size. You mentioned that you are willing to contribute the implementation, but there may be challenges in integrating the support for it in various places in the library.
There has been a discussion in the comments about the possibility of having a common iterator pattern for input and output, as well as considerations about the upper bounds of token length. However, the issue remains unresolved.
Could you please let us know if this issue is still relevant to the latest version of the LangChain repository? If it is, please comment on the issue to let us know. Otherwise, feel free to close the issue yourself or it will be automatically closed in 7 days.
Thank you for your understanding and contribution to the LangChain project!