text-splitter
Balanced Chunks
Motivation:
Right now, chunking uses a greedy algorithm. For example, the following input produces these chunks:
let text = "Sentence 1. Sentence 2. Sentence 3. Sentence 4.";
splitter.chunks(text, 36) == ["Sentence 1. Sentence 2. Sentence 3.", "Sentence 4."]
This may not always be desirable, since it can leave "orphaned" elements at the end.
In some cases it would be better to recognize that, at this semantic level, a more even split exists:
["Sentence 1. Sentence 2.", "Sentence 3. Sentence 4."]
Finding this split is not straightforward in all cases. I attempted it in the past, but that attempt led to the algorithm generating several more chunks rather than finding the best split point. Because tokenization isn't always predictable, some allowance for extra chunks may be needed, but ideally we can find good split points within the current number of chunks.
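One possible approach, sketched here only as an illustration: run the greedy pass once to learn the chunk count, then search for the smallest capacity that still yields that same count and re-run the greedy pass with it. This is not the crate's implementation. `greedy_chunks` and `balanced_chunks` are hypothetical helpers, the input is assumed to be pre-split into non-empty sentences, and size is measured in characters. With a tokenizer the size is less predictable, so the same-count guarantee may not carry over.

```rust
/// Hypothetical helper: greedily packs `parts` into chunks of at most
/// `capacity` characters, joining parts with a single space. A part longer
/// than `capacity` becomes its own chunk.
fn greedy_chunks(parts: &[&str], capacity: usize) -> Vec<String> {
    let mut chunks: Vec<String> = Vec::new();
    let mut current = String::new();
    let mut current_len = 0; // length of `current` in chars
    for part in parts {
        let part_len = part.chars().count();
        if current.is_empty() {
            // Start a new chunk, even if the part alone exceeds capacity.
            current.push_str(part);
            current_len = part_len;
        } else if current_len + 1 + part_len <= capacity {
            // The part (plus a joining space) still fits.
            current.push(' ');
            current.push_str(part);
            current_len += 1 + part_len;
        } else {
            // Close the current chunk and start the next one.
            chunks.push(std::mem::take(&mut current));
            current.push_str(part);
            current_len = part_len;
        }
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}

/// Hypothetical helper: keeps the greedy chunk count but evens out chunk
/// sizes by binary-searching the smallest capacity that does not increase
/// the count. This relies on the greedy count never increasing as the
/// capacity grows, which holds for character counts.
fn balanced_chunks(parts: &[&str], capacity: usize) -> Vec<String> {
    let target = greedy_chunks(parts, capacity).len();
    let (mut lo, mut hi) = (1, capacity);
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        if greedy_chunks(parts, mid).len() <= target {
            hi = mid; // `mid` is large enough; try something smaller
        } else {
            lo = mid + 1; // `mid` produces extra chunks; go larger
        }
    }
    greedy_chunks(parts, lo)
}
```

For the four sentences above with a capacity of 36, `balanced_chunks` settles on an effective capacity of 23 and returns `["Sentence 1. Sentence 2.", "Sentence 3. Sentence 4."]`, while `greedy_chunks` reproduces the current three-plus-one split. The search re-runs the greedy pass O(log capacity) times, which may be too costly when sizing requires tokenization.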
Todo:
- [ ] `TextSplitter` and `MarkdownSplitter` should both have an opt-in method of enabling balanced chunking (this behavior may be better in all scenarios, but that is unclear, and we probably need to pore over the snapshot diffs to see how much of a difference it makes).
- [ ] Ideally, the number of chunks generated stays the same and only the split points move around. Some tolerance is allowable, but we should strive to keep it minimal.