Split text by Tokens
It might be nice to add the ability to split text by tokens. LLMs currently have context limits; for example, GPT-4 allows a maximum of just 8192 tokens. For larger texts it would be useful to split the input according to the model's token limit. LangChain has a RecursiveCharacterTextSplitter function, but it splits by characters rather than tokens, which is less useful for LLM text splitting.
For example, with a C# tiktoken implementation we can split text as shown below; it might be nice to include a similar function in Semantic Kernel.
public static List<string> SplitText(this string text, int maxTokens = 1024)
{
    // Tokenize the whole input with the cl100k_base encoding (used by GPT-3.5/GPT-4 models).
    var encoding = GptEncoding.GetEncoding("cl100k_base");
    var tokenizedText = encoding.Encode(text.Trim());

    var chunks = new List<string>();
    var currentChunk = new List<int>();
    int currentLength = 0;

    // Accumulate tokens until the chunk reaches the maximum size,
    // then decode it back to text and start a new chunk.
    foreach (var token in tokenizedText)
    {
        currentChunk.Add(token);
        currentLength++;

        if (currentLength >= maxTokens)
        {
            chunks.Add(encoding.Decode(currentChunk).Trim(' ', '.', ',', ';'));
            currentChunk.Clear();
            currentLength = 0;
        }
    }

    // Flush any remaining tokens as the final chunk.
    if (currentChunk.Count > 0)
    {
        chunks.Add(encoding.Decode(currentChunk).Trim(' ', '.', ',', ';'));
    }

    return chunks;
}
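A call site might then look like the following (a minimal sketch; the file name and chunk size are illustrative, not part of the proposal):

// Illustrative usage of the SplitText extension above:
// read a long document and break it into ~1024-token chunks.
var longDocument = File.ReadAllText("large-document.txt");
List<string> chunks = longDocument.SplitText(maxTokens: 1024);

foreach (var chunk in chunks)
{
    // Each chunk now fits within the model's token limit and can be
    // summarized, embedded, or sent in a completion request on its own.
    Console.WriteLine($"Chunk length: {chunk.Length} characters");
}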
@1983Thai thanks, do you know of any related C# commits?
There are some line- and paragraph-splitting implementations that you can use today in SK, e.g.: https://github.com/microsoft/semantic-kernel/blob/main/dotnet/src/SemanticKernel/Text/TextChunker.cs
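For reference, a rough usage sketch of those existing splitters (assuming the TextChunker.SplitPlainTextLines and SplitPlainTextParagraphs methods from Microsoft.SemanticKernel.Text; the documentText variable and token budgets are illustrative):

using Microsoft.SemanticKernel.Text;

// Split the raw text into short lines, then regroup the lines into
// paragraphs that stay under a rough per-paragraph token budget.
// Note: the "token" counts here are character-based estimates, not real tokens.
var lines = TextChunker.SplitPlainTextLines(documentText, maxTokensPerLine: 40);
var paragraphs = TextChunker.SplitPlainTextParagraphs(lines, maxTokensPerParagraph: 120);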
Another library we've seen used is Blingfire, I've used this in a little utility project over here: https://github.com/craigomatic/sk-ingest/blob/e6be94ecfaae03e02addda58aad0f51eacf59a31/Data/SummaryTransform.cs#L37
Neither of these splits based on tokens, however.
Can you highlight some of the advantages that token-based splitting would provide?
@craigomatic
I share the same perspective as @zs-dima on this issue. There is an existing comment in the TokenCount function of TextChunker.cs that might be worth looking at.
Moreover, OpenAI suggests some advantages of token-based splitting here:
Splitting text strings into tokens is useful because GPT models see text in the form of tokens. Knowing how many tokens are in a text string can tell you (a) whether the string is too long for a text model to process and (b) how much an OpenAI API call costs (as usage is priced by token).
We could potentially leverage open-source C# libraries like SharpToken or GPT Tokenizer to tackle this.
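For instance, a minimal token counter backed by SharpToken could replace the character-based estimate (a sketch; the TokenCounter class is hypothetical and not part of SK):

using SharpToken;

// Hypothetical helper: measure length in real tokens rather than characters,
// so chunk boundaries line up with what the model actually consumes.
public static class TokenCounter
{
    private static readonly GptEncoding Encoding = GptEncoding.GetEncoding("cl100k_base");

    public static int CountTokens(string text) => Encoding.Encode(text).Count;
}

A chunker could then accumulate lines or sentences until CountTokens of the candidate chunk reaches the limit.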
If you agree this is a valuable addition, I'd be more than happy to work on a PR to implement this.
Makes sense - some related discussion over at #478
@dluc is the prior work something that the community could continue or something the SK team will own?
I believe this has been fixed with #2146. Closing.