
Split text by Tokens

zs-dima opened this issue 1 year ago · 3 comments

It might be nice to add the ability to split text by tokens. LLMs currently have context limits; for example, GPT-4 allows a maximum of 8,192 tokens. For larger texts, it would be useful to split the input according to the model's token limit. LangChain has a RecursiveCharacterTextSplitter function, but it splits by characters rather than tokens, which is not ideal for splitting text destined for an LLM.

For example, with a tiktoken-style tokenizer we can split text as shown below; it might be nice to include a similar function in Semantic Kernel.

// Splits text into chunks of at most maxTokens tokens, using the
// cl100k_base BPE encoding (e.g. via the SharpToken library).
public static List<string> SplitText(this string text, int maxTokens = 1024)
{
    var encoding = GptEncoding.GetEncoding("cl100k_base");

    var tokenizedText = encoding.Encode(text.Trim());
    var chunks = new List<string>();
    var currentChunk = new List<int>();

    foreach (var token in tokenizedText)
    {
        currentChunk.Add(token);

        // Once the chunk reaches the limit, decode it back to text.
        if (currentChunk.Count >= maxTokens)
        {
            chunks.Add(encoding.Decode(currentChunk).Trim(' ', '.', ',', ';'));
            currentChunk.Clear();
        }
    }

    // Flush any remaining tokens as a final, shorter chunk.
    if (currentChunk.Count > 0)
    {
        chunks.Add(encoding.Decode(currentChunk).Trim(' ', '.', ',', ';'));
    }

    return chunks;
}
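For illustration, the same chunking loop can be sketched in Python. This is a minimal sketch only: a trivial whitespace "tokenizer" stands in for a real BPE encoding such as cl100k_base, so the function name and tokenization here are assumptions, not Semantic Kernel code; for real use you would swap in tiktoken's encode/decode.

```python
def split_text(text: str, max_tokens: int = 1024) -> list[str]:
    # Stand-in tokenizer: whitespace split instead of a real BPE encode.
    tokens = text.strip().split()
    chunks = []
    # Take fixed-size runs of tokens and "decode" them back to text.
    for i in range(0, len(tokens), max_tokens):
        chunk = " ".join(tokens[i : i + max_tokens])
        chunks.append(chunk.strip(" .,;"))
    return chunks
```

Every chunk except possibly the last contains exactly `max_tokens` tokens, mirroring the C# version above.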

zs-dima avatar May 26 '23 15:05 zs-dima

@1983Thai thanks, do you know of any related C# commits?

zs-dima avatar May 27 '23 10:05 zs-dima

There are some line- and paragraph-splitting implementations that you can use today in SK, e.g.: https://github.com/microsoft/semantic-kernel/blob/main/dotnet/src/SemanticKernel/Text/TextChunker.cs

Another library we've seen used is Blingfire, I've used this in a little utility project over here: https://github.com/craigomatic/sk-ingest/blob/e6be94ecfaae03e02addda58aad0f51eacf59a31/Data/SummaryTransform.cs#L37

Neither of these splits based on tokens, however.

Can you highlight some of the advantages that token-based splitting would provide?

craigomatic avatar Jun 02 '23 23:06 craigomatic

@craigomatic

I share @zs-dima's perspective on this issue. There is an existing TODO comment in the TokenCount function of TextChunker.cs that seems worth looking at:

// TODO: partitioning methods should be configurable to allow for different tokenization strategies depending on the model to be called. For now, we use an extremely rough estimate.

Moreover, OpenAI outlines some advantages of token-based splitting here:

> Splitting text strings into tokens is useful because GPT models see text in the form of tokens. Knowing how many tokens are in a text string can tell you (a) whether the string is too long for a text model to process and (b) how much an OpenAI API call costs (as usage is priced by token).
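As a rough illustration of the quoted point (b), cost estimation is just token count times a per-token rate. The price used below is a hypothetical placeholder, not actual OpenAI pricing:

```python
def estimate_cost(token_count: int, price_per_1k_tokens: float) -> float:
    # Usage is priced per token; providers typically quote per 1K tokens.
    # The rate passed in here is a hypothetical placeholder value.
    return token_count / 1000 * price_per_1k_tokens

# e.g. a full 8,192-token prompt at a hypothetical $0.03 per 1K tokens:
cost = estimate_cost(8192, 0.03)  # 0.24576
```

Point (a) follows the same way: compare the token count against the model's context limit before sending the request.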

We could potentially leverage open-source C# libraries like SharpToken or GPT Tokenizer to tackle this.

If you agree this is a valuable addition, I'd be more than happy to work on a PR to implement this.

MonsterCoder avatar Jun 04 '23 04:06 MonsterCoder

Makes sense - some related discussion over at #478

@dluc is the prior work something that the community could continue or something the SK team will own?

craigomatic avatar Jun 05 '23 17:06 craigomatic

I believe this has been fixed with #2146. Closing.

lemillermicrosoft avatar Sep 08 '23 19:09 lemillermicrosoft