
Performance improvements

Open kayhantolga opened this issue 2 years ago • 3 comments

Hello,

Thank you for your hard work on the project. I made some improvements to the code with help from ChatGPT, and it resulted in a 40% increase in performance. I plan to use the improved version of the tokenizer in my OpenAI SDK project, and I suggest you consider adopting the changes too. All the unit tests pass, though my ReSharper also made some unnecessary code changes along the way; apologies for that.

Additionally, I think changing BYTES_TO_UNICODE_CACHE to a Dictionary may provide a slight performance boost, but I'm not sure if it's safe. I don't have enough knowledge on this topic yet.
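
For illustration, here is a minimal sketch of what I mean, assuming the cache maps each raw byte value to a printable unicode stand-in as in the GPT-2 byte-level BPE scheme (names are illustrative, not the exact ones in the repo):

    // A minimal sketch, assuming BYTES_TO_UNICODE_CACHE maps each byte value to a
    // printable unicode stand-in character (GPT-2 byte-level BPE scheme).
    // Built once in a static initializer, a Dictionary is safe for concurrent
    // reads as long as nothing mutates it afterwards.
    private static readonly Dictionary<byte, char> BytesToUnicodeCache = BuildBytesToUnicode();

    private static Dictionary<byte, char> BuildBytesToUnicode()
    {
        var map = new Dictionary<byte, char>(256);
        var shifted = 0;
        for (var b = 0; b < 256; b++)
        {
            // Printable bytes map to themselves; everything else is shifted past 255.
            var printable = (b >= '!' && b <= '~') || (b >= 0xA1 && b <= 0xAC) || (b >= 0xAE && b <= 0xFF);
            map[(byte)b] = printable ? (char)b : (char)(256 + shifted++);
        }
        return map;
    }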

My repo: https://github.com/betalgo/openai

[benchmark1 screenshot]

kayhantolga · Feb 24 '23 20:02

Thank you @kayhantolga, this is great! :-)

FYI, I'm also porting the tokenizer into Semantic Kernel ("SK") and will take these improvements over there too; perfect timing. As a plus, with SK the library will also be available as a NuGet package signed by Microsoft, which is safer for production.

The repo here will remain up anyway, and I hope to check in some other work in the future.

By the way, since you're looking at this: I found that OpenAI's regex seems to be vulnerable to ReDoS (regular-expression denial of service). Any chance you've looked into that too?

string pat = @"'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+";

I'm going to add a timeout to the regex; you might want to copy that too.
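
Something along these lines, using the Regex constructor overload that takes a match timeout (the one-second budget here is just a placeholder value):

    // Regex with a match timeout: a runaway match on pathological input throws
    // RegexMatchTimeoutException instead of spinning indefinitely.
    private static readonly Regex EncodingRegex = new Regex(
        @"'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+",
        RegexOptions.Compiled,
        TimeSpan.FromSeconds(1));

Callers of EncodingRegex.Matches(...) then need to handle RegexMatchTimeoutException.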

dluc · Mar 22 '23 07:03

@kayhantolga I tried the changes with a single big file and could not see any perf difference, so it looks like the gain only comes from when the cache instances are created. Could you share some info about the stress test you ran?

dluc · Mar 23 '23 07:03

I was thinking of publishing my tests, but I realized they are a mess at the moment and need tidying up, so I've attached a screenshot you can check instead. 😅

My improvements focused on caching the regular expressions; the gain is most noticeable when handling many requests with small pieces of data.
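
Roughly, the idea is something like this (a sketch with illustrative names, not the exact code from our repo):

    private const string Pattern =
        @"'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+";

    // Before: every call pays the regex parse (and, with Compiled, JIT) cost.
    public static MatchCollection MatchUncached(string text) =>
        new Regex(Pattern, RegexOptions.Compiled).Matches(text);

    // After: the Regex is constructed once and reused across requests, which is
    // where the gain shows up when processing many small inputs.
    private static readonly Regex EncodingRegex = new Regex(Pattern, RegexOptions.Compiled);

    public static MatchCollection MatchCached(string text) => EncodingRegex.Matches(text);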

Unfortunately, I do not have experience with regex, so I can't help with ReDoS.

Another improvement we made in our repository is returning results lazily via yield return and adding a separate method that returns only the token count. These changes were not included in the benchmark, but I believe they will also improve performance (code shared below).

[benchmark screenshot]

    /// <summary>
    ///     Encodes the text into GPT-3 token ids. This method assumes LF-style line endings;
    ///     if your text uses CRLF-style line endings, set cleanUpCREOL to true.
    /// </summary>
    /// <param name="text">The text to encode.</param>
    /// <param name="cleanUpCREOL">Set to true to strip CR characters before encoding.</param>
    /// <returns>The token ids, yielded lazily.</returns>
    public static IEnumerable<int> Encode(string text, bool cleanUpCREOL = false)
    {
        if (string.IsNullOrEmpty(text))
        {
            yield break;
        }

        if (cleanUpCREOL)
        {
            text = text.Replace("\r", "");
        }

        foreach (var newToken in GetNewTokens(text))
        {
            yield return TokenizerGpt3Settings.Encoder[newToken];
        }
    }

    /// <summary>
    ///     Gets the token count without materializing the token ids. This method assumes
    ///     LF-style line endings; if your text uses CRLF-style line endings, set cleanUpCREOL to true.
    /// </summary>
    /// <param name="text">The text to count tokens for.</param>
    /// <param name="cleanUpCREOL">Set to true to strip CR characters before counting.</param>
    /// <returns>The number of tokens.</returns>
    public static int TokenCount(string text, bool cleanUpCREOL = false)
    {
        if (string.IsNullOrEmpty(text))
        {
            return 0;
        }

        if (cleanUpCREOL)
        {
            text = text.Replace("\r", "");
        }

        // Count the sub-tokens without looking up their ids in the Encoder.
        return GetNewTokens(text).Count();
    }

    private static IEnumerable<string> GetNewTokens(string text)
    {
        var byteEncoder = BytesToUnicodeCache;
        var matches = EncodingRegex.Matches(text);

        foreach (var match in matches.Cast<Match>())
        {
            // Map the UTF-8 bytes of the match to their unicode stand-in characters.
            var tokenBytes = Encoding.UTF8.GetBytes(match.Value);
            var token = new string(Array.ConvertAll(tokenBytes, x => byteEncoder[x]));

            // Byte-pair encoding returns a space-separated string of sub-tokens.
            var newTokens = BytePairEncoding(token).Split(' ');

            foreach (var newToken in newTokens)
            {
                yield return newToken;
            }
        }
    }
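
For example, the two entry points can be used like this (the Tokenizer class name is assumed here; ToList() needs System.Linq):

    // Hypothetical usage: Encode streams token ids lazily, while TokenCount
    // skips the Encoder id lookup entirely and just counts sub-tokens.
    var ids = Tokenizer.Encode("Hello, world!").ToList();
    var count = Tokenizer.TokenCount("Hello, world!");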

kayhantolga · Mar 23 '23 10:03