SharpToken
add support for utf8 input and output
Proposed API:
public List<int> EncodeFromUtf8(ReadOnlySpan<byte> lineToEncode, ISet<ReadOnlySpan<byte>> allowedSpecial = null, ISet<ReadOnlySpan<byte>> disallowedSpecial = null);
public byte[] DecodeToUtf8(IEnumerable<int> inputTokensToDecode);
Hi!
Thanks for reaching out. Why do you need this?
We are implementing a process that substitutes certain tokens with alternative tokens in the OpenAI request and restores them in the response. We adopted this approach to reduce the total number of tokens used.
To do this, we are using the SharpToken library. However, we have run into an encoding issue: the OpenAI API accepts and returns UTF-8 data, whereas our replacements cause discrepancies when mapped onto C# UTF-16 strings.
As a temporary solution, we have been extracting the BytePairEncodingCore from GptEncoding using reflection, and invoking the DecodeNative function on it. This has been providing the expected results.
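To illustrate one way such a discrepancy can arise, here is a minimal sketch that uses only the .NET base class library, not SharpToken. It assumes a token boundary that falls in the middle of a multi-byte UTF-8 character. The class name, method names, and byte split are hypothetical and chosen only for illustration; our actual failure cases may differ.

```csharp
using System;
using System.Linq;
using System.Text;

public static class Utf8SplitDemo
{
    // Decodes each chunk to a UTF-16 string on its own, then concatenates the strings.
    public static string DecodePerChunk(params byte[][] chunks) =>
        string.Concat(chunks.Select(c => Encoding.UTF8.GetString(c)));

    // Concatenates the raw bytes first, then decodes the whole buffer once.
    public static string DecodeJoined(params byte[][] chunks) =>
        Encoding.UTF8.GetString(chunks.SelectMany(c => c).ToArray());

    public static void Main()
    {
        // U+1F600 is F0 9F 98 80 in UTF-8. The split after two bytes stands in
        // for a hypothetical token boundary inside the character.
        byte[] first = { 0xF0, 0x9F };
        byte[] second = { 0x98, 0x80 };

        // Joining bytes before decoding recovers the original character.
        Console.WriteLine(DecodeJoined(first, second) == "\U0001F600");

        // Decoding each chunk separately yields U+FFFD replacement characters,
        // so the original bytes can no longer be recovered from the string.
        Console.WriteLine(DecodePerChunk(first, second).IndexOf('\uFFFD') >= 0);
    }
}
```

Working on bytes end to end, as the proposed EncodeFromUtf8 and DecodeToUtf8 would allow, avoids this lossy intermediate step.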