Add support for UTF-8 input and output

Open abdulkareemnalband opened this issue 1 year ago • 2 comments
Add support for UTF-8 input and output. The proposed API is:

public List<int> EncodeFromUtf8(ReadOnlySpan<byte> lineToEncode, ISet<ReadOnlySpan<byte>> allowedSpecial = null, ISet<ReadOnlySpan<byte>> disallowedSpecial = null);
public byte[] DecodeToUtf8(IEnumerable<int> inputTokensToDecode);

abdulkareemnalband, Sep 13 '24 13:09

Hi!

Thanks for reaching out. Why do you need this?

dmitry-brazhenko, Sep 14 '24 19:09

We are implementing a process in which certain tokens in the OpenAI request are replaced with alternative tokens and then restored in the response. We adopted this approach to reduce the total number of tokens used.

To do this, we use the SharpToken library. However, we have run into an encoding issue: the OpenAI API accepts and returns UTF-8, and our token replacements produce discrepancies when mapped onto C# UTF-16 strings.
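
A minimal sketch of the kind of discrepancy described above, using only `System.Text.Encoding` and no SharpToken calls. It assumes, purely for illustration, that a token boundary falls in the middle of a multi-byte UTF-8 sequence; the byte chunks and the `Utf8SplitDemo` class are hypothetical stand-ins, not real token data or library API. Decoding each chunk to a UTF-16 string separately replaces the incomplete bytes with U+FFFD, whereas joining the raw bytes first and decoding once preserves the text.

```csharp
using System;
using System.Linq;
using System.Text;

public static class Utf8SplitDemo
{
    // Decodes every chunk to a UTF-16 string on its own, then concatenates the strings.
    public static string DecodePerChunk(params byte[][] chunks) =>
        string.Concat(chunks.Select(c => Encoding.UTF8.GetString(c)));

    // Concatenates the raw bytes first, then decodes the whole buffer once.
    public static string DecodeJoined(params byte[][] chunks) =>
        Encoding.UTF8.GetString(chunks.SelectMany(c => c).ToArray());

    public static void Main()
    {
        // U+00E9 is encoded as 0xC3 0xA9 in UTF-8.
        // Hypothetical split: the boundary falls between those two bytes.
        byte[] first = { 0x63, 0x61, 0x66, 0xC3 }; // "caf" plus the lead byte
        byte[] second = { 0xA9 };                  // the continuation byte

        string joined = DecodeJoined(first, second);
        string perChunk = DecodePerChunk(first, second);

        // Joined bytes round-trip to the original text.
        Console.WriteLine(joined == "caf\u00E9");
        // Per-chunk decoding contains the replacement character U+FFFD.
        Console.WriteLine(perChunk.IndexOf('\uFFFD') >= 0);
    }
}
```

Under these assumptions, working on bytes until the final decode step avoids the loss, which is the motivation for byte-oriented entry points such as the proposed `EncodeFromUtf8` and `DecodeToUtf8`.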

As a temporary workaround, we extract the BytePairEncodingCore from GptEncoding using reflection and invoke its DecodeNative function. This gives the expected results.

abdulkareemnalband, Sep 16 '24 04:09