Tokenizer supports multiple encodings, compatible with .NET Standard 2.0

Open Frogley opened this issue 1 year ago • 2 comments

The tokenizer supports multiple encodings (r50k_base, p50k_base, cl100k_base) and provides Encode and Decode methods.

  // Create a tokenizer in any one of three ways (use only one in a given scope):
  Tokenizer tokenizer = new Tokenizer("cl100k_base");
  Tokenizer tokenizer = new Tokenizer().FromModelName("gpt-3.5-turbo-0301");
  Tokenizer tokenizer = new Tokenizer().FromModel(Models.Model.TextDavinciV3);

  string str = @"床前明月光,疑是地上霜,举头望明月,低头思故乡。";
  int[] res = tokenizer.Encode(str);
  // res = [11795 232 25580 31958 9953 6708 231 3922 163 244 239 21043 30590 17905 52597 250 3922 3574 122 65455 4916 249 31958 9953 3922 8687 236 65455 91763 8067 227 18259 94 1811]
  string str2 = tokenizer.Decode(res);
  // str2 = "床前明月光,疑是地上霜,举头望明月,低头思故乡。"

Frogley avatar Apr 07 '23 02:04 Frogley

Hey @Frogley, I haven't forgotten about your PR. I am just trying to understand how the tokenizer works and comparing it against your PR, which is taking up a lot of time. I apologize for the delay. :/

kayhantolga avatar May 18 '23 15:05 kayhantolga

Great. To be honest, my understanding of the tokenizer's core algorithm is somewhat vague; I never fully grasped it. Basically, my PR is a translation of tiktoken/lib.rs from Rust into C#, with some simplifications. After finishing the translation, I ran a few test cases and the results were consistent, but I didn't do any extensive testing or comparison. I hope my work can be of help to you.
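
For anyone comparing implementations, here is a minimal sketch of the byte-pair merge idea, as I understand it, that this kind of tokenizer is built around. It is not the code from the PR or from tiktoken: the `BpeSketch` class, its `Ranks` and `Encode` helpers, and the toy vocabulary are all hypothetical, and it leaves out the regex pre-splitting and special-token handling that a real tokenizer needs. It assumes every single byte of the input is present in the rank table.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Illustrative only: a naive byte-pair merge over one piece of text.
public static class BpeSketch
{
    // Hypothetical helper: turns a byte span into a dictionary key such as "61-62".
    private static string Key(byte[] data, int start, int length)
    {
        return BitConverter.ToString(data, start, length);
    }

    // Builds a toy rank table; the position of each token in the list is its rank (and its id).
    public static Dictionary<string, int> Ranks(params string[] tokens)
    {
        var ranks = new Dictionary<string, int>();
        for (int i = 0; i < tokens.Length; i++)
        {
            byte[] bytes = Encoding.UTF8.GetBytes(tokens[i]);
            ranks[Key(bytes, 0, bytes.Length)] = i;
        }
        return ranks;
    }

    public static int[] Encode(byte[] piece, Dictionary<string, int> ranks)
    {
        // bounds[i] is the start offset of part i; the last entry is piece.Length.
        // Initially every part is a single byte.
        var bounds = new List<int>();
        for (int i = 0; i <= piece.Length; i++) bounds.Add(i);

        while (bounds.Count > 2)
        {
            int bestRank = int.MaxValue;
            int bestIndex = -1;
            // Find the adjacent pair whose concatenation has the lowest rank.
            for (int i = 0; i + 2 < bounds.Count; i++)
            {
                int rank;
                string key = Key(piece, bounds[i], bounds[i + 2] - bounds[i]);
                if (ranks.TryGetValue(key, out rank) && rank < bestRank)
                {
                    bestRank = rank;
                    bestIndex = i;
                }
            }
            if (bestIndex < 0) break;       // no mergeable pair is left
            bounds.RemoveAt(bestIndex + 1); // merge parts bestIndex and bestIndex + 1
        }

        // Map each remaining part to its id in the rank table.
        var tokens = new int[bounds.Count - 1];
        for (int i = 0; i + 1 < bounds.Count; i++)
        {
            tokens[i] = ranks[Key(piece, bounds[i], bounds[i + 1] - bounds[i])];
        }
        return tokens;
    }
}
```

With the toy table `BpeSketch.Ranks("a", "b", "c", "ab", "bc", "abc")`, encoding the UTF-8 bytes of "abc" yields `[5]` (merge "ab" first, then "abc"), and "cab" yields `[2, 3]`. The rescan on every merge is quadratic, which keeps the sketch short; a production version would track candidate pairs more cleverly.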

Frogley avatar May 18 '23 16:05 Frogley