
Wrong tokenizer used for OpenAI embeddings

Open • darknoon opened this issue 2 years ago • 1 comment

I was looking through the OpenAI code and noticed that the wrong tokenizer is used for newer models such as text-embedding-ada-002, which use the cl100k_base encoding implemented by tiktoken.

There is a list of encodings here for their public models.
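
To illustrate the mismatch, here is a minimal sketch of a model-to-encoding lookup. The helper name `encodingForModel` and the table `MODEL_TO_ENCODING` are hypothetical and not part of this library or of tiktoken; the handful of entries reflects my understanding of tiktoken's published mapping and is not exhaustive.

```typescript
// Hypothetical helper: picks the tokenizer encoding by model name
// instead of assuming one encoding for every OpenAI model.
type EncodingName = "r50k_base" | "p50k_base" | "cl100k_base";

// Partial table (assumption: mirrors tiktoken's public model list).
const MODEL_TO_ENCODING: Record<string, EncodingName> = {
  "text-embedding-ada-002": "cl100k_base",
  "text-davinci-003": "p50k_base",
  "text-davinci-002": "p50k_base",
  "davinci": "r50k_base",
};

function encodingForModel(model: string): EncodingName {
  const name = MODEL_TO_ENCODING[model];
  if (name === undefined) {
    // Fail loudly rather than silently falling back to the wrong tokenizer.
    throw new Error(`No known encoding for model: ${model}`);
  }
  return name;
}
```

Throwing on unknown models is a deliberate choice in this sketch: a silent default to an older encoding is exactly the kind of bug described above, since token counts then drift from what the API actually sees.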

I'm currently looking at making a WASM build of tiktoken, though I think a pure JS approach would also work fine.

darknoon avatar Feb 18 '23 05:02 darknoon

This might work -> https://www.npmjs.com/package/@dqbd/tiktoken @darknoon

Let me know

cfortuner avatar Feb 20 '23 13:02 cfortuner