Support for gpt-3.5-turbo cl100k_base encoding
Does the package support the cl100k_base encoding, which is used in ChatGPT?
"hello world" encoded: [31373, 995]
https://github.com/openai/tiktoken/blob/main/tests/test_simple_public.py
This matches the gpt2 encoding scheme, so it is probably not the same, and we would need the updated vocab.bpe
and encoder map to support the new version. That is probably also why the encoding length is off. I would say it's probably fine to use for estimation, but I would not rely on it for the more complicated models until we can find and implement the new version that is used for embeddings.
Some more info on the tokenizers used by OpenAI:
https://news.ycombinator.com/item?id=34008839
TODO: extract the 100k data from the tokenizer, then either compile the Rust lib to WebAssembly or reimplement it in C++
I am going to put this on hold, because it's a good enough approximation for front-end user input validation if we add a 5-10% buffer, but it would be nice to have a JS implementation of the new version. If anyone wants to help with that, I would appreciate it. Even just building the Python version and dumping the data to some JSON files ...
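The buffer idea above can be sketched as a small helper. This is just an illustration, not part of the package: `fitsWithinLimit` is a hypothetical name, and the 10% default reflects the 5-10% range suggested above.

```javascript
// Hypothetical helper: pad a token-count estimate from the old GPT-2
// encoder before comparing it against a model's context limit, since
// cl100k_base tokenizes text differently and the raw count is only
// an approximation.
function fitsWithinLimit(estimatedTokens, modelLimit, bufferRatio = 0.1) {
  // Inflate the estimate by the buffer (default 10%) and round up,
  // so borderline inputs are rejected rather than silently truncated.
  const padded = Math.ceil(estimatedTokens * (1 + bufferRatio));
  return padded <= modelLimit;
}

// Example: a 3900-token estimate against a 4096-token limit fails
// with a 10% buffer (3900 * 1.1 > 4096), while 3000 passes.
```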
Thanks
7013e4097f15ef400554af6ea04248da561c3c59
I have added the data from:
https://community.openai.com/t/how-do-you-make-a-bpe-file-for-tokenizer/94752/13
https://github.com/blinkdata/c-tokenizer
We still need to process this into JS and add a new Encoder.js.
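One possible shape for that processing step, sketched here as an assumption: the published cl100k_base rank files appear to be lines of "base64-encoded token bytes, space, rank", so a loader for a new Encoder.js could look roughly like this (`parseRanks` is a hypothetical name, and the sample input is toy data, not real cl100k entries).

```javascript
// Sketch: turn a tiktoken-style ranks file into a plain JS Map.
// Assumed line format: "<base64 token bytes> <rank>".
function parseRanks(text) {
  const ranks = new Map();
  for (const line of text.split('\n')) {
    if (!line.trim()) continue;
    const [b64, rank] = line.split(' ');
    // Keep the decoded bytes as a latin1 string key; a byte-accurate
    // encoder would keep Uint8Array keys instead.
    const token = Buffer.from(b64, 'base64').toString('latin1');
    ranks.set(token, Number(rank));
  }
  return ranks;
}

// Tiny illustrative input (toy data): "hello" -> 0, " world" -> 1.
const sample = 'aGVsbG8= 0\nIHdvcmxk 1\n';
const ranks = parseRanks(sample);
// ranks.get('hello') === 0, ranks.get(' world') === 1
```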
Hi any update on this please?
The linked repository has a TS script for loading the vocab files.
It's worth stealing some of the implementation, as the JS project looks relatively clean; just the tooling is a bit bloated.
The encoding seems to be here:
https://github.com/dqbd/tiktoken/blob/072dd12962cabeca67c5088e3d8a8d006af19482/scripts/ranks.ts#L5
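For anyone picking this up, the core of that encoding is a rank-based BPE merge loop: repeatedly merge the adjacent pair whose merged form has the lowest rank. A minimal sketch, with a hypothetical `bpeEncode` function and toy ranks (not the real cl100k_base table, and character-based rather than byte-based for brevity):

```javascript
// Minimal sketch of a rank-based BPE merge loop over one chunk.
// A faithful port would operate on the UTF-8 bytes of each chunk
// produced by the tokenizer's splitting regex.
function bpeEncode(word, ranks) {
  let parts = Array.from(word); // start from single characters
  while (parts.length > 1) {
    // Find the adjacent pair whose merged form has the lowest rank.
    let best = -1;
    let bestRank = Infinity;
    for (let i = 0; i < parts.length - 1; i++) {
      const rank = ranks.get(parts[i] + parts[i + 1]);
      if (rank !== undefined && rank < bestRank) {
        bestRank = rank;
        best = i;
      }
    }
    if (best === -1) break; // no mergeable pair left
    parts.splice(best, 2, parts[best] + parts[best + 1]);
  }
  // Map each final piece to its token id (its rank).
  return parts.map((p) => ranks.get(p));
}

// Toy ranks: "he" merges first, then "ll", then "hell".
const toyRanks = new Map([
  ['h', 0], ['e', 1], ['l', 2], ['o', 3],
  ['he', 4], ['ll', 5], ['hell', 6],
]);
// bpeEncode('hello', toyRanks) → [6, 3]  ("hell" + "o")
```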