tiktoken
tiktoken is a fast BPE tokeniser for use with OpenAI's models.
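Basic usage, for reference: load a named encoding and round-trip a string through it.

```python
import tiktoken

# Encode a string to token ids and decode it back.
enc = tiktoken.get_encoding("cl100k_base")
assert enc.decode(enc.encode("hello world")) == "hello world"
```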
This is the code I'm trying to run: `tokenizer = tiktoken.get_encoding("gpt2")`, and this is the error I get: ``` { "name": "ValueError", "message": "Unknown encoding gpt2. Plugins found: ['tiktoken_ext.openai_public']", "stack":...
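For comparison, a minimal sketch of the expected behaviour: when the `tiktoken_ext.openai_public` plugin loads correctly, `"gpt2"` should appear among the registered encodings, so listing them is a quick way to see what the installed plugins actually expose.

```python
import tiktoken

# When tiktoken_ext.openai_public is picked up, "gpt2" should be in this list.
print(tiktoken.list_encoding_names())

tokenizer = tiktoken.get_encoding("gpt2")
print(tokenizer.encode("hello world"))
```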
Hi, I am trying to train tiktoken on a custom dataset (15 GB) with a 30k vocabulary size. It seems it will take a long time to finish. 1 vocab...
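For context, tiktoken ships a pure-Python educational trainer in `tiktoken._educational` that is meant for small demonstration corpora, which would explain why a 15 GB dataset is slow. A rough sketch, assuming that module exposes a `SimpleBytePairEncoding.train(training_data, vocab_size, pat_str)` entry point; the sample file name is hypothetical.

```python
from tiktoken._educational import SimpleBytePairEncoding

# GPT-2 style splitting pattern used to cut text into pieces before merging.
PAT = r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

# Hypothetical small sample; the educational trainer is not built for 15 GB corpora.
with open("sample.txt") as f:
    data = f.read()

enc = SimpleBytePairEncoding.train(data, vocab_size=1000, pat_str=PAT)
print(enc.encode("hello world"))
```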
There are some inconvenient workarounds for using this software without making an internet connection (the network fetch adds **considerable latency on unstable networks**). This use case should get official support. I propose...
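One commonly cited workaround, sketched under the assumption that tiktoken's loader honours the `TIKTOKEN_CACHE_DIR` environment variable: pre-populate the cache on a machine with network access, ship that directory with the application, and point tiktoken at it before the first `get_encoding()` call.

```python
import os

# Assumption: the loader checks TIKTOKEN_CACHE_DIR before downloading BPE files.
# The path below is a placeholder for a directory bundled with the application.
os.environ["TIKTOKEN_CACHE_DIR"] = "/path/to/bundled/tiktoken_cache"

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # served from the local cache, no network call
```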
What else do I need to do after `pip install` to use this encoding?
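Normally nothing beyond `pip install tiktoken` is needed; a minimal usage sketch:

```python
import tiktoken

# Look up the encoding by model name, or load one directly with get_encoding().
enc = tiktoken.encoding_for_model("gpt-4")
tokens = enc.encode("tiktoken is great!")
print(tokens)
print(enc.decode(tokens))
```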
https://www.youtube.com/watch?v=8YnyAjkOap8
This PR realizes the wish expressed in the current code to use the faster `Regex`. Before tokenization, the text is split into pieces according to regular expression patterns. This PR drops...
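For readers unfamiliar with the pre-splitting step this PR touches: before byte-pair merging, the input is cut into pieces by a regular expression, and BPE then runs on each piece independently. A rough Python illustration of that idea using the GPT-2 splitting pattern (the PR itself concerns the Rust regex backend, not this Python code):

```python
import regex  # third-party `regex` module; needed for \p{L} / \p{N} classes

# GPT-2 splitting pattern: contractions, words, numbers, punctuation runs and
# whitespace become separate pieces, each of which is BPE-merged on its own.
PAT = r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

pieces = regex.findall(PAT, "Hello world, it's 2024!")
print(pieces)  # ['Hello', ' world', ',', " it", "'s", ' 2024', '!']
```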
As the title says: `ValueError: not enough values to unpack (expected 2, got 1)` when calling `tiktoken.get_encoding("cl100k_base")`.
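A frequently reported cause of this error is a truncated or corrupted cached BPE file, which is consistent with the unpack failure (each cached line is expected to split into a token and a rank). A hedged recovery sketch, assuming the cache location was set via `TIKTOKEN_CACHE_DIR`; otherwise the cache lives in a temporary directory chosen by tiktoken.

```python
import os
import shutil

# If a custom cache directory is in use, wiping it forces a fresh download of
# the BPE files on the next get_encoding() call.
cache_dir = os.environ.get("TIKTOKEN_CACHE_DIR")
if cache_dir and os.path.isdir(cache_dir):
    shutil.rmtree(cache_dir)

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
```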
The encoding is different for gpt-4o than for gpt-4 models. The current implementation matches `"ft:gpt-4": "cl100k_base"` in the `MODEL_PREFIX_TO_ENCODING` dictionary for fine-tuned models based on gpt-4o.
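A workaround sketch while the prefix table does not distinguish gpt-4o fine-tunes: select the gpt-4o encoding explicitly (gpt-4o models use `o200k_base`) instead of relying on the `ft:` prefix lookup.

```python
import tiktoken

# Explicitly pick the gpt-4o encoding rather than letting a "ft:gpt-4o-..." name
# fall through to the "ft:gpt-4" prefix, which maps to cl100k_base.
enc = tiktoken.get_encoding("o200k_base")
print(enc.encode("hello world"))
```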
I think it isn't supported yet? Is there anything planned?