
tiktoken is a fast BPE tokeniser for use with OpenAI's models.

Results: 87 tiktoken issues, sorted by most recently updated

This is the code I'm trying to run: `tokenizer = tiktoken.get_encoding("gpt2")`, and this is the error I get: `{"name": "ValueError", "message": "Unknown encoding gpt2. Plugins found: ['tiktoken_ext.openai_public']", "stack":` ...
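
A quick way to check what the installed version actually registers (a sketch, assuming a recent tiktoken where `tiktoken_ext.openai_public` provides the `gpt2` name):

```python
import tiktoken

# List the encoding names registered by the installed plugins; on recent
# versions this includes "gpt2" from tiktoken_ext.openai_public.
print(tiktoken.list_encoding_names())

# If "gpt2" is listed, this works:
enc = tiktoken.get_encoding("gpt2")
print(enc.encode("hello world"))

# If it is not, "r50k_base" is effectively the same 50257-token BPE under its
# base name (assumption: the missing alias comes from an older tiktoken version,
# so upgrading should also fix it).
enc = tiktoken.get_encoding("r50k_base")
```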

Hi, I am trying to train tiktoken on a custom dataset (15 GB) with a 30k vocab size. It seems it will take a very long time to finish. 1 vocab...
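
For context, the only trainer shipped with tiktoken is the pure-Python educational one, which is not built for corpora of this size; that largely explains the long runtime. A minimal sketch, assuming `SimpleBytePairEncoding.train(data, vocab_size, pat_str)` is its signature and using a hypothetical sample file:

```python
from tiktoken._educational import SimpleBytePairEncoding

# GPT-2 style pre-tokenization pattern (as used in tiktoken's public encodings).
gpt2_pat = r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

# Assumption: train(data, vocab_size, pat_str) is the educational trainer's
# signature. It is pure Python and intended for demonstration, so training a
# 30k vocab on anything close to 15 GB will be extremely slow.
with open("corpus_sample.txt", encoding="utf-8") as f:  # hypothetical sample file
    data = f.read()

enc = SimpleBytePairEncoding.train(data, vocab_size=30_000, pat_str=gpt2_pat)
print(enc.encode("hello world"))
```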

There are some inconvenient workarounds for using this software without making an internet connection (which adds **considerable latency on unstable networks**). This use case should see official support. I propose...
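One of those workarounds is to pre-populate the download cache and point `TIKTOKEN_CACHE_DIR` at it, so `get_encoding()` never touches the network. A sketch, assuming the cache key is the SHA-1 hex digest of the blob URL (as done in `tiktoken/load.py`) and using a hypothetical cache path; the URL for your version can be read from `tiktoken_ext/openai_public.py`:

```python
import hashlib
import os
import shutil

# Hypothetical local cache directory to ship with the application.
cache_dir = "/opt/tiktoken_cache"
os.makedirs(cache_dir, exist_ok=True)

# Assumption: cached files are named after the SHA-1 of the remote blob URL.
blob_url = "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken"
cache_key = hashlib.sha1(blob_url.encode()).hexdigest()

# Copy a previously downloaded cl100k_base.tiktoken file into the cache.
shutil.copy("cl100k_base.tiktoken", os.path.join(cache_dir, cache_key))

# Must be set before the encoding is loaded.
os.environ["TIKTOKEN_CACHE_DIR"] = cache_dir

import tiktoken
enc = tiktoken.get_encoding("cl100k_base")  # served from the local cache
```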

What else do I need to do after `pip install tiktoken` to use this encoding? ![image](https://github.com/user-attachments/assets/58dbec9e-a692-4d61-b8e7-aa6dd07bb3b2)
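
For completeness, basic usage after installation looks roughly like this (assuming a tiktoken version recent enough to know about gpt-4o):

```python
import tiktoken

# Pick an encoding either by name or from a model name, then encode/decode.
enc = tiktoken.encoding_for_model("gpt-4o")  # resolves to the model's encoding
tokens = enc.encode("tiktoken is great!")
print(tokens)
print(enc.decode(tokens))
```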

https://www.youtube.com/watch?v=8YnyAjkOap8

This PR realizes the wish expressed in the current code to use the faster `Regex`. The text is split into pieces before tokenization, according to regular expression patterns. This PR drops...
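
For readers unfamiliar with that step: pre-tokenization splits the input with a regular expression, and BPE then runs on each piece separately. A small Python illustration using the GPT-2-style pattern (the PR itself concerns which Rust regex engine performs this split, not the behaviour shown here):

```python
import regex  # third-party module; supports \p{L}, \p{N} classes

# The GPT-2 style split pattern used by tiktoken's public encodings.
pat = regex.compile(
    r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)

# Each returned piece is tokenized independently by BPE.
print(pat.findall("Hello world, it's 2024!"))
# -> ['Hello', ' world', ',', ' it', "'s", ' 2024', '!']
```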

As the title says: `ValueError: not enough values to unpack (expected 2, got 1)` when calling `tiktoken.get_encoding("cl100k_base")`.
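
This error usually points at a corrupted or partially downloaded cached `.tiktoken` file: each line is parsed as a `token rank` pair, so a truncated line yields only one value. Clearing the cache and letting tiktoken re-download generally fixes it. A sketch, assuming the default temp-dir cache location used when `TIKTOKEN_CACHE_DIR` is unset:

```python
import os
import shutil
import tempfile

# Assumption: no TIKTOKEN_CACHE_DIR is set, so the default temp-dir cache is used.
cache_dir = os.environ.get(
    "TIKTOKEN_CACHE_DIR", os.path.join(tempfile.gettempdir(), "data-gym-cache")
)
shutil.rmtree(cache_dir, ignore_errors=True)

import tiktoken
enc = tiktoken.get_encoding("cl100k_base")  # re-downloads a fresh copy
```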

The encoding is different for gpt-4o than for gpt-4 models. The current implementation matches `"ft:gpt-4": "cl100k_base"` in the `MODEL_PREFIX_TO_ENCODING` dictionary for fine-tuned models based on gpt-4o.
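
A minimal sketch of the mismatch, with a hypothetical fine-tuned model name; the workaround is to select `o200k_base` explicitly for gpt-4o fine-tunes:

```python
import tiktoken

# A fine-tuned gpt-4o model name starts with "ft:gpt-4", so prefix matching
# against MODEL_PREFIX_TO_ENCODING can resolve it to cl100k_base even though
# gpt-4o itself uses o200k_base. (Model name below is hypothetical.)
model = "ft:gpt-4o-2024-08-06:my-org::abc123"

enc = tiktoken.encoding_for_model(model)
print(enc.name)  # may print "cl100k_base" on affected versions

# Workaround: pick the gpt-4o encoding explicitly.
enc = tiktoken.get_encoding("o200k_base")
```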

I think it isn't supported yet? Is there anything planned?