tiktoken
tiktoken is a fast BPE tokeniser for use with OpenAI's models.
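Basic usage, for reference: load a named encoding and round-trip a string through it.

```python
import tiktoken

# Encode a string to token ids and decode it back.
enc = tiktoken.get_encoding("cl100k_base")
assert enc.decode(enc.encode("hello world")) == "hello world"
```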
This is the code I'm trying to run: `tokenizer = tiktoken.get_encoding("gpt2")`, and this is the error I get: ``` { "name": "ValueError", "message": "Unknown encoding gpt2. Plugins found: ['tiktoken_ext.openai_public']", "stack":...
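For comparison, a minimal sketch of the expected behaviour: when the `tiktoken_ext.openai_public` plugin loads correctly, `"gpt2"` should appear among the registered encodings, so listing them is a quick way to see what the installed plugins actually expose.

```python
import tiktoken

# When tiktoken_ext.openai_public is picked up, "gpt2" should be in this list.
print(tiktoken.list_encoding_names())

tokenizer = tiktoken.get_encoding("gpt2")
print(tokenizer.encode("hello world"))
```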
Hi, I am trying to train tiktoken on a custom dataset (15 GB) with a 30k vocabulary size. It seems it will take a long time to finish. 1 vocab...
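For context, tiktoken ships a pure-Python educational trainer in `tiktoken._educational` that is meant for small demonstration corpora, which would explain why a 15 GB dataset is slow. A rough sketch, assuming that module exposes a `SimpleBytePairEncoding.train(training_data, vocab_size, pat_str)` entry point; the sample file name is hypothetical.

```python
from tiktoken._educational import SimpleBytePairEncoding

# GPT-2 style splitting pattern used to cut text into pieces before merging.
PAT = r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

# Hypothetical small sample; the educational trainer is not built for 15 GB corpora.
with open("sample.txt") as f:
    data = f.read()

enc = SimpleBytePairEncoding.train(data, vocab_size=1000, pat_str=PAT)
print(enc.encode("hello world"))
```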
There are some inconvenient workarounds for using this software without making an internet connection (the network fetch adds **considerable latency on unstable networks**). This use case should get official support. I propose...
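One commonly cited workaround, sketched under the assumption that tiktoken's loader honours the `TIKTOKEN_CACHE_DIR` environment variable: pre-populate the cache on a machine with network access, ship that directory with the application, and point tiktoken at it before the first `get_encoding()` call.

```python
import os

# Assumption: the loader checks TIKTOKEN_CACHE_DIR before downloading BPE files.
# The path below is a placeholder for a directory bundled with the application.
os.environ["TIKTOKEN_CACHE_DIR"] = "/path/to/bundled/tiktoken_cache"

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # served from the local cache, no network call
```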
What else do I need to do after `pip install` to use this encoding?
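Normally nothing beyond `pip install tiktoken` is needed; a minimal usage sketch:

```python
import tiktoken

# Look up the encoding by model name, or load one directly with get_encoding().
enc = tiktoken.encoding_for_model("gpt-4")
tokens = enc.encode("tiktoken is great!")
print(tokens)
print(enc.decode(tokens))
```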
https://www.youtube.com/watch?v=8YnyAjkOap8
This PR realizes the wish expressed in the current code to use the faster `Regex`. Before tokenization, the text is split into pieces according to regular expression patterns. This PR drops...
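For readers unfamiliar with the pre-splitting step this PR touches: before byte-pair merging, the input is cut into pieces by a regular expression, and BPE then runs on each piece independently. A rough Python illustration of that idea using the GPT-2 splitting pattern (the PR itself concerns the Rust regex backend, not this Python code):

```python
import regex  # third-party `regex` module; needed for \p{L} / \p{N} classes

# GPT-2 splitting pattern: contractions, words, numbers, punctuation runs and
# whitespace become separate pieces, each of which is BPE-merged on its own.
PAT = r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

pieces = regex.findall(PAT, "Hello world, it's 2024!")
print(pieces)  # ['Hello', ' world', ',', " it", "'s", ' 2024', '!']
```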
As the title says: `ValueError: not enough values to unpack (expected 2, got 1)` when calling `tiktoken.get_encoding("cl100k_base")`.
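A frequently reported cause of this error is a truncated or corrupted cached BPE file, which is consistent with the unpack failure (each cached line is expected to split into a token and a rank). A hedged recovery sketch, assuming the cache location was set via `TIKTOKEN_CACHE_DIR`; otherwise the cache lives in a temporary directory chosen by tiktoken.

```python
import os
import shutil

# If a custom cache directory is in use, wiping it forces a fresh download of
# the BPE files on the next get_encoding() call.
cache_dir = os.environ.get("TIKTOKEN_CACHE_DIR")
if cache_dir and os.path.isdir(cache_dir):
    shutil.rmtree(cache_dir)

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
```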
The encoding is different for gpt-4o than for gpt-4 models. The current implementation matches `"ft:gpt-4": "cl100k_base"` in the `MODEL_PREFIX_TO_ENCODING` dictionary for fine-tuned models based on gpt-4o.
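A workaround sketch while the prefix table does not distinguish gpt-4o fine-tunes: select the gpt-4o encoding explicitly (gpt-4o models use `o200k_base`) instead of relying on the `ft:` prefix lookup.

```python
import tiktoken

# Explicitly pick the gpt-4o encoding rather than letting a "ft:gpt-4o-..." name
# fall through to the "ft:gpt-4" prefix, which maps to cl100k_base.
enc = tiktoken.get_encoding("o200k_base")
print(enc.encode("hello world"))
```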
I think it isn't supported yet? Is there anything planned?