tiktoken
tiktoken is a fast BPE tokeniser for use with OpenAI's models.
This PR builds on top of #40, which introduces both Java bindings and a split Rust core.
**Issue** I ran `encoding = tiktoken.get_encoding("cl100k_base")` and encountered the following error: `SSLError: HTTPSConnectionPool(host='openaipublic.blob.core.windows.net', port=443): Max retries exceeded with url: /encodings/cl100k_base.tiktoken (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1129)')))`...
Unknown encoding cl100k_base. Plugins found: ['tiktoken_ext.openai_public']
This code here: https://github.com/openai/tiktoken/blob/39f29cecdb6fc38d9a3434e5dd15e4de58cf3c80/tiktoken/core.py#L375-L383 I wanted to do something special on this exception in my own code, so I had to write this: ```python try: tokens = encoding.encode(text, **kwargs) except...
Hello, I'm trying to create a custom tokenizer but am getting "pyo3_runtime.PanicException: no entry found for key" despite being sure of coverage. This seems to happen when a character that...
I'm working on a PR and would like to understand the reason for the behaviour of this `_encode_bytes` function when it hits an invalid UTF-8 sequence, to ensure I don't...
We have implemented a lot of logic around token counting for ChatCompletion requests, and it feels like the logic should go in a separate package. I'm wondering if tiktoken would...
https://github.com/openai/tiktoken/blob/095924e02c85617df6889698d94515f91666c7ea/src/lib.rs#L524 Hello, I'm reading the lib.rs code and found the `encode_with_unstable` API. It doesn't seem to be used in the documentation, yet it takes up a large part of lib.rs, and...
**Description:** The current implementation of the _byte_pair_merge function in the BPE code could benefit from optimization to improve performance. By applying certain optimizations, such as using inclusive range slicing and...
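For readers unfamiliar with the merge step this last issue refers to, here is a minimal pure-Python sketch of a greedy byte pair merge. It is an illustration of the general BPE technique only, not tiktoken's actual `_byte_pair_merge` implementation. The function name `byte_pair_merge` and the `toy_ranks` table are hypothetical and chosen for this example.

```python
from __future__ import annotations


def byte_pair_merge(piece: bytes, ranks: dict[bytes, int]) -> list[bytes]:
    """Greedily merge the adjacent pair with the lowest rank until no pair is mergeable.

    Illustrative sketch only: quadratic in the length of `piece`, with no
    optimizations applied.
    """
    # Start from single bytes.
    parts = [bytes([b]) for b in piece]
    while len(parts) > 1:
        best_idx = None
        best_rank = None
        # Find the adjacent pair whose concatenation has the lowest rank.
        for i in range(len(parts) - 1):
            rank = ranks.get(parts[i] + parts[i + 1])
            if rank is not None and (best_rank is None or rank < best_rank):
                best_idx, best_rank = i, rank
        if best_idx is None:
            break  # no adjacent pair appears in the rank table
        # Replace the two parts with their concatenation.
        parts[best_idx : best_idx + 2] = [parts[best_idx] + parts[best_idx + 1]]
    return parts


# Toy rank table (hypothetical, not a real tiktoken vocabulary).
toy_ranks = {b"ab": 0, b"abc": 1, b"cd": 2}
merged = byte_pair_merge(b"abcd", toy_ranks)
```

With this toy table, `b"ab"` is merged first because it has the lowest rank, then `b"abc"`, which leaves `b"d"` as a separate piece. Any performance work on a real implementation would have to keep this merge order unchanged.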