tiktoken
tiktoken is a fast BPE tokeniser for use with OpenAI's models.
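For readers unfamiliar with byte pair encoding (BPE), the idea can be sketched in a few lines of plain Python. This is a toy illustration only, not tiktoken's implementation: the function names `train_bpe`, `merge_pair`, `encode` and `decode` are hypothetical, there is no regex pre-splitting, and no attempt is made at speed. It starts from raw bytes (ids 0 to 255) and repeatedly merges the most frequent adjacent pair into a new token id.

```python
from collections import Counter


def merge_pair(tokens, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def train_bpe(data: bytes, num_merges: int):
    """Learn up to `num_merges` merge rules from raw bytes (toy version)."""
    tokens = list(data)  # start from single bytes: ids 0..255
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:  # nothing left worth merging
            break
        new_id = 256 + len(merges)  # new ids follow the 256 byte ids
        merges.append(pair)
        tokens = merge_pair(tokens, pair, new_id)
    return merges


def encode(data: bytes, merges):
    """Apply the learned merges in the order they were learned."""
    tokens = list(data)
    for rank, pair in enumerate(merges):
        tokens = merge_pair(tokens, pair, 256 + rank)
    return tokens


def decode(tokens, merges):
    """Expand token ids back into the original bytes."""
    vocab = {i: bytes([i]) for i in range(256)}
    for rank, (a, b) in enumerate(merges):
        vocab[256 + rank] = vocab[a] + vocab[b]
    return b"".join(vocab[t] for t in tokens)


if __name__ == "__main__":
    corpus = b"low lower lowest " * 4
    rules = train_bpe(corpus, num_merges=10)
    ids = encode(corpus, rules)
    print(len(corpus), "bytes ->", len(ids), "tokens")
    assert decode(ids, rules) == corpus
```

Because every token id expands back to a fixed byte string, decoding is lossless for any input, including bytes never seen during training.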
Currently the BPE_FILE is hardcoded to the path https://openaipublic.blob.core.windows.net/.... This host path presents a challenge when running in a container in an AWS VPC environment. It would be great if we...
This is a housekeeping code change suggestion. This project is released under the MIT license as per the `LICENSE` file's contents, however the current metadata notation makes handling that information...
This allows the source to be built on Debian Bookworm using the packages provided by Debian.
Should be a matter of putting `debug = true` in `Cargo.toml`.

before:
```
num_threads: 1, num_bytes: 100005605
tiktoken 6603873.890819601 bytes / s
huggingface 1668452.742767104 bytes / s
webtext encode tiktoken...
```
## Usage help

Check out this awesome tokeniser app https://tiktokenizer.vercel.app/ built by [Diagram](https://diagram.com/)!

Check out the [OpenAI cookbook](https://github.com/openai/openai-cookbook)! In particular, the following are great examples of using `tiktoken`:

- [How...
Builds on #50 to add Ruby bindings. Mostly leaving it here for awareness. It'd be great to merge in some of these refactors and/or publish the Rust library so folks...
Minor patch to enable ppc64le wheels. This is no change content-wise :-)
Code example:
```py3
enc = tiktoken.get_encoding("cl100k_base")
enc.decode([100256])
```
Trace:
```py3
thread '' panicked at 'no entry found for key', src/lib.rs:210:37
---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
/var/folders/m9/s4s3bdq96pn3dk13fbgpw6rm0000gn/T/ipykernel_9548/1299473396.py in 1...
```
Hey, considering its superiority over SPE tokenizers, could you provide some sample code for training a tiktoken tokenizer from scratch on a custom dataset? Also, like training BPE/SPE does it...