Unofficial bindings / ports in other languages

Open hauntsaninja opened this issue 2 years ago • 11 comments

The following projects are not maintained by OpenAI. I cannot vouch that any of them are correct or safe to use. Use at your own risk.

Note that if a tokeniser fails to exactly match tiktoken's behaviour, you may get worse results when sampling from models, with no warning.

JavaScript

  • https://github.com/dqbd/tiktoken
  • https://github.com/ceifa/tiktoken-node
  • https://github.com/niieani/gpt-tokenizer
  • The gpt-3-encoder package will work for most GPT-3 models. However, it will often appear to work for Codex or GPT-3.5 while actually being out of distribution, and it will not work at all for GPT-4 or embeddings models.

Rust

  • https://github.com/zurawiki/tiktoken-rs

Java

  • https://github.com/eisber/tiktoken
  • https://github.com/knuddelsgmbh/jtokkit

Ruby

  • https://github.com/volition-co/tiktoken

C#

  • https://github.com/dmitry-brazhenko/SharpToken

Go

  • https://github.com/tiktoken-go/tokenizer
  • https://github.com/pkoukk/tiktoken-go

PHP

  • https://github.com/danny50610/bpe-tokeniser

Kotlin

  • https://github.com/aallam/ktoken

Thanks to everyone for building useful things!

I'm happy to link to other projects in this comment.

hauntsaninja avatar Apr 05 '23 22:04 hauntsaninja

👋,

I built a port for Go that you can find at the link below:

https://github.com/tiktoken-go/tokenizer

bluescreen10 avatar Apr 06 '23 20:04 bluescreen10

I am currently using another port in Go: https://github.com/pkoukk/tiktoken-go

fang2hou avatar Apr 09 '23 14:04 fang2hou

Hello @hauntsaninja, I was looking at https://github.com/openai/tiktoken/blob/main/src/lib.rs and it appears to be written in Rust. Could this be open sourced into a crate of its own?

rex-remind101 avatar May 13 '23 01:05 rex-remind101

See the FAQ https://github.com/openai/tiktoken/issues/98

hauntsaninja avatar May 13 '23 01:05 hauntsaninja

@hauntsaninja would it be possible to publish the full test suite publicly? That would make it easier to tell whether a given implementation matches (or is close to) the official implementation.
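
Until such a suite exists, one rough way to approximate this kind of check is a small parity harness that runs the same sample strings through a reference encoder and a candidate port and reports every disagreement. The sketch below is only an illustration: `find_mismatches`, `reference_encode`, `buggy_encode`, and `SAMPLES` are hypothetical names, and the two encoders are toy stand-ins so the example runs without any dependency. In practice the reference would presumably be tiktoken's own `encode` for the encoding under test.

```python
from typing import Callable, Iterable, List, Tuple

Encoder = Callable[[str], List[int]]


def find_mismatches(
    reference: Encoder, candidate: Encoder, samples: Iterable[str]
) -> List[Tuple[str, List[int], List[int]]]:
    """Return (text, reference_tokens, candidate_tokens) for each disagreement."""
    mismatches = []
    for text in samples:
        expected = reference(text)
        actual = candidate(text)
        if expected != actual:
            mismatches.append((text, expected, actual))
    return mismatches


# Toy stand-ins (assumptions, NOT real tokenisers).
def reference_encode(text: str) -> List[int]:
    # Pretend reference: one token per UTF-8 byte.
    return list(text.encode("utf-8"))


def buggy_encode(text: str) -> List[int]:
    # Pretend port with a bug: one token per code point,
    # which only agrees with the reference on ASCII input.
    return [ord(ch) for ch in text]


# Sample inputs should include empty strings, whitespace, and non-English text.
SAMPLES = ["hello world", "", "  leading spaces", "你好", "na\u00efve caf\u00e9"]

mismatches = find_mismatches(reference_encode, buggy_encode, SAMPLES)
for text, expected, actual in mismatches:
    print(f"{text!r}: reference={len(expected)} tokens, candidate={len(actual)} tokens")
```

A harness like this can only show that two implementations agree on the inputs it was given; it cannot prove that a port matches everywhere, so the breadth of the sample set matters.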

danielcompton avatar May 15 '23 02:05 danielcompton

Here's a pure JavaScript / TypeScript port of tiktoken: https://github.com/niieani/gpt-tokenizer Playground online: https://gpt-tokenizer.dev

niieani avatar Jun 01 '23 02:06 niieani

Hi, for non-English text such as Chinese, the token calculations are incorrect: [image] Compare with the OpenAI token calculator: [image]

shylockWu avatar Sep 04 '23 09:09 shylockWu

@shylockWu they're not incorrect. You've set gpt-tokenizer to tokenize using the GPT-3.5/GPT-4 encoding, whereas the official OpenAI token calculator uses the older GPT-3 encoding. If you switch the playground to use the older model, you'll get the same result.
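
To illustrate why the same text can produce different token counts under different encodings, here is a deliberately simplified sketch. It uses greedy longest-match over two made-up vocabularies, which is not the byte-pair-encoding algorithm tiktoken uses; `greedy_tokenize`, `OLD_VOCAB`, and `NEW_VOCAB` are hypothetical names invented for this example.

```python
from typing import List, Set


def greedy_tokenize(text: str, vocab: Set[str]) -> List[str]:
    """Greedy longest-match tokenisation; unknown text falls back to single characters."""
    max_len = max((len(entry) for entry in vocab), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Two toy vocabularies (assumptions): the "newer" one also covers some Chinese words.
OLD_VOCAB = {"token", "izer"}
NEW_VOCAB = {"token", "izer", "你好", "世界"}

TEXT = "你好世界 tokenizer"

old_tokens = greedy_tokenize(TEXT, OLD_VOCAB)
new_tokens = greedy_tokenize(TEXT, NEW_VOCAB)
print(len(old_tokens), old_tokens)
print(len(new_tokens), new_tokens)
```

Both results are valid tokenisations of the same string; the counts differ only because the vocabularies differ, which is why two tools should be set to the same encoding before their numbers are compared.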

niieani avatar Sep 05 '23 02:09 niieani

👋

I ported a version to PHP; link here:

https://github.com/danny50610/bpe-tokeniser

danny50610 avatar Sep 11 '23 05:09 danny50610

I have built and published a port for Kotlin: https://github.com/aallam/ktoken :)

aallam avatar Oct 11 '23 15:10 aallam

Pure Haskell implementation of tiktoken: https://hackage.haskell.org/package/tiktoken

Gabriella439 avatar Aug 31 '24 20:08 Gabriella439