Tiktoken not published to crates.io

Open · zurawiki opened this issue 2 years ago · 11 comments

It seems that the tiktoken package cannot be added as a Rust dependency through Cargo's default registry (crates.io).

Are there plans to publish the tiktoken crate? Is it published on another registry?

Thanks for your work on this BPE encoder, I've already found it very useful!


Repro:

In a Rust project, run

cargo add tiktoken

Expected behavior:

Cargo should find, download, and add tiktoken to the project's dependencies

Actual behavior:

$ cargo add tiktoken
    Updating crates.io index
error: the crate `tiktoken` could not be found in registry index.

zurawiki avatar Jan 24 '23 21:01 zurawiki

In case useful: https://github.com/dust-tt/dust/tree/main/core/src/providers/tiktoken

spolu avatar Feb 02 '23 13:02 spolu

Very cool @spolu! I'd love to package this code as a separate crate for re-use in different Rust projects.

zurawiki avatar Feb 02 '23 17:02 zurawiki

For testing this out in other projects, I created and published a Rust crate here: https://github.com/zurawiki/tiktoken-rs

Ideally, I hope we can integrate these changes back into the original project, so I'll leave this Issue open until we hear from a maintainer.
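
Basic usage looks roughly like the sketch below; the names (p50k_base, encode_with_special_tokens) are written from memory, so treat them as assumptions and check the repo's README for the exact, current API:

use tiktoken_rs::p50k_base;

fn main() {
    // Assumed tiktoken-rs API: p50k_base() builds a CoreBPE for the p50k_base
    // encoding, and encode_with_special_tokens() tokenizes a &str.
    let bpe = p50k_base().unwrap();
    let tokens = bpe.encode_with_special_tokens("This is a test with spaces");
    println!("token count: {}", tokens.len());
}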

zurawiki avatar Feb 02 '23 19:02 zurawiki

Nice!!

spolu avatar Feb 02 '23 19:02 spolu

Thanks, I'm open to this, I just haven't spent the time to figure out Rust packaging yet :-)

I will get around to this at some point, thanks for the link to your repo!

hauntsaninja avatar Feb 02 '23 21:02 hauntsaninja

Could you also make an alternative, pure-Python version of tiktoken? For those who cannot compile and run Rust binaries on their system (for various reasons: package manager support, company policy, intranet or local machine security, Docker container limitations, VM restrictions, environment virtualization, lack of Rust support in remote Jupyter notebook hosting, etc.).

Emasoft avatar Feb 23 '23 17:02 Emasoft

This is not my area of expertise, but I have a suggestion:

You could make a Cargo workspace, create a tiktoken-lib or tiktoken-core Rust crate, and then use it from the current lib.rs. That way it is still housed within this repository.

https://crates.io/crates/cargo-workspaces is a helper that can publish individual crates within a workspace. I haven't used it myself, though.
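
Roughly, the workspace route could look like the sketch below; the crate names and layout are placeholders for illustration, not something that exists in the repo today:

# Cargo.toml at the repository root (hypothetical layout)
[workspace]
members = ["tiktoken-core", "tiktoken-python"]

# tiktoken-python/Cargo.toml (the existing pyo3 bindings) would then depend on
# the pure-Rust core crate by path:
# [dependencies]
# tiktoken-core = { path = "../tiktoken-core" }

As far as I know, crates.io requires path dependencies to also carry a version, so the usual flow would be to publish tiktoken-core first and then point the bindings at the published version.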

DhruvDh avatar Mar 03 '23 17:03 DhruvDh

Can anyone figure out how to replace the Python threading with rayon threading? On lines 140-141 of lib.rs there is a comment where the author mentions that he tried threading with rayon but found it wasn't much faster than Python threads.

I am still learning Rust, so I am having a hard time with this...

smahm006 avatar Mar 14 '23 00:03 smahm006

Can anyone figure out how to replace the Python threading with rayon threading? On lines 140-141 of lib.rs there is a comment where the author mentions that he tried threading with rayon but found it wasn't much faster than Python threads.

I may be mistaken, but see the batch methods here https://github.com/openai/tiktoken/blob/main/tiktoken/core.py

In which case, you would do something like

// Inside `impl CoreBPE` in src/lib.rs; assumes rayon is added as a dependency
// and `use rayon::prelude::*;` is in scope.
pub fn encode_batch(&self, texts: Vec<&str>, allowed_special: HashSet<&str>) -> Vec<Vec<usize>> {
    texts
        .into_par_iter()
        .map(|t| self.encode_native(t, &allowed_special).0)
        .collect()
}

and

pub fn encode_ordinary_batch(&self, texts: Vec<&str>) -> Vec<Vec<usize>> {
    texts
        .into_par_iter()
        .map(|t| self.encode_ordinary_native(t))
        .collect()
}
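
For anyone who hasn't used rayon before, here is a minimal, self-contained sketch of the same into_par_iter pattern, with byte length standing in for the real encode call (illustration only, not tiktoken code):

use rayon::prelude::*;

fn main() {
    let texts = vec!["hello world", "the quick brown fox", "tiktoken"];
    // Parallel map over the batch; each element is processed on rayon's thread pool.
    let lengths: Vec<usize> = texts.into_par_iter().map(|t| t.len()).collect();
    println!("{:?}", lengths);
}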

jremb avatar Mar 24 '23 10:03 jremb

Hi, a question: why is mergeable_ranks downloaded at runtime? Why not ship it in the repo?

def gpt2():
    mergeable_ranks = data_gym_to_mergeable_bpe_ranks(
        vocab_bpe_file="https://openaipublic.blob.core.windows.net/gpt-2/encodings/main/vocab.bpe",
        encoder_json_file="https://openaipublic.blob.core.windows.net/gpt-2/encodings/main/encoder.json",
    )
    return {
        "name": "gpt2",
        "explicit_n_vocab": 50257,
        "pat_str": r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""",
        "mergeable_ranks": mergeable_ranks,
        "special_tokens": {"<|endoftext|>": 50256},
    }

Isn't this a waste of time at runtime? This data should not change, and if it did change, it would no longer be the version that is valid for GPT-2, or at least not the one the library was tested with at the time. Maybe add a newer, tested version and keep the old one but mark it deprecated?
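
For illustration, shipping the ranks in the repo would just mean parsing a local file at load time. Below is a minimal Rust sketch that reads a rank file in the newer .tiktoken format (one base64-encoded token plus its rank per line); note that gpt2 itself uses the older vocab.bpe / encoder.json pair, so this is only an illustration, and the file path and base64 crate usage are assumptions on my part:

use std::collections::HashMap;
use std::fs;

use base64::{engine::general_purpose::STANDARD, Engine as _};

// Build the mergeable ranks from a vendored rank file instead of downloading it.
// Each non-empty line is "<base64-encoded token bytes> <rank>".
fn load_ranks(path: &str) -> HashMap<Vec<u8>, usize> {
    fs::read_to_string(path)
        .expect("rank file checked into the repo")
        .lines()
        .filter(|line| !line.is_empty())
        .map(|line| {
            let (b64, rank) = line.split_once(' ').expect("line: '<base64> <rank>'");
            let token = STANDARD.decode(b64).expect("valid base64");
            (token, rank.parse::<usize>().expect("numeric rank"))
        })
        .collect()
}

fn main() {
    let ranks = load_ranks("data/r50k_base.tiktoken"); // hypothetical local path
    println!("loaded {} mergeable ranks", ranks.len());
}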

Miuler avatar Jul 18 '23 22:07 Miuler

Sorry for the question; I am splitting the code out so the Rust part can be used as a crate, and while looking at a Rust version of the encoder and translating it, this question came up.

Miuler avatar Jul 18 '23 22:07 Miuler