nanoGPT

Is tiktoken compatible with Hugging Face's GPT2Tokenizer?

Open zyzhang1130 opened this issue 1 year ago • 1 comment

For some reason, I need to use the output token ids directly with Hugging Face's GPT-2. I would like to know: are the token ids generated by tiktoken the same as those from GPT2Tokenizer.from_pretrained("gpt2")? In my tests so far there have been no errors, but I want to be sure. Thanks.

zyzhang1130 avatar Apr 01 '23 15:04 zyzhang1130

For my own information, I asked GPT, because I don't understand why people don't ask these kinds of questions to GPT first. Is there any reason this answer does not satisfy you? I don't mean to be rude, I am genuinely curious. Also, why isn't GPT answering these questions automatically, analogous to Copilot in the IDE?

I'm not aware of any package named "tiktoken" in the Python ecosystem. However, if you meant the popular Python package for tokenization of text, "tokenizers", then yes, it is compatible with Hugging Face's GPT2Tokenizer.

Tokenizers provides a fast and efficient implementation of various tokenization algorithms, including byte-pair encoding (BPE) used by GPT-2 and GPT-3 models. Hugging Face's GPT2Tokenizer is built on top of Tokenizers, and provides a user-friendly API for tokenizing text for GPT-2 and other transformer-based models.

You can use Tokenizers to preprocess your data and then use GPT2Tokenizer to tokenize it for GPT-2. Here is an example:

from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from transformers import GPT2Tokenizer

# Train a Tokenizers BPE model on your corpus
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()  # split on whitespace before BPE
trainer = trainers.BpeTrainer(vocab_size=10000, special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"])
tokenizer.train(["your_text_corpus.txt"], trainer=trainer)

# Save the Tokenizers model
tokenizer.save("your_tokenizer")

# Load the Tokenizers model and use it to tokenize text
tokenizer = Tokenizer.from_file("your_tokenizer")
text = "your text to tokenize"
tokenized_text = tokenizer.encode(text)

# Use GPT2Tokenizer to map the Tokenizers output to GPT-2 input ids
gpt2_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
input_ids = gpt2_tokenizer.convert_tokens_to_ids(tokenized_text.tokens)

Note that the number of tokens produced by Tokenizers may be different from the number of tokens expected by GPT-2, so you may need to adjust the input accordingly.

dikkietrom avatar May 04 '23 08:05 dikkietrom

@zyzhang1130 Yes, they should be compatible:

import tiktoken
from transformers import GPT2TokenizerFast

encoding = tiktoken.encoding_for_model("gpt2")
tiktoken_vec = encoding.encode("Picasso")

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
huggingface_vec = tokenizer.encode("Picasso")

Both tiktoken_vec and huggingface_vec will be the same in the above code.

christian-vorhemus avatar Jun 11 '23 19:06 christian-vorhemus

Thanks, Christian, for shedding some light instead of just giving my answer a thumbs-down, as someone felt the need to do, perhaps with the wild idea that that helps anyone in any way.

dikkietrom avatar Jun 18 '23 00:06 dikkietrom