tokenizers.models.BPE loses whitespace with GPT-2 pretrained vocab & merges
It's not clear how (or if) tokenizers.models.BPE is meant to be used with the GPT-2 pretrained vocabulary and merges. We couldn't find an answer in the API documentation, so we developed an ugly hack instead. Switching from GPT2Tokenizer to BPE was necessary in order to use the BPE dropout feature, so we would like to know whether there is a recommended way to do this.
import tokenizers

# Load the pretrained GPT-2 vocab and merges straight into a BPE model;
# note that no pre-tokenizer is attached to the Tokenizer here.
bpe_vocab, bpe_merges = tokenizers.models.BPE.read_file("./data/gpt2-vocab.json", "./data/gpt2-merges.txt")
bpe_tokenizer = tokenizers.Tokenizer(tokenizers.models.BPE(bpe_vocab, bpe_merges))
print(bpe_tokenizer.encode("abc abc").ids)
Actual Result: [39305, 39305] => 'abcabc'
Expected Result: [39305, 450, 66] => 'abc abc'
Workaround: Pre-processing strings with GPT2Tokenizer's regex (https://github.com/huggingface/transformers/blob/52d2e6f6e904ef9b75c78716ce77b98196ed837a/src/transformers/models/gpt2/tokenization_gpt2.py#L194) and its bytes_to_unicode lookup table (https://github.com/huggingface/transformers/blob/52d2e6f6e904ef9b75c78716ce77b98196ed837a/src/transformers/models/gpt2/tokenization_gpt2.py#L66).
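For reference, the hack looks roughly like this (a sketch; the regex pattern and bytes_to_unicode come from the linked transformers source, and the regex package is needed for the \p{L} classes):

import regex as re
from transformers.models.gpt2.tokenization_gpt2 import bytes_to_unicode

byte_encoder = bytes_to_unicode()  # maps each byte 0-255 to a printable character
pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")

def to_byte_level(text):
    # Split with GPT-2's regex, then map each UTF-8 byte to its printable
    # stand-in, so that " abc" becomes "Ġabc" before it reaches the BPE model.
    return ["".join(byte_encoder[b] for b in piece.encode("utf-8")) for piece in pat.findall(text)]

ids = [i for piece in to_byte_level("abc abc") for i in bpe_tokenizer.encode(piece).ids]
print(ids)  # [39305, 450, 66] with the stock GPT-2 vocab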
Hi @umbra-scientia.
tl;dr: you can probably just take the tokenizer.json file from gpt2 (https://huggingface.co/gpt2/raw/main/tokenizer.json) and load it directly from file.
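For example (a sketch, assuming you have downloaded that file locally as tokenizer.json):

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")
print(tokenizer.encode("abc abc").ids)  # [39305, 450, 66], the expected result above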
The long answer:
tokenizers isn't really modeled after gpt2 or any other model, and is meant to be used quite differently. You should be able to express the same behavior as gpt2 easily, though.
In tokenizers you assemble components to build your tokenizer, and it happens that we have the right components to recreate almost all the tokenizers used in transformers.
If you simply want to run gpt2, I recommend using transformers and doing
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
If you want to train a gpt2-like tokenizer with this library, then you can assemble it this way:
from tokenizers import (
    Tokenizer,
    AddedToken,
    normalizers,
    pre_tokenizers,
    models,
    decoders,
    trainers,
    processors,
)

# Placeholder settings; vocab and merges can come from files,
# e.g. the ones from the report above.
vocab, merges = models.BPE.read_file("./data/gpt2-vocab.json", "./data/gpt2-merges.txt")
dropout = None  # e.g. 0.1 to enable BPE dropout at encode time
continuing_subword_prefix = ""
end_of_word_suffix = ""
add_prefix_space = False
trim_offsets = True

tokenizer = Tokenizer(
    models.BPE(
        vocab,
        merges,
        dropout=dropout,
        continuing_subword_prefix=continuing_subword_prefix or "",
        end_of_word_suffix=end_of_word_suffix or "",
    )
)
tokenizer.normalizer = normalizers.Sequence([normalizers.NFKD()])
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=add_prefix_space)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=trim_offsets)
ByteLevel is the gpt2 trick that maps every byte to a printable character.
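With the ByteLevel pre-tokenizer in place, the whitespace from the original report survives, and BPE dropout (the feature that motivated the switch) is just the dropout argument on models.BPE. A quick check against the assembly above (a sketch; the ids assume the stock GPT-2 vocab, per the expected result in the report):

print(tokenizer.encode("abc abc").ids)  # [39305, 450, 66] => 'abc abc'

# To turn dropout on afterwards, swap in a model built with dropout set
# (BPE.from_file forwards extra kwargs to the BPE constructor):
tokenizer.model = models.BPE.from_file(
    "./data/gpt2-vocab.json", "./data/gpt2-merges.txt", dropout=0.1
)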
Every option should also be readable from the JSON file here: https://huggingface.co/gpt2/raw/main/tokenizer.json
Each component should be detailed here: https://huggingface.co/docs/tokenizers/python/latest/components.html