Issue with space tokens + BPE tokenizer
I'm attempting to encode runs of consecutive spaces as special tokens (to improve compressibility for documents with many spaces, e.g. code), but am running into an issue.
My tokenizer file is here, and below is some code to reproduce the issue:
from tokenizers import Tokenizer
import os

def assert_equal(s):
    # Round-trip s through the tokenizer and check that the space count survives
    x = t.decode(t.encode(s).ids).strip()
    n_spaces_in = s.count(" ")
    n_spaces_out = x.count(" ")
    print("-" * 80)
    print(s)
    print(x)
    print("-" * 80)
    assert n_spaces_in == n_spaces_out, f"{n_spaces_in} -> {n_spaces_out}"

if __name__ == "__main__":
    if not os.path.exists("pile_tokenizer.json"):
        os.system("wget http://eaidata.bmk.sh/data/pile_tokenizer.json")
    t = Tokenizer.from_file("pile_tokenizer.json")
    for i in range(1, 100):
        assert_equal(f"hello{' ' * i}world")
Specifically, when there's an even number of spaces between two words, the number of spaces in the decoded output is rounded up to an odd number.
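For what it's worth, inspecting the produced tokens makes the behaviour easier to see (a minimal sketch, assuming pile_tokenizer.json has already been downloaded as above):

from tokenizers import Tokenizer

t = Tokenizer.from_file("pile_tokenizer.json")
enc = t.encode("hello    world")  # four spaces between the words
# ByteLevel renders each space as the marker character 'Ġ', so the token
# list shows exactly how the run of spaces was segmented
print(enc.tokens)
print(t.decode(enc.ids))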
The tokenizer was trained like so:
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, processors, trainers
from tokenizers.normalizers import NFC

if special_tokens is None:
    # space-run special tokens (the consecutive spaces collapse in this rendering)
    special_tokens = ["<|endoftext|>", "<|padding|>", " ", " "]

model = models.BPE()
tokenizer = Tokenizer(model)

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)
tokenizer.normalizer = NFC()

trainer = trainers.BpeTrainer(vocab_size=vocab_size, special_tokens=special_tokens)
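For completeness, training would then be kicked off along these lines (the corpus file and output path here are placeholders, not from the original script):

tokenizer.train(files=["train.txt"], trainer=trainer)  # placeholder corpus
tokenizer.save("pile_tokenizer.json")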
Any help would be much appreciated.
Hi @sdtblck ,
Thanks for the script, the issue was very easy to reproduce.
It turns out your tokenizer was trained with add_prefix_space, which adds a space in front of words when decoding.
If you change every add_prefix_space to false within your tokenizer.json, everything should work as intended.
Does that work?
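If you'd rather script the change than edit the file by hand, something like this should do it (a sketch that assumes add_prefix_space sits at the top level of each component in tokenizer.json; a nested Sequence pre-tokenizer would need a recursive walk):

import json

with open("pile_tokenizer.json") as f:
    cfg = json.load(f)

# Flip the flag on every component that carries it
for key in ("pre_tokenizer", "decoder", "post_processor"):
    component = cfg.get(key)
    if isinstance(component, dict) and "add_prefix_space" in component:
        component["add_prefix_space"] = False

with open("pile_tokenizer.json", "w") as f:
    json.dump(cfg, f, indent=2)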
Ah ok, I think I misunderstood what the 'add_prefix_space' option does. I had assumed it controlled whether the space between words was attached to the beginning or the end of a token, but it looks like it just controls whether a prefix space is added at the beginning of a document?
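The effect is easy to see on the pre-tokenizer directly (a minimal sketch; pre_tokenize_str is available in recent versions of tokenizers):

from tokenizers import pre_tokenizers

with_prefix = pre_tokenizers.ByteLevel(add_prefix_space=True)
without_prefix = pre_tokenizers.ByteLevel(add_prefix_space=False)

# With the flag on, the first word also gets the byte-level space
# marker 'Ġ', as if the text started with a space
print(with_prefix.pre_tokenize_str("hello world"))
print(without_prefix.pre_tokenize_str("hello world"))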
Like most options in this library, this one exists to emulate previously published work that behaves in a certain way, and it was needed to behave in exactly the same way here; most of these options are overridable, like this flag.