
Issue with space tokens + BPE tokenizer

Open sdtblck opened this issue 4 years ago • 3 comments

I'm attempting to encode runs of consecutive spaces as special tokens (to improve compression for documents with many spaces, e.g. code), but I'm running into some issues.

My tokenizer file is here, and below is some code to reproduce the issue:

from tokenizers import Tokenizer
import os


def assert_equal(s):
    # Round-trip s through the global tokenizer `t` and check that the space count survives.
    x = t.decode(t.encode(s).ids).strip()
    n_spaces_in = s.count(" ")
    n_spaces_out = x.count(" ")
    print("-" * 80)
    print(s)
    print(x)
    print("-" * 80)

    assert n_spaces_in == n_spaces_out, f"{n_spaces_in} -> {n_spaces_out}"


if __name__ == "__main__":

    if not os.path.exists("pile_tokenizer.json"):
        os.system("wget http://eaidata.bmk.sh/data/pile_tokenizer.json")

    t = Tokenizer.from_file("pile_tokenizer.json")

    for i in range(1, 100):
        assert_equal(f"hello{' '*i}world")

Specifically, when there's an even number of spaces between two words, the number of spaces in the decoded output is rounded up to the next odd number.

The tokenizer was trained like so:

    from tokenizers import Tokenizer, decoders, models, pre_tokenizers, processors, trainers
    from tokenizers.normalizers import NFC

    # special_tokens and vocab_size come from the surrounding training function's arguments
    if special_tokens is None:
        special_tokens = ["<|endoftext|>", "<|padding|>", "    ", "  "]
    model = models.BPE()
    tokenizer = Tokenizer(model)

    # Customize pre-tokenization and decoding
    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
    tokenizer.decoder = decoders.ByteLevel()
    tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)
    tokenizer.normalizer = NFC()

    trainer = trainers.BpeTrainer(vocab_size=vocab_size, special_tokens=special_tokens)
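(The actual call to train is omitted above; it's roughly the following, with placeholder file paths:)

    # Placeholder corpus paths; the real training data isn't shown in the snippet above.
    tokenizer.train(files=["data/part_00.txt", "data/part_01.txt"], trainer=trainer)
    tokenizer.save("pile_tokenizer.json")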

Any help would be much appreciated

sdtblck · Oct 21 '21

Hi @sdtblck ,

Thanks for the reproducible script, it made this very easy to reproduce. It turns out your tokenizer was trained with add_prefix_space enabled, which adds a prefix space to words when decoding.

If you set every add_prefix_space to false within your tokenizer.json, everything should work as intended; that flag is what adds the extra space in front of words when decoding.
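For example, here's a minimal sketch (assuming the flag appears under the pre_tokenizer, post_processor and decoder sections of the serialized file) that flips every occurrence in place:

    import json

    with open("pile_tokenizer.json") as f:
        config = json.load(f)

    def disable_prefix_space(node):
        # Walk the whole JSON tree and set every add_prefix_space to false,
        # wherever it appears (pre_tokenizer, post_processor, decoder, ...).
        if isinstance(node, dict):
            if "add_prefix_space" in node:
                node["add_prefix_space"] = False
            for value in node.values():
                disable_prefix_space(value)
        elif isinstance(node, list):
            for value in node:
                disable_prefix_space(value)

    disable_prefix_space(config)

    with open("pile_tokenizer.json", "w") as f:
        json.dump(config, f, ensure_ascii=False)

After reloading the file with Tokenizer.from_file, the round-trip in your script should keep the space counts unchanged.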

Does that work?

Narsil · Oct 22 '21

Ah ok, I think I misunderstood what the add_prefix_space option does. I had assumed it controlled whether the space between words was attached to the beginning or the end of a token, but it looks like it just controls whether a prefix space is added to the beginning of a document?
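Something like this (a rough sketch, just poking at the ByteLevel pre-tokenizer on its own) seems consistent with that reading:

    from tokenizers import pre_tokenizers

    # With the flag on, a space is prepended before pre-tokenization, so the
    # first word also gets the byte-level space marker "Ġ" in front of it.
    print(pre_tokenizers.ByteLevel(add_prefix_space=True).pre_tokenize_str("hello world"))
    # first piece is "Ġhello"

    print(pre_tokenizers.ByteLevel(add_prefix_space=False).pre_tokenize_str("hello world"))
    # first piece is plain "hello"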

sdtblck · Oct 24 '21

Like most options in this library, this one exists to emulate previously published work that behaves in a certain way, and the library needs to behave in exactly the same way here. Most of these options are overridable, like this flag.

Narsil · Oct 25 '21

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] · Mar 11 '24