
Set `add_prefix_space = False` for existing pre-trained tokenizers

Open cyk1337 opened this issue 1 year ago • 7 comments

I would like to add special tokens to an existing (pre-trained) tokenizer, where the added tokens are not whitespace-separated from the surrounding tokens. The decoded string then contains an additional whitespace ahead of the word start position, which I guess is due to the add_prefix_space = True configuration. How can I disable this feature (add_prefix_space) to remove the prefix space between words?

For instance, after adding a special token <special:0> into the spm (Unigram LM) tokenizer, it splits the text "<special:0>word1 word2" into ["<special:0>", "▁word1", "▁word2"]. After decoding, the resulting sequence is "<special:0> word1 word2", with an additional space between "<special:0>" and "word1". Any solutions to handle this other than post-processing? @Narsil
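
A minimal sketch of the behaviour (the file path is hypothetical and the exact sub-word tokens depend on the actual tokenizer; using the tokenizers Python bindings):

from tokenizers import Tokenizer

# Hypothetical path: any saved spm/Unigram-style fast tokenizer behaves similarly.
tok = Tokenizer.from_file("tokenizer.json")
tok.add_special_tokens(["<special:0>"])

enc = tok.encode("<special:0>word1 word2")
print(enc.tokens)
# e.g. ['<special:0>', '▁word1', '▁word2']
print(tok.decode(enc.ids, skip_special_tokens=False))
# e.g. '<special:0> word1 word2'  <- extra space after the special token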

cyk1337 avatar Aug 22 '22 04:08 cyk1337

Hi @cyk1337 ,

The extra space is not added by the added token, but by "▁word1". You can try playing with add_prefix_space (if you are unsure how to do it, editing the saved tokenizer JSON file directly might help ensure your modifications are actually taken into account).
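
A minimal sketch of that direct JSON edit (assuming the pre_tokenizer is exported as a single Metaspace entry, like the T5 export quoted later in this thread; the nesting differs if it is a Sequence, and the file paths are hypothetical):

import json

with open("tokenizer.json", encoding="utf-8") as f:
    config = json.load(f)

# Flip the flag on the Metaspace pre-tokenizer.
config["pre_tokenizer"]["add_prefix_space"] = False

with open("tokenizer_no_prefix.json", "w", encoding="utf-8") as f:
    json.dump(config, f, ensure_ascii=False, indent=2)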

But if you do that, then "word1" might be a different token than before, and the model might have trouble with this new tokenized string (it depends on whether you are retraining/finetuning, but you should be careful about this token change).

If you're purely looking at a decoding problem that occurs within an original string, I recommend you use offsets instead. Offsets tell you where each token came from in the original string, so you will be able to detect that the extra space was added and avoid putting it back.

decode should in general be reserved, as much as possible, for purely generated tokens; your example seems to suggest this is not the case.
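
A minimal sketch of the offsets approach (the file path is hypothetical and the special token is the one from your example; tokenizers Python bindings):

from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")  # hypothetical path
tok.add_special_tokens(["<special:0>"])

text = "<special:0>word1 word2"
enc = tok.encode(text)

for token, (start, end) in zip(enc.tokens, enc.offsets):
    # text[start:end] is the original substring this token maps to,
    # so the '▁' injected by add_prefix_space never appears here.
    print(token, "->", repr(text[start:end]))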

Narsil avatar Aug 22 '22 08:08 Narsil

Hi @Narsil, thank you for the prompt reply. Yes, the extra space is induced by the "▁" in "▁word1". Thank you for providing the solution for decoding purposes with return_offsets_mapping ;)

Indeed, the added tokens are for retraining/finetuning. After setting add_prefix_space=False for the pre_tokenizer, it seems to work. By "token change", do you mean that after disabling add_prefix_space, the original "▁word1" becomes "word1", which may cause the resulting tokens to be (for example) ["word", "1"] instead of ["▁w", "ord", "1"]?
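
For reference, a minimal sketch of that change via the Python API (a sketch only; the constructor argument names match the tokenizers releases from around this time, and the file paths are hypothetical):

from tokenizers import Tokenizer, pre_tokenizers

tok = Tokenizer.from_file("tokenizer.json")  # hypothetical path

# Swap in a Metaspace pre-tokenizer that does not prepend '▁' at the start
# of each chunk it receives.
tok.pre_tokenizer = pre_tokenizers.Metaspace(replacement="▁", add_prefix_space=False)
tok.save("tokenizer_no_prefix.json")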

Additional question:

  • What is the difference between add_prefix_space in the pre_tokenizer and in the decoder in the JSON config file?

cyk1337 avatar Aug 22 '22 09:08 cyk1337

By "token change", do you mean that after disabling add_prefix_space, the original "▁word1" becomes "word1", which may cause the resulting tokens to be (for example) ["word", "1"] instead of ["▁w", "ord", "1"]?

Exactly. So if you're training/retraining you will force your model to learn about this, but it does imply a shift in model weights. Not sure how relevant my comment is, as I haven't encountered this issue myself, but I would be vigilant about "bad" results, as they might stem from this.

What is the difference between add_prefix_space in the pre_tokenizer and in the decoder in the JSON config file?

In the pre_tokenizer it is intended for compatibility with the original SPM implementation, which does this all the time: the start of a string has to be understood as a new "word", so SPM adds an extra space in front. That's what this flag replicates here.
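
A quick standalone sketch of what that flag does at the pre-tokenization step (the Metaspace component can be used on its own, no vocabulary needed; the outputs in the comments are what I would expect):

from tokenizers import pre_tokenizers

text = "Hello world"
with_prefix = pre_tokenizers.Metaspace(replacement="▁", add_prefix_space=True)
without_prefix = pre_tokenizers.Metaspace(replacement="▁", add_prefix_space=False)

# With the flag on, the string start is treated as a word boundary and gets '▁';
# with it off, only the real spaces become '▁'.
print([piece for piece, _ in with_prefix.pre_tokenize_str(text)])
# expected: ['▁Hello', '▁world']
print([piece for piece, _ in without_prefix.pre_tokenize_str(text)])
# expected: ['Hello', '▁world']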

In the decoder I am not sure where you are seeing it, but it's probably used as a heuristic to produce a string that attempts to look like what you want.

Narsil avatar Aug 22 '22 09:08 Narsil

Exactly. So if you're training/retraining you will force your model to learn about this, but it does imply a shift in model weights.

Yes, simply changing how the text is tokenized changes the model inputs, possibly causing unexpected results. From my understanding, add_prefix_space makes the first word at the beginning of a sentence (without whitespace to its left) tokenize as if it appeared inside a sentence (with a whitespace). So would adding a whitespace ahead of the sentence to be tokenized be a possible solution to keep the input tokens unchanged after disabling add_prefix_space?

Taking the string "Huggingface tokenizer library." for example, the difference would be as follows:

x="Huggingface tokenizer library."
# add_prefix_space=True
-> ['▁Hu', 'gging', 'face', '▁', 'token', 'izer', '▁library', '.', '</s>']
# add_prefix_space=False
-> ['Hu', 'gging', 'face', '▁', 'token', 'izer', '▁library', '.', '</s>']

x=" "+x
# add_prefix_space=False
-> ['▁Hu', 'gging', 'face', '▁', 'token', 'izer', '▁library', '.', '</s>']
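
i.e. something along these lines (a sketch, assuming a hypothetical tokenizer re-saved with add_prefix_space=False):

from tokenizers import Tokenizer

# Hypothetical file re-saved with add_prefix_space=False on the pre_tokenizer.
tok_no_prefix = Tokenizer.from_file("tokenizer_no_prefix.json")

text = "Huggingface tokenizer library."
# Prepending a space makes the first word look like a mid-sentence word again,
# so the tokens should match the original add_prefix_space=True output above.
print(tok_no_prefix.encode(" " + text).tokens)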

Not sure how relevant my comment is, as I haven't encountered this issue myself, but I would be vigilant about "bad" results, as they might stem from this.

The actual need is transfer learning while preserving the model's ability on the original domain, where the two domains have different vocabularies. Another possible solution, I think, would be to keep everything unchanged (add_prefix_space=True) for the original domain and set add_prefix_space=False for the transferred target domain.

In the decoder I am not sure where you are seeing it, but it's probably used as a heuristic to produce a string that attempts to look like what you want.

The decoder setting is exported by the T5 tokenizer (tokenizer.json). Changing it seems to have no effect on my problem. Not sure how it works.

 "decoder": {
    "type": "Metaspace",
    "replacement": "▁",
    "add_prefix_space": true
  }

cyk1337 avatar Aug 22 '22 09:08 cyk1337

So would adding a whitespace ahead of the sentence to be tokenized be a possible solution to keep the input tokens unchanged after disabling add_prefix_space?

That would work, but it would have to be handled by you. Again, if you can get away with offsets, you don't need to make any modifications.

The trick is that special_tokens are treated AHEAD of the rest of the string, so "some<A>thing" actually gets interpreted as ["some", "thing"] before it even reaches the pre_tokenizer, so you need some trickier mechanism if you want to deal with extra spaces yourself. You can always go the custom pre_tokenizer route, but again, I think offsets really solve your issue.
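
A sketch of that splitting behaviour (the exact sub-word tokens depend on the vocabulary and the file path is hypothetical; the point is the chunking around the added token):

from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")  # hypothetical path
tok.add_special_tokens(["<A>"])

# The added token is matched first, so the pre_tokenizer only ever sees the
# chunks "some" and "thing" -- and with add_prefix_space=True, the start of
# each chunk is treated as a word boundary.
print(tok.encode("some<A>thing").tokens)
# e.g. ['▁some', '<A>', '▁thing']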

Not sure how it works.

I don't remember off the top of my head either; I think it's really to decide whether to add an extra space during decode or not.
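
A small standalone sketch of that, for anyone curious (argument names per the tokenizers releases from around this time; the commented outputs are what I would expect):

from tokenizers import decoders

tokens = ["▁word1", "▁word2"]
with_prefix = decoders.Metaspace(replacement="▁", add_prefix_space=True)
without_prefix = decoders.Metaspace(replacement="▁", add_prefix_space=False)

# Both turn '▁' back into spaces; with add_prefix_space=True the decoder also
# strips the single leading space it assumes was added at encoding time.
print(repr(with_prefix.decode(tokens)))     # expected: 'word1 word2'
print(repr(without_prefix.decode(tokens)))  # expected: ' word1 word2'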

Narsil avatar Aug 22 '22 10:08 Narsil

The trick is that special_tokens are treated AHEAD of the rest of the string, so "some<A>thing" actually gets interpreted as ["some", "thing"] before it even reaches the pre_tokenizer, so you need some trickier mechanism if you want to deal with extra spaces yourself. You can always go the custom pre_tokenizer route, but again, I think offsets really solve your issue.

Any guide on how to use offsets to solve this problem?

cyk1337 avatar Aug 22 '22 11:08 cyk1337

Here: https://huggingface.co/docs/tokenizers/main/en/quicktour#using-the-tokenizer

Narsil avatar Aug 22 '22 12:08 Narsil