
`PreTrainedTokenizer` (slow) strips tokens that are around `unique_no_split_tokens`

Open Gompyn opened this issue 2 years ago • 11 comments

System Info

  • transformers version: 4.24.0
  • Platform: Linux-5.4.0-135-generic-x86_64-with-glibc2.31
  • Python version: 3.10.8
  • Huggingface_hub version: 0.11.1
  • PyTorch version (GPU?): 1.13.1 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help?

@ArthurZucker

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction

Steps to reproduce the behavior:

  1. Load a PreTrainedTokenizer that has unique_no_split_tokens, e.g. EleutherAI/gpt-j-6B:
     tokenizer = transformers.GPT2Tokenizer.from_pretrained('EleutherAI/gpt-j-6B')
  2. Use the tokenizer to encode a string that contains one of the unique_no_split_tokens, e.g. " <|extratoken_1|> " (a consolidated comparison script is shown below):
     print(tokenizer(" <|extratoken_1|> ").input_ids)
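The steps above can be consolidated into the following comparison script (a minimal sketch; the ids in the comments are the ones reported further down in this issue, not outputs verified here):

import transformers

# Slow (pure-Python) tokenizer -- the one showing the problem
slow = transformers.GPT2Tokenizer.from_pretrained("EleutherAI/gpt-j-6B")
# Fast (Rust-backed) tokenizer -- the reference behavior
fast = transformers.GPT2TokenizerFast.from_pretrained("EleutherAI/gpt-j-6B")

print(slow(" <|extratoken_1|> ").input_ids)  # reported: [50257] -- the surrounding spaces are dropped
print(fast(" <|extratoken_1|> ").input_ids)  # reported: [220, 50257, 220]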

Expected behavior

The tokenizer splits the string into 3 tokens (" ", "<|extratoken_1|>" and " "), and gives their ids ([220, 50257, 220]). This is the behavior of PreTrainedTokenizerFast.

But the actual behavior is that the PreTrainedTokenizer only gives the id of "<|extratoken_1|>", i.e. [50257].

Gompyn avatar Jan 14 '23 06:01 Gompyn

This is probably due to the following lines, which are still not fixed at HEAD.

https://github.com/huggingface/transformers/blob/f58248b8240571bbbb0918ddd36cc3fdf061df11/src/transformers/tokenization_utils.py#L532-L537

Gompyn avatar Jan 14 '23 06:01 Gompyn
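For context, here is a self-contained illustration of what the linked default branch does; this is a paraphrase for illustration only (split_with_default_stripping is a hypothetical helper, not the library source). For a no-split token that was not registered as an AddedToken, the slow tokenizer strips whitespace from the text segments on both sides of the token:

def split_with_default_stripping(text: str, special: str) -> list[str]:
    # Mimics the default "strip left and right" behavior around a no-split token.
    parts = text.split(special)
    segments = []
    for i, part in enumerate(parts):
        if i > 0:
            segments.append(special)
            part = part.lstrip()   # text to the right of the token loses its leading whitespace
        if i < len(parts) - 1:
            part = part.rstrip()   # text to the left of the token loses its trailing whitespace
        if part:
            segments.append(part)
    return segments

print(split_with_default_stripping(" <|extratoken_1|> ", "<|extratoken_1|>"))
# -> ['<|extratoken_1|>'] : the surrounding spaces (or "\n") are gone before BPE even runs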

This bug strips away \n around my special token, making my model believe that there is no newline in my text.

Gompyn avatar Jan 14 '23 06:01 Gompyn

@ArthurZucker I can pick this up. Let me know what the possible fix should be?

raghavanone avatar Jan 22 '23 12:01 raghavanone

There is indeed a discrepancy between the fast and slow versions. The problem here is that the tokens are part of the no_split_tokens, but they are not AddedTokens. I am not really sure whether the fast or the slow one has the correct behavior 😅

ArthurZucker avatar Jan 23 '23 15:01 ArthurZucker

The cleanest way is to have the tokens as AddedTokens, because then you can control the rstrip and lstrip arguments.

ArthurZucker avatar Jan 23 '23 15:01 ArthurZucker
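For illustration, a minimal sketch of that suggestion ("<|mytoken|>" is a hypothetical token; transformers.AddedToken takes lstrip/rstrip flags that control whitespace stripping around the token). Whether this fully removes the slow/fast discrepancy on the affected versions is not confirmed in this thread:

from transformers import AddedToken, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Register the token as an AddedToken and ask it not to strip surrounding whitespace
tokenizer.add_tokens([AddedToken("<|mytoken|>", lstrip=False, rstrip=False)])

print(tokenizer.tokenize(" <|mytoken|> "))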

@ArthurZucker I think decode(encode(text)) == text should be true by default, because some use cases (e.g. code generation) require the formatting of the text to be preserved. "Automatic formatting" should not be enabled by default, to avoid breaking such use cases. From another point of view, I would guess that most pre-trained models use a fast tokenizer (as the name fast implies), so these models also expect the behavior of the fast version.

Gompyn avatar Jan 23 '23 17:01 Gompyn

> I think decode(encode(text)) == text should be true by default

This is untrue for pretty much all tokenizers, since tokenization is a destructive operation. At the very least you get back the normalized text (with some minimal Unicode cleanup), but for some tokenizers like BERT you will have whitespace simplified or the text lowercased.

sgugger avatar Jan 23 '23 18:01 sgugger

> I think decode(encode(text)) == text should be true by default

> This is untrue for pretty much all tokenizers, since tokenization is a destructive operation. At the very least you get back the normalized text (with some minimal Unicode cleanup), but for some tokenizers like BERT you will have whitespace simplified or the text lowercased.

I agree that minimal Unicode cleanup is acceptable (mostly because it does not break my use cases), but whitespace simplification and lowercasing are not enabled by default, so by default users do get a mostly conservative tokenizer. However, the simplest way to add new tokens (add_tokens('mytoken'), with special_tokens=False by default) in a slow tokenizer accidentally (from the user's point of view) breaks this conservative behavior, and I think that is unexpected.

Gompyn avatar Jan 24 '23 01:01 Gompyn
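A round-trip check along these lines (the input string is a made-up code-generation example; the False result is what the stripping behavior reported above implies, not a verified output):

import transformers

slow = transformers.GPT2Tokenizer.from_pretrained("EleutherAI/gpt-j-6B")

# Hypothetical code-generation input where the newlines around the special token matter
text = "def f():\n    <|extratoken_1|>\n    return 1\n"

ids = slow(text).input_ids
roundtrip = slow.decode(ids)

# With the stripping described in this issue, the whitespace/newlines around
# <|extratoken_1|> are dropped, so the round trip is not exact
print(roundtrip == text)  # expected: False on affected versions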

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Feb 17 '23 15:02 github-actions[bot]

Is there any progress on this issue? @ArthurZucker

Gompyn avatar Feb 18 '23 16:02 Gompyn

Not yet! I finally have time so this week should be good!

ArthurZucker avatar Feb 20 '23 07:02 ArthurZucker

Is there any progress on this issue?

Gompyn avatar Apr 10 '23 09:04 Gompyn

Hey, to follow progress I suggest you check #23909, which should address this.

ArthurZucker avatar Jun 01 '23 11:06 ArthurZucker

Quick update: this is going to take a bit more time, as a more in-depth refactoring is needed.

ArthurZucker avatar Jun 26 '23 04:06 ArthurZucker

PR will be merged this week! 🤗

ArthurZucker avatar Aug 16 '23 07:08 ArthurZucker