`PreTrainedTokenizer` (slow) strips tokens that are around `unique_no_split_tokens`
System Info
- `transformers` version: 4.24.0
- Platform: Linux-5.4.0-135-generic-x86_64-with-glibc2.31
- Python version: 3.10.8
- Huggingface_hub version: 0.11.1
- PyTorch version (GPU?): 1.13.1 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: no
- Using distributed or parallel set-up in script?: no
Who can help?
@ArthurZucker
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
Steps to reproduce the behavior:
- Load a `PreTrainedTokenizer` that contains `unique_no_split_tokens`, e.g. `EleutherAI/gpt-j-6B`:
  `tokenizer = transformers.GPT2Tokenizer.from_pretrained('EleutherAI/gpt-j-6B')`
- Use the tokenizer to split a string that contains one of the `unique_no_split_tokens`, e.g. `" <|extratoken_1|> "` (a combined runnable snippet is sketched below):
  `print(tokenizer(" <|extratoken_1|> ").input_ids)`
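Here is a self-contained version of the two steps above. The fast tokenizer class is loaded from the same checkpoint purely for comparison (assuming it converts from the slow tokenizer files); the ids in the comments are the ones reported in this issue.

```python
# Reproduction sketch: compare the slow and fast tokenizers on the same input.
import transformers

slow = transformers.GPT2Tokenizer.from_pretrained("EleutherAI/gpt-j-6B")
fast = transformers.GPT2TokenizerFast.from_pretrained("EleutherAI/gpt-j-6B")

text = " <|extratoken_1|> "
print(slow(text).input_ids)  # [50257]           <- surrounding spaces are dropped (this issue)
print(fast(text).input_ids)  # [220, 50257, 220] <- surrounding spaces are kept
```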
Expected behavior
The tokenizer splits the string into 3 tokens (`" "`, `"<|extratoken_1|>"` and `" "`), and gives their ids (`[220, 50257, 220]`). This is the behavior of `PreTrainedTokenizerFast`.

But the actual behavior is that the `PreTrainedTokenizer` only gives the id of `"<|extratoken_1|>"`, i.e. `50257`.
This is probably due to the following lines, which are still not fixed at HEAD.
https://github.com/huggingface/transformers/blob/f58248b8240571bbbb0918ddd36cc3fdf061df11/src/transformers/tokenization_utils.py#L532-L537
This bug strips away `\n` around my special token, making my model believe that there is no newline in my text.
@ArthurZucker I can pick this up. Let me know what a possible fix should be?
There is indeed a discrepancy between the `fast` and `slow` versions.
The problem here is that the tokens are indeed part of the `no_split_tokens`, but they are not `AddedToken`s.
I am not really sure whether the `fast` or the `slow` tokenizer has the correct behavior 😅
The cleanest way is to have the tokens as `AddedToken`s, because then you can handle the `rstrip` and `lstrip` arguments.
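A minimal sketch of that suggestion (the token string below is illustrative, and whether the slow tokenizer actually honours these flags for non-special added tokens is exactly the discrepancy being discussed here):

```python
# Sketch: register a token explicitly as an AddedToken, which carries the
# lstrip/rstrip flags describing whether whitespace around it may be stripped.
from transformers import AddedToken, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("EleutherAI/gpt-j-6B")
tokenizer.add_tokens(
    [AddedToken("<|my_token|>", lstrip=False, rstrip=False)],  # i.e. do not strip surrounding whitespace
    special_tokens=False,
)
```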
@ArthurZucker I think `decode(encode(text)) == text` should be true by default, because some use cases (e.g. code generation) require the correct formatting of text. "Automatic formatting" should not be done by default, to avoid breaking such use cases.
From another point of view, I guess most pre-trained models use a fast tokenizer (as the name `fast` implies), so these models also expect the behavior of the fast version.
> I think `decode(encode(text)) == text` should be true by default
This is untrue for pretty much all tokenizers, since tokenization is a destructive operation. At the very least you get back the normalized text (with some minimal unicode clean up) but for some tokenizers like BERT you will have whitespace simplified or text lowercased.
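As a quick illustration of how lossy this can be (using an uncased BERT checkpoint, not the tokenizer from this issue):

```python
# BERT example: lowercasing and whitespace simplification make the round trip lossy.
from transformers import BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")
print(tok.decode(tok("Hello   World\n").input_ids))
# -> "[CLS] hello world [SEP]": lowercased, whitespace collapsed, special tokens added
```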
> This is untrue for pretty much all tokenizers, since tokenization is a destructive operation. At the very least you get back the normalized text (with some minimal unicode clean up) but for some tokenizers like BERT you will have whitespace simplified or text lowercased.
I agree that minimal unicode clean-up is acceptable (mostly because that does not break my use cases), but whitespace simplification and text lowercasing are not enabled by default, so by default users do get a mostly conservative tokenizer.
But the simplest way to add new tokens (`add_tokens('mytoken')`, with `special_tokens=False` by default) in a slow tokenizer accidentally (from the user's point of view) breaks this conservative behavior, and I think this is unexpected for users.
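For concreteness, a round-trip sketch of that failure mode (checkpoint and token name are illustrative; per the reports above, the slow tokenizer drops the whitespace around the newly added token at encode time):

```python
# Round-trip check after a plain add_tokens() call on a slow tokenizer.
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.add_tokens("mytoken")  # special_tokens=False by default

text = "a\nmytoken\nb"
roundtrip = tokenizer.decode(tokenizer(text).input_ids)
print(repr(roundtrip))    # the newlines around "mytoken" are gone (per this issue)
print(roundtrip == text)  # False with the slow tokenizer
```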
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Is there any progress on this issue? @ArthurZucker
Not yet! I finally have time so this week should be good!
Is there any progress on this issue?
Hey, to follow progress I suggest you check #23909, which should try to address this.
Quick update, this is gonna take a bit more time as a more in-depth refactoring is needed
PR will be merged this week! 🤗