
`PreTrainedTokenizer` (slow) strips tokens that are around `unique_no_split_tokens`

Open Gompyn opened this issue 2 years ago • 11 comments

System Info

  • transformers version: 4.24.0
  • Platform: Linux-5.4.0-135-generic-x86_64-with-glibc2.31
  • Python version: 3.10.8
  • Huggingface_hub version: 0.11.1
  • PyTorch version (GPU?): 1.13.1 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help?

@ArthurZucker

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction

Steps to reproduce the behavior:

  1. Load a PreTrainedTokenizer that has unique_no_split_tokens, e.g. EleutherAI/gpt-j-6B:
     tokenizer = transformers.GPT2Tokenizer.from_pretrained('EleutherAI/gpt-j-6B')
  2. Use the tokenizer to encode a string that contains one of the unique_no_split_tokens, e.g. " <|extratoken_1|> " (a consolidated comparison script is shown below):
     print(tokenizer(" <|extratoken_1|> ").input_ids)
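The steps above can be consolidated into the following comparison script (a minimal sketch; the ids in the comments are the ones reported further down in this issue, not outputs verified here):

import transformers

# Slow (pure-Python) tokenizer -- the one showing the problem
slow = transformers.GPT2Tokenizer.from_pretrained("EleutherAI/gpt-j-6B")
# Fast (Rust-backed) tokenizer -- the reference behavior
fast = transformers.GPT2TokenizerFast.from_pretrained("EleutherAI/gpt-j-6B")

print(slow(" <|extratoken_1|> ").input_ids)  # reported: [50257] -- the surrounding spaces are dropped
print(fast(" <|extratoken_1|> ").input_ids)  # reported: [220, 50257, 220]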

Expected behavior

The tokenizer splits the string into 3 tokens (" ", "<|extratoken_1|>" and " "), and gives their ids ([220, 50257, 220]). This is the behavior of PreTrainedTokenizerFast.

But the actual behavior is that the PreTrainedTokenizer only gives the id of "<|extratoken_1|>", i.e. [50257].

Gompyn avatar Jan 14 '23 06:01 Gompyn

This is probably due to the following lines, which are still not fixed at HEAD.

https://github.com/huggingface/transformers/blob/f58248b8240571bbbb0918ddd36cc3fdf061df11/src/transformers/tokenization_utils.py#L532-L537

Gompyn avatar Jan 14 '23 06:01 Gompyn
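For context, here is a self-contained illustration of what the linked default branch does; this is a paraphrase for illustration only (split_with_default_stripping is a hypothetical helper, not the library source). For a no-split token that was not registered as an AddedToken, the slow tokenizer strips whitespace from the text segments on both sides of the token:

def split_with_default_stripping(text: str, special: str) -> list[str]:
    # Mimics the default "strip left and right" behavior around a no-split token.
    parts = text.split(special)
    segments = []
    for i, part in enumerate(parts):
        if i > 0:
            segments.append(special)
            part = part.lstrip()   # text to the right of the token loses its leading whitespace
        if i < len(parts) - 1:
            part = part.rstrip()   # text to the left of the token loses its trailing whitespace
        if part:
            segments.append(part)
    return segments

print(split_with_default_stripping(" <|extratoken_1|> ", "<|extratoken_1|>"))
# -> ['<|extratoken_1|>'] : the surrounding spaces (or "\n") are gone before BPE even runs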

This bug strips away \n around my special token, making my model believe that there is no newline in my text.

Gompyn avatar Jan 14 '23 06:01 Gompyn

@ArthurZucker I can pick this up. Let me know what the possible fix should be?

raghavanone avatar Jan 22 '23 12:01 raghavanone

There is indeed a discrepancy between the fast and slow versions. The problem here is that the tokens are part of the no_split_tokens, but they are not AddedTokens. I am not really sure whether the fast or the slow one has the correct behavior 😅

ArthurZucker avatar Jan 23 '23 15:01 ArthurZucker

The cleanest way is to have the tokens as AddedTokens, because then you can control the rstrip and lstrip arguments.

ArthurZucker avatar Jan 23 '23 15:01 ArthurZucker
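For illustration, a minimal sketch of that suggestion ("<|mytoken|>" is a hypothetical token; transformers.AddedToken takes lstrip/rstrip flags that control whitespace stripping around the token). Whether this fully removes the slow/fast discrepancy on the affected versions is not confirmed in this thread:

from transformers import AddedToken, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Register the token as an AddedToken and ask it not to strip surrounding whitespace
tokenizer.add_tokens([AddedToken("<|mytoken|>", lstrip=False, rstrip=False)])

print(tokenizer.tokenize(" <|mytoken|> "))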

@ArthurZucker I think decode(encode(text)) == text should be true by default, because some use cases (e.g. code generation) require the formatting of the text to be preserved. "Automatic formatting" should not be enabled by default, to avoid breaking such use cases. From another point of view, I would guess that most pre-trained models use a fast tokenizer (as the name fast implies), so these models also expect the behavior of the fast version.

Gompyn avatar Jan 23 '23 17:01 Gompyn

> I think decode(encode(text)) == text should be true by default

This is untrue for pretty much all tokenizers, since tokenization is a destructive operation. At the very least you get back the normalized text (with some minimal Unicode cleanup), but for some tokenizers like BERT you will have whitespace simplified or the text lowercased.

sgugger avatar Jan 23 '23 18:01 sgugger

> I think decode(encode(text)) == text should be true by default

> This is untrue for pretty much all tokenizers, since tokenization is a destructive operation. At the very least you get back the normalized text (with some minimal Unicode cleanup), but for some tokenizers like BERT you will have whitespace simplified or the text lowercased.

I agree that minimal Unicode cleanup is acceptable (mostly because it does not break my use cases), but whitespace simplification and lowercasing are not enabled by default, so by default users do get a mostly conservative tokenizer. However, the simplest way to add new tokens (add_tokens('mytoken'), with special_tokens=False by default) in a slow tokenizer accidentally (from the user's point of view) breaks this conservative behavior, and I think that is unexpected.

Gompyn avatar Jan 24 '23 01:01 Gompyn
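A round-trip check along these lines (the input string is a made-up code-generation example; the False result is what the stripping behavior reported above implies, not a verified output):

import transformers

slow = transformers.GPT2Tokenizer.from_pretrained("EleutherAI/gpt-j-6B")

# Hypothetical code-generation input where the newlines around the special token matter
text = "def f():\n    <|extratoken_1|>\n    return 1\n"

ids = slow(text).input_ids
roundtrip = slow.decode(ids)

# With the stripping described in this issue, the whitespace/newlines around
# <|extratoken_1|> are dropped, so the round trip is not exact
print(roundtrip == text)  # expected: False on affected versions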

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Feb 17 '23 15:02 github-actions[bot]

Is there any progress on this issue? @ArthurZucker

Gompyn avatar Feb 18 '23 16:02 Gompyn

Not yet! I finally have time so this week should be good!

ArthurZucker avatar Feb 20 '23 07:02 ArthurZucker

Is there any progress on this issue?

Gompyn avatar Apr 10 '23 09:04 Gompyn

Hey, to follow progress I suggest you check #23909, which should address this.

ArthurZucker avatar Jun 01 '23 11:06 ArthurZucker

Quick update: this is going to take a bit more time, as a more in-depth refactoring is needed.

ArthurZucker avatar Jun 26 '23 04:06 ArthurZucker

PR will be merged this week! 🤗

ArthurZucker avatar Aug 16 '23 07:08 ArthurZucker