
`return_overflowing_tokens` has different behavior between slow tokenizer and fast tokenizer

BuxianChen opened this issue 2 years ago

System Info

  • transformers version: 4.28.1
  • Platform: Linux-5.10.147+-x86_64-with-glibc2.31
  • Python version: 3.9.16
  • Huggingface_hub version: 0.14.1
  • Safetensors version: not installed
  • PyTorch version (GPU?): 2.0.0+cu118 (False)
  • Tensorflow version (GPU?): 2.12.0 (False)
  • Flax version (CPU?/GPU?/TPU?): 0.6.8 (cpu)
  • Jax version: 0.4.8
  • JaxLib version: 0.4.7
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help?

@Arthur

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction

I'm studying chapter 6 of the NLP course and found that `return_overflowing_tokens` behaves differently between the slow tokenizer and the fast tokenizer. Is this a feature or a bug?

from transformers import DistilBertTokenizer, DistilBertTokenizerFast

model_checkpoint = "distilbert-base-cased-distilled-squad"
# Same checkpoint, two backends: the pure-Python (slow) and Rust-backed (fast) tokenizer.
slow_tokenizer = DistilBertTokenizer.from_pretrained(model_checkpoint)
fast_tokenizer = DistilBertTokenizerFast.from_pretrained(model_checkpoint)
sentence = "This sentence is not too long but we are going to split it anyway."
inputs = fast_tokenizer(
    sentence, truncation=True, return_overflowing_tokens=True, max_length=6, stride=2
)
print(inputs["input_ids"])

Then I got this output:

[[101, 1188, 5650, 1110, 1136, 1315, 1263, 102], [101, 1315, 1263, 1133, 1195, 1132, 1280, 102], [101, 1132, 1280, 1106, 3325, 1122, 4050, 102], [101, 1122, 4050, 119, 102]]
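
Decoding each chunk makes the stride-2 overlap between windows easier to see (a quick check reusing `fast_tokenizer` and `inputs` from above):

for ids in inputs["input_ids"]:
    # Each window repeats the last two content tokens of the previous one.
    print(fast_tokenizer.decode(ids))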

but when I replace `fast_tokenizer` with `slow_tokenizer`, I get only

[101, 1188, 5650, 1110, 1136, 1315, 1263, 102]

Expected behavior

The slow tokenizer should behave the same as the fast tokenizer.
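
Until the behaviors are aligned, here is a minimal sketch of how the fast tokenizer's sliding window could be reproduced on top of the slow tokenizer. `chunk_with_stride` is a hypothetical helper, not part of the transformers API, and it assumes (matching the fast output above) that `max_length` counts content tokens, with special tokens added to each chunk afterwards.

def chunk_with_stride(tokenizer, text, max_length=6, stride=2):
    # Slide a window of max_length content tokens over the encoded text,
    # overlapping the last `stride` tokens of each window with the next,
    # then add the model's special tokens ([CLS]/[SEP]) to every chunk.
    ids = tokenizer.encode(text, add_special_tokens=False)
    chunks = []
    for start in range(0, len(ids), max_length - stride):
        window = ids[start : start + max_length]
        chunks.append(tokenizer.build_inputs_with_special_tokens(window))
        if start + max_length >= len(ids):
            break
    return chunks

print(chunk_with_stride(slow_tokenizer, sentence))  # should match the fast output above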

BuxianChen avatar Apr 26 '23 07:04 BuxianChen

cc @ArthurZucker, but I think overflowing tokens are specifically a feature of our fast tokenizers, so it's completely normal that you don't have them in the slow ones.

sgugger avatar Apr 26 '23 12:04 sgugger

Hey! Thanks for reporting this. No, it seems that the `return_overflowing_tokens` logic is implemented in the base class, so it might be interesting to look at this. I'll have a look when I can; in the meantime, labelling this as a tokenizers bug.

ArthurZucker avatar May 26 '23 16:05 ArthurZucker

Okay, it seems there is a difference in design: the tokenizers library returns a batch of overflowing tokens that takes the max length and stride into account. So it creates a batch from a non-batched sentence, which could (?) be what was originally intended. However, this fails with an error if `return_tensors` is set. On the other hand, transformers just cuts the input sentence and returns everything that was truncated, without creating this strange behaviour. I am not really sure what is best, honestly. cc @Narsil, I think it's fine to just leave it as is? (I can edit the doc to make clear that the slow format is different from the fast one?)
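
For reference, this is roughly what the slow path looks like for the reproduction above, as I read the base-class behavior (the exact keys are worth double-checking against your version): the truncated ids come back under a separate `overflowing_tokens` entry in the same non-batched encoding, instead of as extra batch items.

# Reusing slow_tokenizer and sentence from the reproduction above.
slow_inputs = slow_tokenizer(
    sentence, truncation=True, return_overflowing_tokens=True, max_length=6, stride=2
)
print(slow_inputs["input_ids"])           # a single truncated sequence
print(slow_inputs["overflowing_tokens"])  # flat list of the ids that were cut off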

ArthurZucker avatar Jun 22 '23 11:06 ArthurZucker

Yes I'm not sure we should do something about it.

Narsil avatar Jun 22 '23 12:06 Narsil

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Jul 16 '23 15:07 github-actions[bot]

The problem still exists in the latest version.

0xDing avatar Jul 21 '23 14:07 0xDing

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Aug 14 '23 15:08 github-actions[bot]