`return_overflowing_tokens` has different behavior between slow tokenizer and fast tokenizer
System Info
- transformers version: 4.28.1
- Platform: Linux-5.10.147+-x86_64-with-glibc2.31
- Python version: 3.9.16
- Huggingface_hub version: 0.14.1
- Safetensors version: not installed
- PyTorch version (GPU?): 2.0.0+cu118 (False)
- Tensorflow version (GPU?): 2.12.0 (False)
- Flax version (CPU?/GPU?/TPU?): 0.6.8 (cpu)
- Jax version: 0.4.8
- JaxLib version: 0.4.7
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
Who can help?
@Arthur
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
I'm working through chapter 6 of the NLP course and found that `return_overflowing_tokens` behaves differently between the slow tokenizer and the fast tokenizer. Is this a feature or a bug?
```python
from transformers import DistilBertTokenizer, DistilBertTokenizerFast

model_checkpoint = "distilbert-base-cased-distilled-squad"
slow_tokenizer = DistilBertTokenizer.from_pretrained(model_checkpoint)
fast_tokenizer = DistilBertTokenizerFast.from_pretrained(model_checkpoint)

sentence = "This sentence is not too long but we are going to split it anyway."
inputs = fast_tokenizer(
    sentence, truncation=True, return_overflowing_tokens=True, max_length=6, stride=2
)
print(inputs["input_ids"])
```
Then I got the output:

```
[[101, 1188, 5650, 1110, 1136, 1315, 1263, 102], [101, 1315, 1263, 1133, 1195, 1132, 1280, 102], [101, 1132, 1280, 1106, 3325, 1122, 4050, 102], [101, 1122, 4050, 119, 102]]
```
but when I replace `fast_tokenizer` with `slow_tokenizer`, I got

```
[101, 1188, 5650, 1110, 1136, 1315, 1263, 102]
```
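As a side note, decoding each fast-tokenizer chunk makes the stride-2 overlap concrete; the decoded strings in the comments are my reading of the IDs printed above:

```python
# Each chunk overlaps the previous one by stride=2 tokens.
for ids in inputs["input_ids"]:
    print(fast_tokenizer.decode(ids))
# [CLS] This sentence is not too long [SEP]
# [CLS] too long but we are going [SEP]
# [CLS] are going to split it anyway [SEP]
# [CLS] it anyway. [SEP]
```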
Expected behavior
The slow tokenizer should behave the same as the fast tokenizer.
cc @ArthurZucker, but I think overflowing tokens are specifically a feature of our fast tokenizers, so it's completely normal that you don't have it in the slow ones.
Hey! Thanks for reporting this. No, it seems that the `return_overflowing_tokens` logic is implemented in the base class, so it might be interesting to look at this. I'll have a look when I can; in the meantime, labelling as a `tokenizers` bug.
Okay, it seems that there is a difference in design: the `tokenizers` library returns a batch of overflowing tokens that takes the max length and stride into account. So it creates a batch from a non-batched sentence, which could (?) be what was originally intended. However, this will fail with an error if `return_tensors` is set.
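For anyone hitting that error: the chunks have different lengths (the last one is shorter), so tensor conversion fails unless the batch is padded. A minimal workaround sketch (my suggestion, not something from this thread):

```python
# Padding to the longest chunk makes the batch rectangular,
# so converting to tensors no longer fails.
padded = fast_tokenizer(
    sentence,
    truncation=True,
    return_overflowing_tokens=True,
    max_length=6,
    stride=2,
    padding=True,
    return_tensors="pt",
)
print(padded["input_ids"].shape)  # (num_chunks, longest_chunk_length)
```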
On the other hand, `transformers` (the slow tokenizer) just cuts the input sentence and returns everything that was truncated, without creating this strange batch behaviour.
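Concretely, with the slow tokenizer the truncated tokens end up under separate keys instead of extra rows of `input_ids`. A minimal sketch; the key names `overflowing_tokens` and `num_truncated_tokens` are what I see in 4.28 and are worth double-checking:

```python
slow_inputs = slow_tokenizer(
    sentence, truncation=True, return_overflowing_tokens=True, max_length=6, stride=2
)
print(slow_inputs["input_ids"])            # the single truncated sequence
print(slow_inputs["overflowing_tokens"])   # flat list of the cut-off token ids
print(slow_inputs["num_truncated_tokens"])
```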
I am not really sure what is best, honestly. cc @Narsil, I think it's fine to just leave it as is? (I can edit the docs to make it clear that the slow output format differs from the fast one?)
Yes, I'm not sure we should do anything about it.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
The problem still exists in the latest version.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.