`align_to_words=True` in `QuestionAnsweringPipeline` can lead to duplicate answers
System Info
- `transformers` version: 4.31.0
- Platform: macOS-13.4.1-arm64-arm-64bit
- Python version: 3.11.4
- Huggingface_hub version: 0.15.1
- Safetensors version: 0.3.1
- Accelerate version: 0.21.0
- Accelerate config: not found
- PyTorch version (GPU?): 2.0.1 (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
Who can help?
@Nars
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [X] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
```python
from transformers import pipeline

answers = pipeline("question-answering", model="deepset/tinyroberta-squad2")(
    question="Who is the chancellor of Germany?",
    context="Angela Merkel was the chancellor of Germany.",
    top_k=10,
)
print(answers[0])  # Returns {'score': 0.9961308836936951, 'start': 0, 'end': 13, 'answer': 'Angela Merkel'}
print(answers[5])  # Returns {'score': 7.520078361267224e-05, 'start': 0, 'end': 13, 'answer': 'Angela Merkel'}
```
If `align_to_words` is set to `True` (which is the default), all start or end tokens that are contained in the same word are mapped to the same start and end character indices (see here). This is expected when using `align_to_words`. However, the `top_k` filtering happens before this step, so duplicate answers can remain.
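For context, here is a minimal illustration of the word alignment using the fast tokenizer's offset utilities (this is not the pipeline's internal code): every token belonging to the same word resolves to the same character span, so two different (start token, end token) candidates can decode to the same answer string.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepset/tinyroberta-squad2")
encoding = tokenizer("Angela Merkel was the chancellor of Germany.")

# Each token is assigned to a word; all tokens of a word map to one char span.
for token_index, word_index in enumerate(encoding.word_ids()):
    if word_index is None:  # special tokens such as <s> and </s>
        continue
    char_span = encoding.word_to_chars(word_index)
    print(token_index, word_index, (char_span.start, char_span.end))
```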
Expected behavior
Ideally, the mapping from tokens to words should happen around this point. You would then have a start and an end probability for each word. If a word consists of multiple tokens, their probabilities should be summed. This would make the probabilities more accurate, because every token in the word would contribute to the probability of selecting that word.
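A rough sketch of that idea (a hypothetical helper, not the pipeline's actual code), assuming token-level probabilities and the tokenizer's `word_ids()` output are available:

```python
import numpy as np

def token_probs_to_word_probs(token_probs: np.ndarray, word_ids: list) -> np.ndarray:
    """Sum the start (or end) probabilities of all tokens that belong to the same word."""
    num_words = max(w for w in word_ids if w is not None) + 1
    word_probs = np.zeros(num_words)
    for token_index, word_index in enumerate(word_ids):
        if word_index is not None:  # skip special tokens
            word_probs[word_index] += token_probs[token_index]
    return word_probs
```

The `top_k` selection would then operate on the word-level probabilities, so each word is considered exactly once.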
If this is too slow, there should at least be a check for duplicates somewhere here. This would mean that you are not guaranteed to get k answers when setting `top_k`, only that you get at most k answers. A way to mitigate that somewhat (but not perfectly) would be to use a higher value than `top_k` when calling `select_starts_ends` here.
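A minimal sketch of that fallback, assuming the decoded answers have the `score`/`start`/`end` keys shown in the reproduction above:

```python
def deduplicate_answers(answers):
    """Keep only the highest-scoring answer for each character span."""
    seen = set()
    unique = []
    for answer in sorted(answers, key=lambda a: a["score"], reverse=True):
        key = (answer["start"], answer["end"])
        if key not in seen:
            seen.add(key)
            unique.append(answer)
    return unique
```

Since this can return fewer than `top_k` answers, over-sampling candidates before deduplication (as suggested above) would reduce how often that happens.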
cc @Rocketknight1 if you can have a look! 🤗
@michelbartels I managed to reproduce the bug, and I think your diagnosis of it is completely correct. Would you be interested in filing a PR with your solution? If you don't have time, that's fine! Just let us know and we'll put it on the list to get fixed internally.
@Rocketknight1 Thanks for looking into this, I am afraid I currently don't have time to contribute a fix.
@MichelBartels No problem! Thanks for the clean bug report anyway - I'll let you know when we have a fix.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
@BCreativeS, the issue still needs to be addressed.
This issue still needs to be addressed.
cc @Rocketknight1 if you can have a look! Would be nice 🤗
@Rocketknight1 probably didn't have time to have a look; feel free to have a go if anyone feels like it!
Yeah, it's on my list, but it's fairly low priority and the list is long, I'm sorry!
Hey @ArthurZucker and @MichelBartels, I made a PR for the issue: https://github.com/huggingface/transformers/pull/38799
I implemented subword probability summing for calculating the start and end probabilities. The code works, but it's long and (sort of) ugly. Please review it; if you like the changes, I'll benchmark the performance.