`align_to_words=True` in `QuestionAnsweringPipeline` can lead to duplicate answers
System Info
- `transformers` version: 4.31.0
- Platform: macOS-13.4.1-arm64-arm-64bit
- Python version: 3.11.4
- Huggingface_hub version: 0.15.1
- Safetensors version: 0.3.1
- Accelerate version: 0.21.0
- Accelerate config: not found
- PyTorch version (GPU?): 2.0.1 (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
Who can help?
@Nars
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [X] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
```python
from transformers import pipeline

answers = pipeline("question-answering", model="deepset/tinyroberta-squad2")(
    question="Who is the chancellor of Germany?",
    context="Angela Merkel was the chancellor of Germany.",
    top_k=10,
)
print(answers[0])  # Returns {'score': 0.9961308836936951, 'start': 0, 'end': 13, 'answer': 'Angela Merkel'}
print(answers[5])  # Returns {'score': 7.520078361267224e-05, 'start': 0, 'end': 13, 'answer': 'Angela Merkel'}
```
If `align_to_words` is set to `True` (which is the default), all start or end tokens that are contained in the same word are mapped to the same start and end character indices (see here). This is expected when using `align_to_words`. However, the `top_k` filtering happens before this step, so duplicate answers can remain.
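For context, here is a minimal illustration of the word alignment using the fast tokenizer's offset utilities (this is not the pipeline's internal code): every token belonging to the same word resolves to the same character span, so two different (start token, end token) candidates can decode to the same answer string.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepset/tinyroberta-squad2")
encoding = tokenizer("Angela Merkel was the chancellor of Germany.")

# Each token is assigned to a word; all tokens of a word map to one char span.
for token_index, word_index in enumerate(encoding.word_ids()):
    if word_index is None:  # special tokens such as <s> and </s>
        continue
    char_span = encoding.word_to_chars(word_index)
    print(token_index, word_index, (char_span.start, char_span.end))
```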
Expected behavior
Ideally, the mapping from tokens to words should happen around this point. You would then have a start and an end probability for each word. If a word consists of multiple tokens, their probabilities should be summed. This would make the probabilities more accurate, because every token in the word would contribute to the probability of selecting that word.
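A rough sketch of that idea (a hypothetical helper, not the pipeline's actual code), assuming token-level probabilities and the tokenizer's `word_ids()` output are available:

```python
import numpy as np

def token_probs_to_word_probs(token_probs: np.ndarray, word_ids: list) -> np.ndarray:
    """Sum the start (or end) probabilities of all tokens that belong to the same word."""
    num_words = max(w for w in word_ids if w is not None) + 1
    word_probs = np.zeros(num_words)
    for token_index, word_index in enumerate(word_ids):
        if word_index is not None:  # skip special tokens
            word_probs[word_index] += token_probs[token_index]
    return word_probs
```

The `top_k` selection would then operate on the word-level probabilities, so each word is considered exactly once.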
If this is too slow, there should at least be a check for duplicates somewhere here. This would mean that you are not guaranteed to get k answers when setting `top_k`, only that you get at most k answers. A way to mitigate that somewhat (but not perfectly) would be to use a higher value than `top_k` when calling `select_starts_ends` here.
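A minimal sketch of that fallback, assuming the decoded answers have the `score`/`start`/`end` keys shown in the reproduction above:

```python
def deduplicate_answers(answers):
    """Keep only the highest-scoring answer for each character span."""
    seen = set()
    unique = []
    for answer in sorted(answers, key=lambda a: a["score"], reverse=True):
        key = (answer["start"], answer["end"])
        if key not in seen:
            seen.add(key)
            unique.append(answer)
    return unique
```

Since this can return fewer than `top_k` answers, over-sampling candidates before deduplication (as suggested above) would reduce how often that happens.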
cc @Rocketknight1 if you can have a look! 🤗
@michelbartels I managed to reproduce the bug, and I think your diagnosis of it is completely correct. Would you be interested in filing a PR with your solution? If you don't have time, that's fine! Just let us know and we'll put it on the list to get fixed internally.
@Rocketknight1 Thanks for looking into this, I am afraid I currently don't have time to contribute a fix.
@MichelBartels No problem! Thanks for the clean bug report anyway - I'll let you know when we have a fix.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
@BCreativeS, the issue still needs to be addressed.
This issue still needs to be addressed.
cc @Rocketknight1 if you can have a look! Would be nice 🤗
@Rocketknight1 probably didn't have time to have a look; feel free to have a go if anyone feels like it!
Yeah, it's on my list, but it's fairly low priority and the list is long, I'm sorry!
Hey @ArthurZucker and @MichelBartels, I made a PR for the issue: https://github.com/huggingface/transformers/pull/38799
I implemented subword probability summing for calculating the start and end probabilities. The code works, but it's long and (sort of) ugly. Please review it; if you like the changes, I'll benchmark the performance.