
`align_to_words=True` in `QuestionAnsweringPipeline` can lead to duplicate answers

Open · MichelBartels opened this issue 2 years ago • 17 comments

System Info

  • transformers version: 4.31.0
  • Platform: macOS-13.4.1-arm64-arm-64bit
  • Python version: 3.11.4
  • Huggingface_hub version: 0.15.1
  • Safetensors version: 0.3.1
  • Accelerate version: 0.21.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.0.1 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help?

@Narsil

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

```python
from transformers import pipeline

answers = pipeline("question-answering", model="deepset/tinyroberta-squad2")(
    question="Who is the chancellor of Germany?",
    context="Angela Merkel was the chancellor of Germany.",
    top_k=10,
)
print(answers[0])  # Returns {'score': 0.9961308836936951, 'start': 0, 'end': 13, 'answer': 'Angela Merkel'}
print(answers[5])  # Returns {'score': 7.520078361267224e-05, 'start': 0, 'end': 13, 'answer': 'Angela Merkel'}
```

If `align_to_words` is set to `True` (which is the default), all start or end tokens that fall within the same word are mapped to the same start and end character index (see here). This is expected behavior when using `align_to_words`. However, the `top_k` filtering happens before this step, so duplicate answers can remain.

Expected behavior

Ideally, the mapping from token to word should happen around this point. You would then have a start and an end probability for each word; if a word consists of multiple tokens, their probabilities should be summed. This would also make the probabilities more accurate, because every token in the word would contribute to the probability of selecting that word.
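A minimal sketch of that aggregation, assuming per-token probability arrays and the `word_ids()` mapping that fast tokenizers return (the helper name and exact interface here are hypothetical, not the pipeline's actual internals):

```python
import numpy as np

def sum_token_probs_per_word(token_probs, word_ids):
    # Hypothetical helper: collapse per-token start (or end) probabilities
    # into per-word probabilities by summing over the tokens of each word.
    # `word_ids` maps each token index to a word index, or None for special
    # tokens (cf. BatchEncoding.word_ids() on fast tokenizers).
    word_probs = {}
    for token_idx, word_idx in enumerate(word_ids):
        if word_idx is None:  # skip special tokens such as <s> and </s>
            continue
        word_probs[word_idx] = word_probs.get(word_idx, 0.0) + float(token_probs[token_idx])
    return word_probs

# Toy example: a word that splits into two tokens gets their summed probability.
probs = np.array([0.0, 0.7, 0.2, 0.1, 0.0])  # per-token start probabilities
word_ids = [None, 0, 1, 1, None]             # token -> word mapping
print(sum_token_probs_per_word(probs, word_ids))  # {0: 0.7, 1: 0.3} (up to float rounding)
```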

If that aggregation is too slow, there should at least be a check for duplicates somewhere here. This would mean that setting `top_k` no longer guarantees k answers, only at most k answers. One way to mitigate that somewhat (though not perfectly) would be to pass a value higher than `top_k` when calling `select_starts_ends` here.
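For the cheaper variant, a sketch of a post-hoc duplicate check over the pipeline's already-aligned output (assuming `answers` is the score-sorted list of dicts the pipeline returns):

```python
def deduplicate_answers(answers, top_k):
    # Keep only the first (highest-scoring) answer for each distinct
    # (start, end) character span; may return fewer than top_k items.
    seen_spans = set()
    unique = []
    for answer in answers:
        span = (answer["start"], answer["end"])
        if span in seen_spans:
            continue
        seen_spans.add(span)
        unique.append(answer)
        if len(unique) == top_k:
            break
    return unique
```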

MichelBartels · Sep 20 '23 10:09

cc @Rocketknight1 if you can have a look! 🤗

ArthurZucker · Sep 27 '23 10:09

@MichelBartels I managed to reproduce the bug, and I think your diagnosis is completely correct. Would you be interested in filing a PR with your solution? If you don't have time, that's fine! Just let us know and we'll put it on the list to get fixed internally.

Rocketknight1 · Sep 27 '23 12:09

@Rocketknight1 Thanks for looking into this, I am afraid I currently don't have time to contribute a fix.

MichelBartels · Sep 28 '23 13:09

@MichelBartels No problem! Thanks for the clean bug report anyway - I'll let you know when we have a fix.

Rocketknight1 · Sep 28 '23 14:09

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

BCreativeS · Oct 23 '23 06:10

@BCreativeS , the issue still needs to be addressed.

MichelBartels · Oct 23 '23 08:10

This issue still needs to be addressed.

MichelBartels · Dec 12 '23 09:12

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] · Jan 14 '24 08:01

This issue still needs to be addressed.

MichelBartels · Jan 14 '24 22:01

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] · Feb 08 '24 08:02

This issue still needs to be addressed.

MichelBartels · Feb 08 '24 08:02

cc @Rocketknight1 if you can have a look! Would be nice 🤗

ArthurZucker · Feb 12 '24 05:02

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] · Mar 07 '24 08:03

This issue still needs to be addressed.

MichelBartels · Mar 07 '24 08:03

@Rocketknight1 probably didn't have time to have a look, feel free to have a go if anyone feels like it!

ArthurZucker · Mar 25 '24 03:03

Yeah, it's on my list, but it's fairly low priority and the list is long, I'm sorry!

Rocketknight1 · Mar 25 '24 14:03

Hey @ArthurZucker and @MichelBartels, I made a PR for the issue: https://github.com/huggingface/transformers/pull/38799

I implemented subword probability summing for calculating the start and end probabilities. The code works, but it's long and (sort of) ugly. Please review it; if you like the changes, I'll benchmark the performance.

DuyguA · Jun 12 '25 18:06