`align_to_words=True` in `QuestionAnsweringPipeline` can lead to duplicate answers
System Info
- `transformers` version: 4.31.0
- Platform: macOS-13.4.1-arm64-arm-64bit
- Python version: 3.11.4
- Huggingface_hub version: 0.15.1
- Safetensors version: 0.3.1
- Accelerate version: 0.21.0
- Accelerate config: not found
- PyTorch version (GPU?): 2.0.1 (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
Who can help?
@Narsil
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [X] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
```python
from transformers import pipeline

answers = pipeline("question-answering", model="deepset/tinyroberta-squad2")(
    question="Who is the chancellor of Germany?",
    context="Angela Merkel was the chancellor of Germany.",
    top_k=10,
)
print(answers[0])  # Returns {'score': 0.9961308836936951, 'start': 0, 'end': 13, 'answer': 'Angela Merkel'}
print(answers[5])  # Returns {'score': 7.520078361267224e-05, 'start': 0, 'end': 13, 'answer': 'Angela Merkel'}
```
If `align_to_words` is set to `True` (which is the default), all start or end tokens contained in the same word are mapped to that word's start and end character indices (see here). This is expected behavior when using `align_to_words`. However, the `top_k` filtering happens before this step, so duplicate answers can remain.
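To see why several token spans can collapse into one answer, it helps to inspect the tokenization of the context directly. A minimal sketch, assuming the model ships a fast tokenizer (which exposes `word_ids()` and offset mappings):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepset/tinyroberta-squad2")
encoding = tokenizer(
    "Angela Merkel was the chancellor of Germany.",
    return_offsets_mapping=True,
)

# Several tokens can share one word index; align_to_words maps any of
# them to the same character span, hence the duplicate answers.
tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"])
for token, word_id, offsets in zip(tokens, encoding.word_ids(), encoding["offset_mapping"]):
    print(token, word_id, offsets)
```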
Expected behavior
Ideally, the mapping from token to word should happen at around this point. You would then have a start and an end probability for each word; if a word consists of multiple tokens, their probabilities should be summed. This would also make the probabilities more accurate, because every token in the word would contribute to the probability of selecting that word.
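A minimal sketch of that aggregation, assuming the token-level start (or end) probabilities and a token-to-word mapping are already at hand; the names here are illustrative, not the pipeline's internals:

```python
import numpy as np

def sum_probs_per_word(token_probs: np.ndarray, word_ids: list) -> np.ndarray:
    """Sum token-level probabilities into word-level probabilities.

    token_probs: shape (num_tokens,), start or end probability per token.
    word_ids: word index per token, None for special tokens (e.g. <s>).
    """
    num_words = 1 + max(w for w in word_ids if w is not None)
    word_probs = np.zeros(num_words)
    for prob, word_id in zip(token_probs, word_ids):
        if word_id is not None:
            word_probs[word_id] += prob
    return word_probs

# e.g. three tokens spanning two words: the first word gets 0.2 + 0.5
print(sum_probs_per_word(np.array([0.2, 0.5, 0.3]), [0, 0, 1]))  # [0.7 0.3]
```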
If this is too slow, there should at least be a check for duplicates somewhere here. This would mean that setting `top_k` no longer guarantees exactly k answers, only at most k. A way to mitigate that somewhat (though not perfectly) would be to use a higher value than `top_k` when calling `select_starts_ends` here.
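In the meantime, that duplicate check is also easy to apply on the user side, over the pipeline's output. A sketch, assuming one simply oversamples `top_k` and then deduplicates by character span (the oversampling factor of 2 is arbitrary):

```python
from transformers import pipeline

def unique_answers(answers, k):
    """Keep the highest-scoring answer per (start, end) character span."""
    seen, unique = set(), []
    for answer in answers:  # the pipeline returns answers sorted by score
        span = (answer["start"], answer["end"])
        if span not in seen:
            seen.add(span)
            unique.append(answer)
        if len(unique) == k:
            break
    return unique

qa = pipeline("question-answering", model="deepset/tinyroberta-squad2")
answers = qa(
    question="Who is the chancellor of Germany?",
    context="Angela Merkel was the chancellor of Germany.",
    top_k=2 * 10,  # oversample, then deduplicate down to 10
)
print(unique_answers(answers, 10))
```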