`align_to_words=True` in `QuestionAnsweringPipeline` can lead to duplicate answers
System Info
- `transformers` version: 4.31.0
- Platform: macOS-13.4.1-arm64-arm-64bit
- Python version: 3.11.4
- Huggingface_hub version: 0.15.1
- Safetensors version: 0.3.1
- Accelerate version: 0.21.0
- Accelerate config: not found
- PyTorch version (GPU?): 2.0.1 (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
Who can help?
@Narsil
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [X] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
```python
from transformers import pipeline

answers = pipeline("question-answering", model="deepset/tinyroberta-squad2")(
    question="Who is the chancellor of Germany?",
    context="Angela Merkel was the chancellor of Germany.",
    top_k=10,
)
print(answers[0])  # Returns {'score': 0.9961308836936951, 'start': 0, 'end': 13, 'answer': 'Angela Merkel'}
print(answers[5])  # Returns {'score': 7.520078361267224e-05, 'start': 0, 'end': 13, 'answer': 'Angela Merkel'}
```
If `align_to_words` is set to `True` (which is the default), all start or end tokens contained in the same word are mapped to that word's start and end character indices (see here). This is expected behavior when using `align_to_words`. However, the `top_k` filtering happens before this step, so duplicate answers can remain.
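To see why several token spans can collapse into one answer, it helps to inspect the tokenization of the context directly. A minimal sketch, assuming the model ships a fast tokenizer (which exposes `word_ids()` and offset mappings):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepset/tinyroberta-squad2")
encoding = tokenizer(
    "Angela Merkel was the chancellor of Germany.",
    return_offsets_mapping=True,
)

# Several tokens can share one word index; align_to_words maps any of
# them to the same character span, hence the duplicate answers.
tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"])
for token, word_id, offsets in zip(tokens, encoding.word_ids(), encoding["offset_mapping"]):
    print(token, word_id, offsets)
```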
Expected behavior
Ideally, the mapping from token to word should happen at around this point. You would then have a start and an end probability for each word; if a word consists of multiple tokens, their probabilities should be summed. This would also make the probabilities more accurate, because every token in the word would contribute to the probability of selecting that word.
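A minimal sketch of that aggregation, assuming the token-level start (or end) probabilities and a token-to-word mapping are already at hand; the names here are illustrative, not the pipeline's internals:

```python
import numpy as np

def sum_probs_per_word(token_probs: np.ndarray, word_ids: list) -> np.ndarray:
    """Sum token-level probabilities into word-level probabilities.

    token_probs: shape (num_tokens,), start or end probability per token.
    word_ids: word index per token, None for special tokens (e.g. <s>).
    """
    num_words = 1 + max(w for w in word_ids if w is not None)
    word_probs = np.zeros(num_words)
    for prob, word_id in zip(token_probs, word_ids):
        if word_id is not None:
            word_probs[word_id] += prob
    return word_probs

# e.g. three tokens spanning two words: the first word gets 0.2 + 0.5
print(sum_probs_per_word(np.array([0.2, 0.5, 0.3]), [0, 0, 1]))  # [0.7 0.3]
```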
If this is too slow, there should at least be a check for duplicates somewhere here. This would mean that setting `top_k` no longer guarantees exactly k answers, only at most k. A way to mitigate that somewhat (though not perfectly) would be to use a higher value than `top_k` when calling `select_starts_ends` here.
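In the meantime, that duplicate check is also easy to apply on the user side, over the pipeline's output. A sketch, assuming one simply oversamples `top_k` and then deduplicates by character span (the oversampling factor of 2 is arbitrary):

```python
from transformers import pipeline

def unique_answers(answers, k):
    """Keep the highest-scoring answer per (start, end) character span."""
    seen, unique = set(), []
    for answer in answers:  # the pipeline returns answers sorted by score
        span = (answer["start"], answer["end"])
        if span not in seen:
            seen.add(span)
            unique.append(answer)
        if len(unique) == k:
            break
    return unique

qa = pipeline("question-answering", model="deepset/tinyroberta-squad2")
answers = qa(
    question="Who is the chancellor of Germany?",
    context="Angela Merkel was the chancellor of Germany.",
    top_k=2 * 10,  # oversample, then deduplicate down to 10
)
print(unique_answers(answers, 10))
```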