Training of FARMReader uses too many and potentially wrong no-answer labels due to a bug in SquadProcessor
**Describe the bug**
When training a QA model with the FARMReader.train() method, the SquadProcessor is used to convert the SQuAD-style JSON file into training samples.
The context for a question might be longer than the model's token limit, so the processor splits the full context into smaller passages. It then checks whether the original answer is present in a passage using its character positions. If the answer is not present in the passage, the sample is automatically used as a no-answer sample.
The code is here: https://github.com/deepset-ai/haystack/blob/a2905d05f798ea3335596247b98ec711eb6cd542/haystack/modeling/data_handler/processor.py#L643
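For illustration, here is a rough sketch of the splitting behaviour described above. This is not the actual `SquadProcessor` implementation; the window and stride sizes are invented:

```python
# Rough sketch of offset-based passage splitting; not the actual
# SquadProcessor code.
def split_with_no_answer_labels(context: str, answer_start: int, answer_end: int,
                                passage_len: int = 100, stride: int = 50):
    samples = []
    for start in range(0, len(context), stride):
        end = start + passage_len
        passage = context[start:end]
        if answer_start >= start and answer_end <= end:
            # The labeled answer span lies fully inside this passage.
            samples.append({"passage": passage,
                            "answer_offsets": (answer_start - start, answer_end - start)})
        else:
            # The character offsets fall outside the window, so the passage is
            # labeled as no-answer, even if the answer string occurs in it.
            samples.append({"passage": passage, "answer_offsets": None})
    return samples
```

For a long context, almost every window misses the answer offsets, so the vast majority of the generated samples end up as no-answer samples.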
This creates multiple issues:
- the user is not aware of this behaviour
- for long documents, too many no-answer samples are created
- the answer string might actually be present in the passage, but the sample is still labeled as no-answer
**Expected behavior**
- Give the user a parameter `max_no_answer_per_context` where they can decide how many no-answer samples should be created.
- Check whether the actual answer has a string match in that passage, and never use a sample as a no-answer sample if there is a string match/overlap (see the sketch after this list).
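A minimal sketch of what such a cap could look like as a post-processing step. The helper `cap_no_answer_samples` is hypothetical; only the parameter name `max_no_answer_per_context` comes from this issue, and the sample dicts are assumed to have the `"answer_offsets"` key from the sketch above:

```python
import random

def cap_no_answer_samples(samples, max_no_answer_per_context=1, seed=42):
    """Hypothetical post-processing: keep all positive samples from a context,
    but only a capped number of randomly chosen no-answer passages."""
    positives = [s for s in samples if s["answer_offsets"] is not None]
    negatives = [s for s in samples if s["answer_offsets"] is None]
    random.Random(seed).shuffle(negatives)
    return positives + negatives[:max_no_answer_per_context]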
Hi @julian-risch, it sounds like we agree that we would like control over how many no-answer labels are created when training a FARMReader model, i.e., implementing something like `max_no_answer_per_context`. As for the second suggestion, it sounds like we thought it would be best to print a warning message instead of removing the no-answer sample.
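A minimal sketch of what such a warning could look like; the function name and message are assumptions, not the actual implementation:

```python
import logging

logger = logging.getLogger(__name__)

def warn_on_string_match(passage: str, answer_text: str) -> None:
    # Hypothetical check: the passage is about to be labeled no-answer, but
    # the answer string does occur in it, so we warn the user instead of
    # silently keeping the no-answer label.
    if answer_text and answer_text in passage:
        logger.warning(
            "Passage is labeled as no-answer although the answer text %r "
            "occurs in it. The character offsets of the label may not match "
            "this passage window.", answer_text
        )
```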
Regarding evaluation: the issue outlined here does not affect it, because it was determined that evaluation aggregates results per file, which is why Michel closed the issue https://github.com/deepset-ai/haystack/issues/2622.
Additionally, we agreed that adding documentation to the FARMReader training docs explaining how no-answer labels are automatically generated would be very helpful.
Alright, thank you. I'll tag @brandenchan and @agnieszka-m so that they also learn about the needed documentation update.