Training of FARMReader uses too many and potentially wrong no-answer labels due to a bug in SquadProcessor
**Describe the bug**
When training a QA model with the FARMReader.train() method, the SquadProcessor is used to convert the SQuAD-style JSON file into training samples.
The context for a question might be longer than the model's token limit, so the processor splits the full context into smaller passages. It then checks whether the original answer is present in a passage using its character positions. If the answer is not present in the passage, the sample is automatically used as a no-answer sample.
The code is here: https://github.com/deepset-ai/haystack/blob/a2905d05f798ea3335596247b98ec711eb6cd542/haystack/modeling/data_handler/processor.py#L643
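For illustration, here is a rough sketch of the splitting behaviour described above. This is not the actual `SquadProcessor` implementation; the window and stride sizes are invented:

```python
# Rough sketch of offset-based passage splitting; not the actual
# SquadProcessor code.
def split_with_no_answer_labels(context: str, answer_start: int, answer_end: int,
                                passage_len: int = 100, stride: int = 50):
    samples = []
    for start in range(0, len(context), stride):
        end = start + passage_len
        passage = context[start:end]
        if answer_start >= start and answer_end <= end:
            # The labeled answer span lies fully inside this passage.
            samples.append({"passage": passage,
                            "answer_offsets": (answer_start - start, answer_end - start)})
        else:
            # The character offsets fall outside the window, so the passage is
            # labeled as no-answer, even if the answer string occurs in it.
            samples.append({"passage": passage, "answer_offsets": None})
    return samples
```

For a long context, almost every window misses the answer offsets, so the vast majority of the generated samples end up as no-answer samples.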
This creates multiple issues:
- the user is not aware of this behaviour
- for long documents, too many no-answer samples are created
- the answer string might actually be present in the passage, but the sample is still labeled as no-answer
**Expected behavior**
- Give the user a parameter `max_no_answer_per_context` where they can decide how many no-answer samples should be created.
- Check whether the actual answer has a string match in that passage, and never use a sample as a no-answer sample if there is a string match/overlap (see the sketch after this list).
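A minimal sketch of what such a cap could look like as a post-processing step. The helper `cap_no_answer_samples` is hypothetical; only the parameter name `max_no_answer_per_context` comes from this issue, and the sample dicts are assumed to have the `"answer_offsets"` key from the sketch above:

```python
import random

def cap_no_answer_samples(samples, max_no_answer_per_context=1, seed=42):
    """Hypothetical post-processing: keep all positive samples from a context,
    but only a capped number of randomly chosen no-answer passages."""
    positives = [s for s in samples if s["answer_offsets"] is not None]
    negatives = [s for s in samples if s["answer_offsets"] is None]
    random.Random(seed).shuffle(negatives)
    return positives + negatives[:max_no_answer_per_context]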
Hi @julian-risch, it sounds like we agree that we would like control over how many no-answer labels are created when training a FARMReader model, i.e., implementing something like `max_no_answer_per_context`. As for the second suggestion, it sounds like we thought it would be best to print a warning message instead of removing the no-answer sample.
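A minimal sketch of what such a warning could look like; the function name and message are assumptions, not the actual implementation:

```python
import logging

logger = logging.getLogger(__name__)

def warn_on_string_match(passage: str, answer_text: str) -> None:
    # Hypothetical check: the passage is about to be labeled no-answer, but
    # the answer string does occur in it, so we warn the user instead of
    # silently keeping the no-answer label.
    if answer_text and answer_text in passage:
        logger.warning(
            "Passage is labeled as no-answer although the answer text %r "
            "occurs in it. The character offsets of the label may not match "
            "this passage window.", answer_text
        )
```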
Regarding evaluation: the issue outlined here does not affect it, because it was determined that evaluation aggregates results per file, which is why Michel closed the issue https://github.com/deepset-ai/haystack/issues/2622.
Additionally, we agreed that adding documentation to the FARMReader training docs explaining how no-answer labels are automatically generated would be very helpful.
Alright, thank you. I'll tag @brandenchan and @agnieszka-m so that they also learn about the needed documentation update.