Reproducing deepset/roberta-base-squad2 metrics on HuggingFace
Describe the bug
This is not really a bug, but an investigation into why the metrics reported for deepset/roberta-base-squad2 on HuggingFace are not reproducible using the `FARMReader.eval_on_file` function in Haystack version 1.6.0.
Metrics on HuggingFace:

```
"exact": 79.87029394424324,
"f1": 82.91251169582613,
"total": 11873,
"HasAns_exact": 77.93522267206478,
"HasAns_f1": 84.02838248389763,
"HasAns_total": 5928,
"NoAns_exact": 81.79983179142137,
"NoAns_f1": 81.79983179142137,
"NoAns_total": 5945
```
Expected behavior

The following code should produce metrics similar to the ones reported on HuggingFace:

```python
from haystack.nodes import FARMReader

reader = FARMReader(
    model_name_or_path='deepset/roberta-base-squad2', use_gpu=True, max_seq_len=386
)
metrics = reader.eval_on_file(
    data_dir='./data/',
    test_filename='dev-v2.0.json',
    device='cuda',
)
```
Data for eval is available here (official SQuAD2 dev set): https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
Currently, Haystack v1.6.0 produces

```
"v1.6.0": {
    "EM": 76.13913922344815,
    "f1": 78.1999299751871,
    "top_n_accuracy": 97.45641371178304,
    "top_n": 4,
    "EM_text_answer": 61.52159244264508,
    "f1_text_answer": 65.64908377115323,
    "top_n_accuracy_text_answer": 94.90553306342781,
    "top_n_EM_text_answer": 81.34278002699055,
    "top_n_f1_text_answer": 90.69119695344651,
    "Total_text_answer": 5928,
    "EM_no_answer": 90.71488645920942,
    "f1_no_answer": 90.71488645920942,
    "top_n_accuracy_no_answer": 100.0,
    "Total_no_answer": 5945
},
```
and we can immediately see that the EM (79.9 vs. 76.1) and F1 (82.9 vs. 78.2) scores are lower than expected.
If we roll back to Haystack v1.3.0, we get the following metrics

```
"v1.3.0": {
    "EM": 0.7841320643476796,
    "f1": 0.826529420882385,
    "top_n_accuracy": 0.9741430135601785
}
```
which much more closely match the ones reported on HuggingFace.
Identifying the change in Haystack
I believe the code change that caused this difference in results was introduced by the solution to issue https://github.com/deepset-ai/haystack/issues/2410. To check this, I updated `FARMReader.eval_on_file` to accept the argument `use_no_answer_legacy_confidence` (which was originally added in https://github.com/deepset-ai/haystack/pull/2414/files for `FARMReader.eval` but not for `FARMReader.eval_on_file`) in the branch https://github.com/deepset-ai/haystack/tree/issue_farmreader_eval and ran the following code:
```python
results = reader.eval_on_file(
    data_dir='./data/',
    test_filename='dev-v2.0.json',
    device='cuda',
    use_no_answer_legacy_confidence=True
)
```
which produced

```
{'EM': 78.37951655015581,
 'f1': 82.61101723722291,
 'top_n_accuracy': 97.41430135601786,
 'top_n': 4,
 'EM_text_answer': 75.0,
 'f1_text_answer': 83.47513624452559,
 'top_n_accuracy_text_answer': 94.82118758434548,
 'top_n_EM_text_answer': 81.00539811066126,
 'top_n_f1_text_answer': 90.50564293271806,
 'Total_text_answer': 5928,
 'EM_no_answer': 81.74936921783012,
 'f1_no_answer': 81.74936921783012,
 'top_n_accuracy_no_answer': 100.0,
 'Total_no_answer': 5945}
```
The EM and F1 scores now show the expected values.
Solution
I'm not entirely sure how this should be resolved, but it seems the `no_answer` logic used in eval should probably be somehow linked to the models that were trained using that logic.
- ~~[ ] Consider saving the value of `use_no_answer_legacy_confidence` with the model metadata so that when the model is reloaded it uses the same `no_answer` confidence logic as it was trained with.~~ This would not solve the issue, as discussed below.
FAQ Check
- [x] Have you had a look at our new FAQ page?
System:
- OS: ubuntu
- GPU/CPU: GPU
- Haystack version (commit or version number): v1.6.0
@tstadel @Timoeller Would it be possible to save the value of `use_no_answer_legacy_confidence` with the model metadata so that when the model is reloaded it uses the same `no_answer` confidence logic as it was trained with?
@sjrl I think it's not really solving it as this would only fix the `eval()` and `eval_on_file()` methods but not `FARMReader`'s `predict()`.
@tstadel Yes, my PR is incomplete. I just opened it so the work would not need to be redone. So feel free to add commits, or if it's easier we can close it as well.
> I think it's not really solving it as this would only fix the `eval()` and `eval_on_file()` methods but not `FARMReader`'s `predict()`.

Oh sorry, I misinterpreted what you meant. I'm still not too familiar with the source code of `FARMReader` 😅 I agree the proposed solution wouldn't fix the `predict()` method.
I think saving the value of `use_no_answer_legacy_confidence` doesn't work for all cases because it's not set during training and is only relevant during postprocessing.
The issue it addresses is that we can't compare the probabilities of different documents and splits (documents whose sequence length is longer than the reader's `max_seq_len`) because they are created using softmax, which normalizes the scores separately for each split/document. This issue is not really present during eval because we have just one document, which is usually small enough that it doesn't need to be split. This means that for eval `use_no_answer_legacy_confidence=True` works fine, while it probably does not during inference.
However, just applying a sigmoid to the no_answer scores, which is what happens when `use_no_answer_legacy_confidence` is `False`, also creates issues because all other values are still scaled by softmax. This is mitigated a bit by dividing by 8, but that's not an ideal solution.
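To make the comparability issue concrete, here is a minimal numeric sketch (this is not Haystack's code; the logit values are invented, and only the divide-by-8 sigmoid scaling follows the description above):

```python
import numpy as np

def softmax(logits):
    exps = np.exp(logits - np.max(logits))
    return exps / exps.sum()

# Two splits of one long document, scored independently.
# Softmax normalizes each split on its own, so the resulting
# "confidences" are not comparable across splits.
split_a = np.array([6.0, 2.0, 0.5])   # one clearly dominant span
split_b = np.array([1.0, 0.9, 0.8])   # three weak, near-identical spans
print(softmax(split_a))  # top value ~0.98
print(softmax(split_b))  # top value ~0.37, despite much weaker raw logits

# The non-legacy path squashes the raw no_answer logit with a sigmoid
# (scaled down by 8), while the text-answer scores stay softmax-scaled,
# which is the mismatch discussed above.
no_answer_logit = 4.0  # invented value
no_answer_confidence = 1.0 / (1.0 + np.exp(-no_answer_logit / 8))
print(no_answer_confidence)  # ~0.62
```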
Just a quick status from my side. I'm still investigating: my goal is to have a clear view on what this means for:
- Training
- Eval
- Inference in Haystack
Once we have that, let's focus on what needs to be fixed or should be parameterized.
Ok, so here's the result of my analysis, most of it confirms @MichelBartels. However, besides `use_no_answer_legacy_confidence`, the eval and inference results are also affected by `use_confidence_scores`/`use_confidence_scores_for_ranking`. Both default values have changed for eval in #2410:
- Training: Not affected at all, as the order of answers is not used for loss calculation, only the raw logits with CrossEntropyLoss. However, the dev eval during training uses the same logic as `FARMReader.eval`/`eval_on_file`.
- Eval: `use_no_answer_legacy_confidence` only affects ranking/results of `FARMReader.eval`/`eval_on_file`, but only if `use_confidence_scores=True` (`False` was the default before #2410 and I guess also in FARM). Hence, dev eval during training is also affected. #2410 set `use_no_answer_legacy_confidence`'s default to `False` and `use_confidence_scores_for_ranking`'s default to `True`, as this is the behavior that we get from `FARMReader` during inference.
- Inference: Currently we only support `use_no_answer_legacy_confidence=False`, but this could be changed for `FARMReader`. However, this is not possible for `TransformersReader`, as there we do not have access to confidence values. We can, however, set `use_confidence_scores` to `False` in `FARMReader`. With that we only use raw logit scores, which was the same behavior of `FARMReader.eval`/`eval_on_file` before my changes in PR #2410. Unfortunately, we do not pass that parameter correctly down to `Evaluator.eval()` in `eval_on_file`. That's already addressed in #2856.
My suggestion would be to fix the passing of `FARMReader`'s `use_confidence_scores` down to the Evaluator, and with that set to `False` we should get the same results as in FARM.
For inference you can basically choose the param that best fits your data. Setting `use_confidence_scores` to `True` seems to favor no_answers. Setting it to `False` is more aggressive in showing actual answers. At least that's my impression from our training dataset in Tutorial 2 and the eval dataset in Tutorial 5.
I ran `eval_on_file` with `use_confidence_scores=False` on the #2856 branch like this:

```python
from haystack.nodes import FARMReader

reader = FARMReader(
    model_name_or_path='deepset/roberta-base-squad2', use_gpu=True, max_seq_len=386, use_confidence_scores=False
)
metrics = reader.eval_on_file(
    data_dir='/home/tstad/Downloads',
    test_filename='dev-v2.0.json',
)
```
and here are the results:
```
{
    'EM': 78.43847384822708,
    'f1': 82.65766204593291,
    'top_n_accuracy': 97.41430135601786,
    'top_n': 4,
    'EM_text_answer': 75.01686909581646,
    'f1_text_answer': 83.46734505252387,
    'top_n_accuracy_text_answer': 94.82118758434548,
    'top_n_EM_text_answer': 80.9885290148448,
    'top_n_f1_text_answer': 90.49777068800371,
    'Total_text_answer': 5928,
    'EM_no_answer': 81.85029436501262,
    'f1_no_answer': 81.85029436501262,
    'top_n_accuracy_no_answer': 100.0,
    'Total_no_answer': 5945
}
```
@tstadel Nice work! Thanks for the thorough investigation.
> My suggestion would be to fix the passing of `FARMReader`'s `use_confidence_scores` down to the Evaluator, and with that set to `False` we should get the same results as in FARM.

I also agree that we should pass the parameters down to `Evaluator.eval()`. Are you suggesting we should also set the default value of `use_confidence_scores` to `False` for eval and leave it at `True` for inference?
Whatever defaults we decide on, @julian-risch mentioned that we should include docstrings and docs on the website explaining when to set `use_confidence_scores` to either `True` or `False`, which I think sounds like a great idea.
Yes, definitely! We have to mention that setting `use_confidence_scores` affects the ranking of no_answers. Here's what I'd suggest:
- Complete https://github.com/deepset-ai/haystack/pull/2856 by adding a new param `use_confidence_scores_for_ranking` to `eval()` and `eval_on_file()`. This param should be `True` by default for reproducibility. In the docstrings it should be clear that this will be used instead of `FARMReader`'s `use_confidence_scores`. Add to `use_confidence_scores`'s docstring that it affects the ranking of no_answers. @sjrl I can support you on that. (See the sketch after this list.)
- Explain the behavior of `use_confidence_scores` on the website too (probably for @agnieszka-m once our docstrings are ready).
- Create an issue about improving the handling of no_answer scores when using `use_confidence_scores`. This could be solved by improving the no_answer confidence calculation; however, there are cases where the current default values perform better. Sometimes it seems to behave very oddly: if there is one answer with a confidence of >95% and a no_answer with an originally calculated confidence of ~1%, we get a no_answer confidence of over 50%. So the current default values seem to heavily favor no_answer values. Let's investigate whether this is the case and whether this bias towards no_answers could be achieved more easily in another way (e.g. by setting `no_answer_boost` to a different default value) without losing ranking consistency between `True` and `False` of `use_confidence_scores`. (That one is on me.)
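For illustration, this is roughly the call shape the first bullet proposes (hypothetical: `use_confidence_scores_for_ranking` does not exist on `eval_on_file()` at this point):

```python
# 'reader' as initialized earlier in this issue.
# Hypothetical: the proposed ranking override would take precedence over the
# reader-level use_confidence_scores setting for this eval run only.
metrics = reader.eval_on_file(
    data_dir='./data/',
    test_filename='dev-v2.0.json',
    use_confidence_scores_for_ranking=False,  # proposed param, rank by raw logit scores
)
```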
@sjrl WDYT?
Hi @tstadel, I definitely agree with your second and third points. However, I'm a little uncertain about the first suggestion of adding `use_confidence_scores_for_ranking` to `eval()` and `eval_on_file()`.
It seems unintuitive to me that we would then have `use_confidence_scores` to initialize a `FARMReader`, but then also the separate variable `use_confidence_scores_for_ranking` specifically for the `eval` and `eval_on_file` functions. In this situation, the `use_confidence_scores` variable would only affect the `predict` function, which could lead to a situation where `predict` could return different results than `eval` and `eval_on_file`.
So I would prefer the following:
- Remove the variable `use_no_answer_legacy_confidence` from `FARMReader.eval()` and `FARMReader.eval_on_file()` and instead add it to the constructor of `FARMReader`. Then pass the class variables on to the `Evaluator.eval()` function (see the sketch after this list).
- Update this line https://github.com/deepset-ai/haystack/blob/master/haystack/modeling/training/base.py#L308 to pass on the variables `use_confidence_scores` and `use_no_answer_legacy_confidence` from `self.model` if they are present. This way the eval results printed during training are consistent with evaluations performed after training is finished.
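A hypothetical sketch of what the first bullet would mean for users (note: `use_no_answer_legacy_confidence` is not actually a constructor parameter today):

```python
from haystack.nodes import FARMReader

# Hypothetical: both confidence-related settings live on the reader itself,
# and eval()/eval_on_file() would read them from the instance instead of
# taking their own arguments.
reader = FARMReader(
    model_name_or_path='deepset/roberta-base-squad2',
    use_gpu=True,
    max_seq_len=386,
    use_confidence_scores=False,            # existing constructor param
    use_no_answer_legacy_confidence=True,   # hypothetical constructor param proposed here
)
metrics = reader.eval_on_file(
    data_dir='./data/',
    test_filename='dev-v2.0.json',
)
```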
@tstadel WDYT?
@sjrl Adding `use_no_answer_legacy_confidence` to the constructor would introduce more confusion in my opinion. Even if we correctly pass it down to the prediction head, it won't have any effect on the happy path of FARMReader (= inference). So this param is basically useless outside of `eval[_on_file]()` (which are also used during training). So I'd definitely keep it as a function param (and maybe add it to `train()` too).
To streamline the "expert eval of FARMReader" use case further, I would add `use_confidence_scores_for_ranking` to the eval functions, so you always know what you actually get. Currently it's not intuitive that the constructor param `use_confidence_scores` would also affect ranking and thus the eval functions, even if we adjust its docstrings.
Additionally, this would make comparing these settings easier, as you can just call `eval[_on_file]()` twice on the same reader instance without having to instantiate a new reader. Although this might seem a bit cumbersome at first and needs a bit more explanation in the docstrings, this appears to me to be the right compromise here: `eval[_on_file]()` really is the expert path for evaluation.
We actually nudge users towards the more realistic `pipeline.eval()` with the warning at the beginning of `eval[_on_file]()`, because that's the actual performance you get when using the reader in a pipeline. To further improve that real performance, we address it in the issue from point 3.
@sjrl are we good with this plan? Can we close this issue?
After some discussion with @Timoeller, @ju-gu, @mathislucka and @tstadel, we agreed that we would like the `FARMReader.eval` and `FARMReader.eval_on_file` functions to behave the same way as the `FARMReader.predict` function.
So this means we will remove the `use_no_answer_legacy_confidence` and `use_confidence_scores_for_ranking` options from the `eval` and `eval_on_file` functions and instead use the class variables of `FARMReader` to determine these values.
This has been resolved with PR #2856.
To be clear, reproducing the metrics reported on HF only works if `use_confidence_scores` is set to `False` when initializing a `FARMReader` in Haystack.
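For example, using the same setup as the run above (with the dev set downloaded to `./data/`):

```python
from haystack.nodes import FARMReader

# Disable confidence-score ranking so eval uses raw logit scores,
# which reproduces the metrics reported on the HuggingFace model card.
reader = FARMReader(
    model_name_or_path='deepset/roberta-base-squad2',
    use_gpu=True,
    max_seq_len=386,
    use_confidence_scores=False,
)
metrics = reader.eval_on_file(
    data_dir='./data/',
    test_filename='dev-v2.0.json',
)
```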