Reproducing deepset/roberta-base-squad2 metrics on HuggingFace
Describe the bug
This is not really a bug, but an investigation into why the metrics reported for deepset/roberta-base-squad2 on HuggingFace are not reproducible using the `FARMReader.eval_on_file` function in Haystack version 1.6.0.
Metrics on HuggingFace:

```
"exact": 79.87029394424324,
"f1": 82.91251169582613,
"total": 11873,
"HasAns_exact": 77.93522267206478,
"HasAns_f1": 84.02838248389763,
"HasAns_total": 5928,
"NoAns_exact": 81.79983179142137,
"NoAns_f1": 81.79983179142137,
"NoAns_total": 5945
```
Expected behavior

The following code should produce metrics similar to the ones reported on HuggingFace:

```python
from haystack.nodes import FARMReader

reader = FARMReader(
    model_name_or_path='deepset/roberta-base-squad2', use_gpu=True, max_seq_len=386
)
metrics = reader.eval_on_file(
    data_dir='./data/',
    test_filename='dev-v2.0.json',
    device='cuda',
)
```
Data for eval is available here (official SQuAD2 dev set): https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
Currently, Haystack v1.6.0 produces

```
"v1.6.0": {
    "EM": 76.13913922344815,
    "f1": 78.1999299751871,
    "top_n_accuracy": 97.45641371178304,
    "top_n": 4,
    "EM_text_answer": 61.52159244264508,
    "f1_text_answer": 65.64908377115323,
    "top_n_accuracy_text_answer": 94.90553306342781,
    "top_n_EM_text_answer": 81.34278002699055,
    "top_n_f1_text_answer": 90.69119695344651,
    "Total_text_answer": 5928,
    "EM_no_answer": 90.71488645920942,
    "f1_no_answer": 90.71488645920942,
    "top_n_accuracy_no_answer": 100.0,
    "Total_no_answer": 5945
},
```
and we can immediately see that the EM (79.9 vs. 76.1) and F1 (82.9 vs. 78.2) scores are lower than expected.
If we roll back to Haystack v1.3.0, we get the following metrics

```
"v1.3.0": {
    "EM": 0.7841320643476796,
    "f1": 0.826529420882385,
    "top_n_accuracy": 0.9741430135601785
}
```
which much more closely match the ones reported on HuggingFace.
Identifying the change in Haystack
I believe the code change that caused this difference in results was introduced by the solution to issue https://github.com/deepset-ai/haystack/issues/2410. To check this, I updated `FARMReader.eval_on_file` to accept the argument `use_no_answer_legacy_confidence` (which was originally added in https://github.com/deepset-ai/haystack/pull/2414/files for `FARMReader.eval` but not for `FARMReader.eval_on_file`) in the branch https://github.com/deepset-ai/haystack/tree/issue_farmreader_eval and ran the following code:
```python
results = reader.eval_on_file(
    data_dir='./data/',
    test_filename='dev-v2.0.json',
    device='cuda',
    use_no_answer_legacy_confidence=True
)
```
which produced

```
{'EM': 78.37951655015581,
 'f1': 82.61101723722291,
 'top_n_accuracy': 97.41430135601786,
 'top_n': 4,
 'EM_text_answer': 75.0,
 'f1_text_answer': 83.47513624452559,
 'top_n_accuracy_text_answer': 94.82118758434548,
 'top_n_EM_text_answer': 81.00539811066126,
 'top_n_f1_text_answer': 90.50564293271806,
 'Total_text_answer': 5928,
 'EM_no_answer': 81.74936921783012,
 'f1_no_answer': 81.74936921783012,
 'top_n_accuracy_no_answer': 100.0,
 'Total_no_answer': 5945}
```
The EM and F1 scores now show the expected values.
Solution
I'm not entirely sure how this should be resolved, but it seems the `no_answer` logic used in eval should probably be somehow linked to the models that were trained using that logic.
- ~~[ ] Consider saving the value of `use_no_answer_legacy_confidence` with the model metadata so that when the model is reloaded it uses the same `no_answer` confidence logic as it was trained with.~~ This would not solve the issue, as discussed below.
FAQ Check
- [x] Have you had a look at our new FAQ page?
System:
- OS: ubuntu
- GPU/CPU: GPU
- Haystack version (commit or version number): v1.6.0
@tstadel @Timoeller Would it be possible to save the value of `use_no_answer_legacy_confidence` with the model metadata so that when the model is reloaded it uses the same `no_answer` confidence logic as it was trained with?
@sjrl I think it's not really solving it as this would only fix the `eval()` and `eval_on_file()` methods but not `FARMReader`'s `predict()`.
@tstadel Yes, my PR is incomplete. I just opened it so the work would not need to be redone. So feel free to add commits, or if it's easier we can close it as well.
> I think it's not really solving it as this would only fix the `eval()` and `eval_on_file()` methods but not `FARMReader`'s `predict()`.

Oh sorry, I misinterpreted what you meant. I'm still not too familiar with the source code of `FARMReader` 😅 I agree the proposed solution wouldn't fix the `predict()` method.
I think saving the value of `use_no_answer_legacy_confidence` doesn't work for all cases because it's not set during training and is only relevant during postprocessing.
The issue it addresses is that we can't compare the probabilities of different documents and splits (documents whose sequence length is longer than the reader's `max_seq_len`) because they are created using softmax, which normalizes the scores separately for each split/document. This issue is not really present during eval because we have just one document, which is usually small enough that it doesn't need to be split. This means that for eval `use_no_answer_legacy_confidence=True` works fine, while it probably does not during inference.
However, just applying a sigmoid to the no_answer scores, which is what happens when `use_no_answer_legacy_confidence` is `False`, also creates issues because all other values are still scaled by softmax. This is mitigated a bit by dividing by 8, but that's not an ideal solution.
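To make the comparability issue concrete, here is a minimal numeric sketch (this is not Haystack's code; the logit values are invented, and only the divide-by-8 sigmoid scaling follows the description above):

```python
import numpy as np

def softmax(logits):
    exps = np.exp(logits - np.max(logits))
    return exps / exps.sum()

# Two splits of one long document, scored independently.
# Softmax normalizes each split on its own, so the resulting
# "confidences" are not comparable across splits.
split_a = np.array([6.0, 2.0, 0.5])   # one clearly dominant span
split_b = np.array([1.0, 0.9, 0.8])   # three weak, near-identical spans
print(softmax(split_a))  # top value ~0.98
print(softmax(split_b))  # top value ~0.37, despite much weaker raw logits

# The non-legacy path squashes the raw no_answer logit with a sigmoid
# (scaled down by 8), while the text-answer scores stay softmax-scaled,
# which is the mismatch discussed above.
no_answer_logit = 4.0  # invented value
no_answer_confidence = 1.0 / (1.0 + np.exp(-no_answer_logit / 8))
print(no_answer_confidence)  # ~0.62
```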
Just a quick status from my side. I'm still investigating: my goal is to have a clear view on what this means for:
- Training
- Eval
- Inference in Haystack
Once we have that, let's focus on what needs to be fixed or should be parameterized.
Ok, so here's the result of my analysis, most of it confirms @MichelBartels. However, besides `use_no_answer_legacy_confidence`, the eval and inference results are also affected by `use_confidence_scores`/`use_confidence_scores_for_ranking`. Both default values have changed for eval in #2410:
- Training: Not affected at all, as the order of answers is not used for loss calculation, only the raw logits with CrossEntropyLoss. However, the dev eval during training uses the same logic as `FARMReader.eval`/`eval_on_file`.
- Eval: `use_no_answer_legacy_confidence` only affects ranking/results of `FARMReader.eval`/`eval_on_file`, but only if `use_confidence_scores=True` (`False` was the default before #2410 and I guess also in FARM). Hence, dev eval during training is also affected. #2410 set `use_no_answer_legacy_confidence`'s default to `False` and `use_confidence_scores_for_ranking`'s default to `True`, as this is the behavior that we get from `FARMReader` during inference.
- Inference: Currently we only support `use_no_answer_legacy_confidence=False`, but this could be changed for `FARMReader`. However, this is not possible for `TransformersReader`, as there we do not have access to confidence values. We can, however, set `use_confidence_scores` to `False` in `FARMReader`. With that we only use raw logit scores, which was the same behavior of `FARMReader.eval`/`eval_on_file` before my changes in PR #2410. Unfortunately, we do not pass that parameter correctly down to `Evaluator.eval()` in `eval_on_file`. That's already addressed in #2856.
My suggestion would be to fix the passing of `FARMReader`'s `use_confidence_scores` down to the Evaluator, and with that set to `False` we should get the same results as in FARM.
For inference you can basically choose the param that best fits your data. Setting `use_confidence_scores` to `True` seems to favor no_answers. Setting it to `False` is more aggressive in showing actual answers. At least that's my impression from our training dataset in Tutorial 2 and the eval dataset in Tutorial 5.
I ran `eval_on_file` with `use_confidence_scores=False` on the #2856 branch like this:

```python
from haystack.nodes import FARMReader

reader = FARMReader(
    model_name_or_path='deepset/roberta-base-squad2', use_gpu=True, max_seq_len=386, use_confidence_scores=False
)
metrics = reader.eval_on_file(
    data_dir='/home/tstad/Downloads',
    test_filename='dev-v2.0.json',
)
```
and here are the results:
```
{
    'EM': 78.43847384822708,
    'f1': 82.65766204593291,
    'top_n_accuracy': 97.41430135601786,
    'top_n': 4,
    'EM_text_answer': 75.01686909581646,
    'f1_text_answer': 83.46734505252387,
    'top_n_accuracy_text_answer': 94.82118758434548,
    'top_n_EM_text_answer': 80.9885290148448,
    'top_n_f1_text_answer': 90.49777068800371,
    'Total_text_answer': 5928,
    'EM_no_answer': 81.85029436501262,
    'f1_no_answer': 81.85029436501262,
    'top_n_accuracy_no_answer': 100.0,
    'Total_no_answer': 5945
}
```
@tstadel Nice work! Thanks for the thorough investigation.
> My suggestion would be to fix the passing of `FARMReader`'s `use_confidence_scores` down to the Evaluator, and with that set to `False` we should get the same results as in FARM.

I also agree that we should pass the parameters down to `Evaluator.eval()`. Are you suggesting we should also set the default value of `use_confidence_scores` to `False` for eval and leave it at `True` for inference?
Whatever defaults we decide on, @julian-risch mentioned that we should include docstrings and docs on the website explaining when to set `use_confidence_scores` to either `True` or `False`, which I think sounds like a great idea.
Yes, definitely! We have to mention that setting `use_confidence_scores` affects the ranking of no_answers. Here's what I'd suggest:
- Complete https://github.com/deepset-ai/haystack/pull/2856 by adding a new param `use_confidence_scores_for_ranking` to `eval()` and `eval_on_file()`. This param should be `True` by default for reproducibility. In the docstrings it should be clear that this will be used instead of `FARMReader`'s `use_confidence_scores`. Add to `use_confidence_scores`'s docstring that it affects the ranking of no_answers. @sjrl I can support you on that. (See the sketch after this list.)
- Explain the behavior of `use_confidence_scores` on the website too (probably for @agnieszka-m once our docstrings are ready).
- Create an issue about improving the handling of no_answer scores when using `use_confidence_scores`. This could be solved by improving the no_answer confidence calculation; however, there are cases where the current default values perform better. Sometimes it seems to behave very oddly: if there is one answer with a confidence of >95% and a no_answer with an originally calculated confidence of ~1%, we get a no_answer confidence of over 50%. So the current default values seem to heavily favor no_answer values. Let's investigate whether this is the case and whether this bias towards no_answers could be achieved more easily in another way (e.g. by setting `no_answer_boost` to a different default value) without losing ranking consistency between `True` and `False` of `use_confidence_scores`. (That one is on me.)
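For illustration, this is roughly the call shape the first bullet proposes (hypothetical: `use_confidence_scores_for_ranking` does not exist on `eval_on_file()` at this point):

```python
# 'reader' as initialized earlier in this issue.
# Hypothetical: the proposed ranking override would take precedence over the
# reader-level use_confidence_scores setting for this eval run only.
metrics = reader.eval_on_file(
    data_dir='./data/',
    test_filename='dev-v2.0.json',
    use_confidence_scores_for_ranking=False,  # proposed param, rank by raw logit scores
)
```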
@sjrl WDYT?
Hi @tstadel, I definitely agree with your second and third points. However, I'm a little uncertain about the first suggestion of adding `use_confidence_scores_for_ranking` to `eval()` and `eval_on_file()`.
It seems unintuitive to me that we would then have `use_confidence_scores` to initialize a `FARMReader`, but then also the separate variable `use_confidence_scores_for_ranking` specifically for the `eval` and `eval_on_file` functions. In this situation, the `use_confidence_scores` variable would only affect the `predict` function, which could lead to a situation where `predict` could return different results than `eval` and `eval_on_file`.
So I would prefer the following:
- Remove the variable `use_no_answer_legacy_confidence` from `FARMReader.eval()` and `FARMReader.eval_on_file()` and instead add it to the constructor of `FARMReader`. Then pass the class variables on to the `Evaluator.eval()` function (see the sketch after this list).
- Update this line https://github.com/deepset-ai/haystack/blob/master/haystack/modeling/training/base.py#L308 to pass on the variables `use_confidence_scores` and `use_no_answer_legacy_confidence` from `self.model` if they are present. This way the eval results printed during training are consistent with evaluations performed after training is finished.
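A hypothetical sketch of what the first bullet would mean for users (note: `use_no_answer_legacy_confidence` is not actually a constructor parameter today):

```python
from haystack.nodes import FARMReader

# Hypothetical: both confidence-related settings live on the reader itself,
# and eval()/eval_on_file() would read them from the instance instead of
# taking their own arguments.
reader = FARMReader(
    model_name_or_path='deepset/roberta-base-squad2',
    use_gpu=True,
    max_seq_len=386,
    use_confidence_scores=False,            # existing constructor param
    use_no_answer_legacy_confidence=True,   # hypothetical constructor param proposed here
)
metrics = reader.eval_on_file(
    data_dir='./data/',
    test_filename='dev-v2.0.json',
)
```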
@tstadel WDYT?
@sjrl Adding `use_no_answer_legacy_confidence` to the constructor would introduce more confusion in my opinion. Even if we correctly pass it down to the prediction head, it won't have any effect on the happy path of FARMReader (= inference). So this param is basically useless outside of `eval[_on_file]()` (which are also used during training). So I'd definitely keep it as a function param (and maybe add it to `train()` too).
To streamline the "expert eval of FARMReader" use case further, I would add `use_confidence_scores_for_ranking` to the eval functions, so you always know what you actually get. Currently it's not intuitive that the constructor param `use_confidence_scores` would also affect ranking and thus the eval functions, even if we adjust its docstrings.
Additionally, this would make comparing these settings easier, as you can just call `eval[_on_file]()` twice on the same reader instance without having to instantiate a new reader. Although this might seem a bit cumbersome at first and needs a bit more explanation in the docstrings, this appears to me to be the right compromise here: `eval[_on_file]()` really is the expert path for evaluation.
We actually nudge users towards the more realistic `pipeline.eval()` with the warning at the beginning of `eval[_on_file]()`, because that's the actual performance you get when using the reader in a pipeline. To further improve that real performance, we address it in the issue from point 3.
@sjrl are we good with this plan? Can we close this issue?
After some discussion with @Timoeller, @ju-gu, @mathislucka and @tstadel, we agreed that we would like the `FARMReader.eval` and `FARMReader.eval_on_file` functions to behave the same way as the `FARMReader.predict` function.
So this means we will remove the `use_no_answer_legacy_confidence` and `use_confidence_scores_for_ranking` options from the `eval` and `eval_on_file` functions and instead use the class variables of `FARMReader` to determine these values.
This has been resolved with PR #2856.
To be clear, reproducing the metrics reported on HF only works if `use_confidence_scores` is set to `False` when initializing a `FARMReader` in Haystack.
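For example, using the same setup as the run above (with the dev set downloaded to `./data/`):

```python
from haystack.nodes import FARMReader

# Disable confidence-score ranking so eval uses raw logit scores,
# which reproduces the metrics reported on the HuggingFace model card.
reader = FARMReader(
    model_name_or_path='deepset/roberta-base-squad2',
    use_gpu=True,
    max_seq_len=386,
    use_confidence_scores=False,
)
metrics = reader.eval_on_file(
    data_dir='./data/',
    test_filename='dev-v2.0.json',
)
```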