
Return normalized scores

Open lewtun opened this issue 4 years ago • 4 comments

Describe the solution you'd like Currently, the result from

cdqa_pipeline.predict(query='your question', n_predictions=N)

is a list of tuples of (answer, title, paragraph, score/logit), where score is calculated as a linear interpolation between the retriever and reader scores. My understanding is that score can take any value on (-inf, inf), but it would be useful if it were normalised to, say, [0, 1].

Would it be possible to have a flag that returns normalized scores? One simple idea would be to apply a sigmoid to the un-normalised scores.

Describe alternatives you've considered Currently, I first generate predictions and then pass them through a sigmoid.
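For concreteness, this post-processing step can be sketched as follows. The tuple format matches the one described above; the prediction values themselves are hypothetical, not real cdQA output.

```python
import math

def sigmoid(x: float) -> float:
    """Map an unbounded score to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical predictions in cdQA's (answer, title, paragraph, score) format.
predictions = [
    ("Paris", "France", "Paris is the capital of France.", 8.3),
    ("Lyon", "France", "Lyon is a large city in France.", -1.2),
]

# Apply the sigmoid to the raw score, keeping everything else unchanged.
normalized = [
    (answer, title, paragraph, sigmoid(score))
    for answer, title, paragraph, score in predictions
]
```

Note that the sigmoid is monotonic, so it preserves the ranking of the answers; it only rescales the scores into (0, 1).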

Additional context I would be happy to have a go at implementing this myself.

lewtun avatar Oct 15 '19 14:10 lewtun

Hi @lewtun ,

Actually, it would be handy to have this feature. I am just asking myself if it makes sense, as I don't know if applying a sigmoid to scores taken from differents paragraphs has a real meaning. For instance, in order to compare answers between different paragraphs we should be using the raw logits from BERT instead of softmax outputs.

But I THINK that it might be ok to do it in the last step of the pipeline, i.e. when comparing best answers between different paragraphs with an interpolation between the retriever and reader scores.

andrelmfarias avatar Oct 15 '19 15:10 andrelmfarias

Hi @andrelmfarias,

Thanks for reminding me about the distinction between raw logits / softmax outputs from BERT: does this mean that the score calculated in bertqa_sklearn.py

best_dict["final_score"] = (1 - retriever_score_weight) * (best_dict["start_logit"] + best_dict["end_logit"]) + retriever_score_weight * best_dict["retriever_score"]

is actually bounded on some finite interval, i.e. not on (-inf, inf) as I originally claimed? In other words, what is the range for best_dict["final_score"]?
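To make the question concrete, here is the interpolation from bertqa_sklearn.py written out as a standalone function. The weight value of 0.35 is only illustrative, not necessarily cdQA's default.

```python
def final_score(start_logit: float,
                end_logit: float,
                retriever_score: float,
                retriever_score_weight: float = 0.35) -> float:
    """Weighted sum of the reader logits (unbounded in theory) and the
    BM25 retriever score (bounded and non-negative by design)."""
    return ((1 - retriever_score_weight) * (start_logit + end_logit)
            + retriever_score_weight * retriever_score)

# Example: 0.65 * (5.0 + 4.0) + 0.35 * 12.0 = 10.05
score = final_score(5.0, 4.0, 12.0)
```

Since the logit terms are unbounded, the function as a whole is unbounded, which is the crux of the question below.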

I agree that applying a sigmoid to the softmax outputs does not make sense, so applying the desired transformation at the end of the pipeline is probably best left to the user.

lewtun avatar Oct 16 '19 07:10 lewtun

is actually bounded on some finite interval, i.e. not on (-inf, inf) as I originally claimed?

Actually, your original claim is theoretically true: the final score is an interpolation between the BM25 score (i.e. the default Retriever score), which is bounded by design, and the Reader score, which is the sum of the raw logits of the start and end tokens. In theory, logits are unbounded, so the final score is too.

But in practice, I have never seen the logits overflow with pre-trained BERT.

I honestly think we can add this feature to return final scores normalized between the best answers of different paragraphs by a sigmoid. We can leave the option to the user.

But I personally prefer to use the raw final score for comparison though. Here's an extreme example that shows why using the softmax at the end can be misleading:

Let's say I want to find a score threshold for accepting or rejecting answers, I apply this softmax normalization at the end, and N=10. Suppose a query Q1 has an exact answer, and that answer is found in all 10 paragraphs I retrieved. In this case, all the softmax probabilities can be close to 0.1, and if my threshold is, for example, 0.2, I will reject every one of them. Now suppose a query Q2 has no exact answer, but one of the candidate answers obtained a score much higher than the others (a sort of "almost correct" answer). This can lead to a fairly high softmax score (say 0.6), and I will accept it. So by thresholding the softmax output I can reject good answers and accept bad ones. I know this is an extreme case that will rarely happen, but it shows why I prefer the raw scores for comparison.
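The two scenarios above can be reproduced with toy numbers (these are illustrative, not taken from any real model output):

```python
import math

def softmax(scores):
    """Normalize a list of raw scores into probabilities summing to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Q1: the exact answer appears in all 10 paragraphs, so the raw scores
# are near-identical and the softmax spreads mass evenly at 0.1 each.
q1 = softmax([9.0] * 10)

# Q2: no exact answer, but one "almost correct" candidate dominates,
# so its softmax probability is well above a 0.2 threshold.
q2 = softmax([6.0, 4.3, 4.3, 4.3, 4.3])
```

With a threshold of 0.2, every (correct) Q1 answer is rejected at probability 0.1, while the (wrong) top Q2 answer is accepted, which is exactly the failure mode described above.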

Anyway, it's not a huge deal if we add this feature, so if you would like you can go for it. I will be waiting for your PR 😊

andrelmfarias avatar Oct 16 '19 08:10 andrelmfarias

Thanks for the clarification @andrelmfarias - your softmax example was very instructive!

I'll implement the normalisation and submit a PR shortly.

lewtun avatar Oct 16 '19 08:10 lewtun