cdQA
Still getting the wrong answer
Hello, we have used your cdQA to test our SQuAD v2.0 dataset. We get the correct paragraphs, but then we get the wrong answer all the time. We have checked the Retriever; you have TF-IDF and BM25. The question is: how can we improve the accuracy so that we get the correct answer? Thank you. Best Regards, Jonathan
Hello,
I am afraid cdQA does not perform well with a dataset structured like SQuAD 2.0, where some questions have no answers. I haven't tested this yet to confirm my hypothesis, but I have some reasons to think it wouldn't work:
Let's say you use a Reader (e.g. BERT) fine-tuned on SQuAD 2.0, and it correctly predicts no_answer given a question and a paragraph that contains no answer to the question. Now let's use this Reader in the whole cdQA pipeline. When a question is sent to the pipeline, the Retriever will select, say, 20 paragraphs that might contain the answer. Suppose only one of these paragraphs actually contains the correct answer, but the Reader (BERT) is not very confident about it (it doesn't output a very high probability). Since the other 19 paragraphs don't contain the answer, the Reader will output no_answer for each of them with a high score. The Ranker will then receive 19 no_answer predictions with high probability and the single correct answer with a lower probability, so it will output no_answer most of the time.
I am not saying the pipeline will always behave like that, but think about the probabilities here: with 1 correct answer competing against 19 high-probability no_answer predictions, we will often get bad results.
So, how can we calibrate the pipeline so that, when we ask a question that cannot be answered with the documents in the database, it outputs no_answer?
One idea I have not had time to test yet is to use a Reader that is not fine-tuned on SQuAD 2.0 (e.g. the models we made available with this library) and, by testing one set of questions that have answers and another set that don't, find a threshold for the final score output by the pipeline (it's the 4th term in the tuple returned by cdqa_pipeline.predict()). Once you have this threshold, you can add an if-else statement every time you predict on a question: if score > threshold, output the answer; else, output "there is no answer".
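A minimal sketch of that if-else, assuming an already-fitted cdqa_pipeline and a threshold you calibrated yourself on your own answerable/unanswerable question sets (the 0.5 value and the query below are placeholders, not recommendations):

```python
SCORE_THRESHOLD = 0.5  # assumption: calibrate this on your own question sets

# predict() returns a tuple whose 4th term is the final score
answer, title, paragraph, score = cdqa_pipeline.predict(query="your question here")

if score > SCORE_THRESHOLD:
    print(answer)
else:
    print("there is no answer")
```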
Hello, we only use the context part of the SQuAD v2.0 dataset, and we are trying to build our own QA system.
@laifuchicago,
Sorry, but I am not sure I understand exactly what you are doing; could you be more precise? Can you describe the steps you are following?
Hello, we just loaded the paragraphs from SQuAD v2 using the pdf_converter.
Thanks,
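For reference, that step would look roughly like this (a minimal sketch, assuming the SQuAD contexts were first exported as PDF files in a local directory; pdf_converter and directory_path are from cdQA's converters module):

```python
from cdqa.utils.converters import pdf_converter

# Build a dataframe with one row per document: a 'title' column and a
# 'paragraphs' column holding a list of paragraph strings.
df = pdf_converter(directory_path="./data/pdf/")
df.head()  # first 5 rows
```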
Could you please show the dataframe output by the pdf_converter? Just do a .head(); the first 5 rows are enough.
Could you also please show an example of an incorrect answer? (The question and the output, please)
That's a bit weird, I have just tried to do the same on the train set of SQuAD 2.0, and I got:
There are two things you can do to try to mitigate that: vary the parameter retriever_score_weight of the .predict() method (0.35 by default) and see if you get better results, or set the parameter n_predictions to more than 1 to get a list of the n most probable results, like so:
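A minimal sketch of both options, assuming an already-fitted cdqa_pipeline (the parameter names are the ones mentioned above; the exact return format may vary with your cdQA version, and the query and values are placeholders):

```python
# Option 1: change how much the Retriever's score counts relative to the
# Reader's score. The default is 0.35; 0.5 is just a value to try.
prediction = cdqa_pipeline.predict(query="your question here",
                                   retriever_score_weight=0.5)

# Option 2: get the n most probable answers instead of only the top one.
predictions = cdqa_pipeline.predict(query="your question here",
                                    n_predictions=5)
for answer, title, paragraph, score in predictions:
    print(score, answer)
```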
But please be aware that the system is not perfect and is not always going to give the best answers. There are still several improvements to be made, but we are really dependent on the Reader's performance, which is not perfect (c. 81% EM in the case of BERT base) even if you send it only the paragraph with the correct answer.
Thank you, we will test on Monday. We tried 20 questions, and 18 of them were answered correctly.
90% is actually a very high performance...
The system is expected to answer fewer than 80% of questions correctly.
To Andre:
We are now testing the SQuAD v2 data in CSV, but we get some errors like the ones below. Can you help us?
Thank you!
@laifuchicago Not sure if this will help, but I also had the same error and this is how I solved it (see the sketch after this list):
- I changed literal_eval to eval, since the input is trusted.
- The data in the dataframe should be a list data structure.
- Every paragraph should be enclosed in double quotes followed by single quotes, i.e. " followed by '
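A minimal sketch of that fix, assuming the SQuAD v2 data was saved to a CSV (the file name is a placeholder) whose paragraphs column holds stringified lists:

```python
import pandas as pd

df = pd.read_csv("squad_v2.csv")

# The CSV stores the list of paragraphs as a string, e.g. "['p1', 'p2']";
# parse it back into a real Python list. eval() is used instead of
# ast.literal_eval, as described above -- only safe on trusted input.
df["paragraphs"] = df["paragraphs"].apply(eval)
```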