cdQA
Still getting the wrong answer
Hello, we have used your cdQA to test our SQuAD v2.0 dataset. We get the correct paragraphs, but then we get the wrong answer all the time. We have checked the Retriever; you have TF-IDF and BM25. The question is: how can we improve the accuracy so that we get the correct answer? Thank you. Best Regards, Jonathan
Hello,
I am afraid cdQA does not perform well with a dataset structured like SQuAD 2.0, where some questions have no answers. I haven't tested this yet to confirm my hypothesis, but I have some reasons to think it wouldn't work:
Let's say you use a Reader (e.g. BERT) fine-tuned on SQuAD 2.0, and it correctly predicts no_answer given a question and a paragraph that contains no answer to the question. Now let's use this Reader in the whole cdQA pipeline. When a question is sent to the pipeline, the Retriever will select, say, 20 paragraphs that might contain the answer. Suppose only one of these paragraphs actually contains the correct answer, but the Reader (BERT) is not very confident about it (it doesn't output a very high probability). Since the other 19 paragraphs don't contain the answer, the Reader will output no_answer for each of them with a high score. The Ranker will then receive 19 no_answer predictions with high probability and the single correct answer with a lower probability, so it will output no_answer most of the time.
I am not saying the pipeline will always behave like that, but think about the probabilities here: with 1 correct answer competing against 19 high-probability no_answer predictions, we will often get bad results.
So, how can we calibrate the pipeline so that, when we ask a question that cannot be answered with the documents in the database, it outputs no_answer?
One idea I have not had time to test yet is to use a Reader that is not fine-tuned on SQuAD 2.0 (e.g. the models we made available with this library) and, by testing one set of questions that have answers and another set that don't, find a threshold for the final score output by the pipeline (it's the 4th term in the tuple returned by cdqa_pipeline.predict()). Once you have this threshold, you can add an if-else statement every time you predict on a question: if score > threshold, output the answer; else, output "there is no answer".
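A minimal sketch of that if-else, assuming an already-fitted cdqa_pipeline and a threshold you calibrated yourself on your own answerable/unanswerable question sets (the 0.5 value and the query below are placeholders, not recommendations):

```python
SCORE_THRESHOLD = 0.5  # assumption: calibrate this on your own question sets

# predict() returns a tuple whose 4th term is the final score
answer, title, paragraph, score = cdqa_pipeline.predict(query="your question here")

if score > SCORE_THRESHOLD:
    print(answer)
else:
    print("there is no answer")
```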
Hello, we only use the context part of the SQuAD v2.0 dataset, and we are trying to build our own QA system.
@laifuchicago,
Sorry, but I am not sure I understand exactly what you are doing; could you be more precise? Can you describe the steps you are following?
Hello, we just loaded the paragraphs from SQuAD v2 using the pdf_converter.
Thanks,
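For reference, that step would look roughly like this (a minimal sketch, assuming the SQuAD contexts were first exported as PDF files in a local directory; pdf_converter and directory_path are from cdQA's converters module):

```python
from cdqa.utils.converters import pdf_converter

# Build a dataframe with one row per document: a 'title' column and a
# 'paragraphs' column holding a list of paragraph strings.
df = pdf_converter(directory_path="./data/pdf/")
df.head()  # first 5 rows
```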
Could you please show the dataframe output by the pdf_converter? Just do a .head(); the first 5 rows are enough.
Could you also please show an example of an incorrect answer? (The question and the output, please)
That's a bit weird, I have just tried to do the same on the train set of SQuAD 2.0, and I got:
There are two things you can do to try to mitigate that: vary the parameter retriever_score_weight of the .predict() method (0.35 by default) and see if you get better results, or set the parameter n_predictions to more than 1 to get a list of the n most probable results, like so:
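A minimal sketch of both options, assuming an already-fitted cdqa_pipeline (the parameter names are the ones mentioned above; the exact return format may vary with your cdQA version, and the query and values are placeholders):

```python
# Option 1: change how much the Retriever's score counts relative to the
# Reader's score. The default is 0.35; 0.5 is just a value to try.
prediction = cdqa_pipeline.predict(query="your question here",
                                   retriever_score_weight=0.5)

# Option 2: get the n most probable answers instead of only the top one.
predictions = cdqa_pipeline.predict(query="your question here",
                                    n_predictions=5)
for answer, title, paragraph, score in predictions:
    print(score, answer)
```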
But please be aware that the system is not perfect and is not always going to give the best answers. There are still several improvements to be made, but we are really dependent on the Reader's performance, which is not perfect (c. 81% EM in the case of BERT base) even if you send it only the paragraph with the correct answer.
Thank you, we will test on Monday. We tried 20 questions, and 18 of them were answered correctly.
90% is actually a very high performance...
The system is expected to answer fewer than 80% of questions correctly.
To Andre:
We are now testing the SQuAD v2 data in CSV, but we get some errors like the ones below. Can you help us?
Thank you!
@laifuchicago Not sure if this will help, but I also had the same error and this is how I solved it (see the sketch after this list):
- I changed literal_eval to eval, since the input is trusted.
- The data in the dataframe should be a list data structure.
- Every paragraph should be enclosed in double quotes followed by single quotes, i.e. " followed by '
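A minimal sketch of that fix, assuming the SQuAD v2 data was saved to a CSV (the file name is a placeholder) whose paragraphs column holds stringified lists:

```python
import pandas as pd

df = pd.read_csv("squad_v2.csv")

# The CSV stores the list of paragraphs as a string, e.g. "['p1', 'p2']";
# parse it back into a real Python list. eval() is used instead of
# ast.literal_eval, as described above -- only safe on trusted input.
df["paragraphs"] = df["paragraphs"].apply(eval)
```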