paper-qa
Eval paper-qa 2 on lit qa 2
Hello, thanks for the great work! I was looking for a way to evaluate PaperQA2 on LitQA2 and was wondering if you can help me with a couple of queries:

i) Assuming I am able to get the PDFs from the DOIs in the dataset into a local folder, how do I go about at least running PaperQA2 end to end specifically on the LitQA2 multiple-choice questions?

ii) Are the answers from PaperQA2 and other systems in the paper on the LitQA2 questions available publicly?

iii) Lastly, the human and non-RAG results on LitQA2 in the LAB-Bench paper seem different than in PaperQA2... I am not sure if I am missing something, but can you please say a bit more on this?
Thanks!
Hey @amanpreet692! 👋 I'm here to help you with any bugs, questions, or becoming a contributor. Let's squash those bugs together!
To evaluate PaperQA2 on Lit-QA 2, here are the steps and answers to your queries:
i) Running PaperQA2 End-to-End on Lit-QA 2 Multiple Choice Questions
- Install PaperQA2:

  ```bash
  pip install paper-qa
  ```

- Prepare your local dataset:
  - Ensure you have the PDFs from the DOIs in the Lit-QA 2 dataset in a local folder.

- Run PaperQA2:
  - Navigate to your directory containing the PDFs:

    ```bash
    cd my_papers
    ```

  - Use the `pqa` command to ask a question:

    ```bash
    pqa ask 'How can carbon nanotubes be manufactured at a large scale?'
    ```
This process will index your local PDF files, gather the necessary metadata, search over that index, break the files into chunked evidence contexts, rank them, and ultimately generate an answer [1].
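If you prefer a scripted run over the CLI, here is a minimal sketch using the Python API; it assumes the `ask`/`Settings` interface shown in the paper-qa v5 README, LLM credentials (e.g. `OPENAI_API_KEY`) in your environment, and a local `my_papers` folder:

```python
# Minimal sketch of an end-to-end PaperQA2 query over a local PDF folder.
# Assumes the `ask`/`Settings` interface from the paper-qa v5 README and
# that LLM credentials (e.g. OPENAI_API_KEY) are set in the environment.
from paperqa import Settings, ask

response = ask(
    "How can carbon nanotubes be manufactured at a large scale?",
    settings=Settings(paper_directory="my_papers"),
)
print(response)  # the response object's attribute layout varies by version
```

Swapping in each LitQA2 multiple-choice prompt (question plus options) as the question string would give one end-to-end run per item.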
ii) Availability of Answers from PaperQA2 and Other Systems on Lit-QA 2 Questions
I wasn't able to find an answer within the repository, but the answer may be available elsewhere or I could have missed it.
iii) Differences in Results Between Lab Bench Paper and PaperQA2
The differences in results between the Lab Bench paper and PaperQA2 on Lit-QA 2 could be due to several factors, including different evaluation metrics, datasets, or configurations used in the experiments. Without more specific details, it's challenging to pinpoint the exact reasons for the discrepancies. It might be helpful to review the methodologies and configurations used in both evaluations to understand the differences better.
To continue talking to Dosu, mention @dosu.
Hello @amanpreet692 thanks for the questions.
i) Assuming I am able to get the PDFs from the DOIs in the dataset into a local folder, how do I go about at least running PaperQA2 end to end specifically on the LitQA2 multiple-choice questions?
We have a bunch of tickets here on making some docs for end-to-end usage of the code. For now, you can look at the unit tests for direct access to the code.
ii) Are the answers from paper-qa2 and other systems in the paper on lit-qa 2 questions available publicly?
Yeah they are from the LAB-Bench paper, whose data is here: https://github.com/Future-House/lab-bench and https://huggingface.co/datasets/futurehouse/lab-bench/viewer/LitQA2
Let's have part of resolving this ticket be to document this better in the README.
iii) Lastly, the human and non-RAG results on LitQA2 in the LAB-Bench paper seem different than in PaperQA2... I am not sure if I am missing something, but can you please say a bit more on this?
Can you clarify what differences you're observing?
Thanks for the prompt reply! Some follow-ups, please:
We have a bunch of tickets here on making some docs for end-to-end usage of the code. For now, you can look at the unit tests for direct access to the code.
Could you kindly point me to a relevant test case?
Yeah they are from the LAB-Bench paper, whose data is here: https://github.com/Future-House/lab-bench and https://huggingface.co/datasets/futurehouse/lab-bench/viewer/LitQA2
Thanks! However, the dataset only consists of the ground-truth answers, if I am not mistaken? I was more interested in comparing another RAG-based QA system with PaperQA2; since not all 248 questions are public and the paper reports the cumulative score, I'd really appreciate it if you have any pointers for the same.
Can you clarify what differences you're observing?
Sure! One instance would be that human accuracy in Fig. 1B of the PaperQA2 paper is reported as 63.1, whereas it is reported as 70 in Table 2 of the LAB-Bench paper's appendix.
Thanks again for the great work!
Hello @amanpreet692 thanks again for the good questions.
We've just updated the LAB-Bench docs to include reasoning for the holdout subset: https://github.com/Future-House/LAB-Bench/commit/fc87dfdb53e05726f5d7e6f4d02e1392457acde1
Thanks! However, the dataset only consists of the ground-truth answers, if I am not mistaken?
The dataset contains questions, distractors, and answers. So if you want to make a holdout subset yourself, just filter out the answers. Am I missing something here?
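If it helps, here is a minimal sketch of that filtering with the Hugging Face `datasets` package; the `train` split and the `ideal` answer column are assumptions about the LAB-Bench schema, so verify them against the dataset viewer linked above:

```python
# Sketch: load the public LitQA2 subset and drop the ground-truth answers,
# leaving only questions and distractors. The "train" split and the "ideal"
# column name are assumed -- check the dataset viewer for the actual schema.
from datasets import load_dataset

litqa2 = load_dataset("futurehouse/lab-bench", "LitQA2", split="train")
questions_only = litqa2.remove_columns(["ideal"])
print(questions_only.column_names)
```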
Could you kindly point me to a relevant test case?
Yeah sure, here they are:
- Unit test: https://github.com/Future-House/paper-qa/blob/v5.0.5/tests/test_task.py#L76
- Docs: https://github.com/Future-House/paper-qa/tree/v5.0.5#running-on-litqa-v2
We are working on making it really easy to reproduce our results, both from the PaperQA and LAB-Bench sides. You will have to stay tuned a bit for an end-to-end repro to become available.
Please let me know if you have any other questions
Hi @amanpreet692, one relevant tidbit to note now: https://github.com/Future-House/LAB-Bench/tree/main/LitQA2 notates open papers using `is_opensource` tags. Maybe that helps you.
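For example, a short sketch of keeping only the open-paper questions; the JSONL filename below is a placeholder, so check the LitQA2 directory for the actual file name and field layout:

```python
# Sketch: keep only LitQA2 records whose source paper is tagged open.
# "litqa-v2.jsonl" is a placeholder filename -- use whatever file the
# LAB-Bench LitQA2 directory actually ships, and confirm the field name.
import json

with open("litqa-v2.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]

open_records = [r for r in records if r.get("is_opensource")]
print(f"{len(open_records)} of {len(records)} questions use open papers")
```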
Hello @amanpreet692, we have mostly completed our reproducibility work over the past few months:
- https://arxiv.org/abs/2412.21154 was preprinted in December 2024 (3 months ago), which comes with a test split in `aviary-paper-data`'s `LitQA2` subset. Question IDs and source DOIs for that test split are documented here: https://github.com/Future-House/paper-qa#reproduction
  - It's up to reproducers to build the index themselves, as we can't give out the actual PDFs
- To run PaperQA on LitQA2, we now have a special package for this: `aviary.litqa`
Please let us know if you have any questions, and thanks again for a good issue.