NeuSym-RAG: Hybrid Neural Symbolic Retrieval with Multiview Structuring for PDF Question Answering

ACL 2025 (Main)
💫 Table of Contents
- 💡 Main Contributions
- 🔍 Quick Start
- 📖 PDF Parsing and Encoding
- 📊 Experiment Results
- 📚 Detailed Documents and Tutorials
- ✍🏻 Citation
💡 Main Contributions
- We are the first to integrate both vector-based neural retrieval and SQL-based symbolic retrieval into a unified and interactive NeuSym-RAG framework through executable actions.
- We incorporate multiple views for parsing and vectorizing PDF documents, and adopt a structured database schema to systematically organize both text tokens and encoded vectors.
- Experiments on three realistic full-PDF-based QA datasets for academic research (AirQA-Real, M3SciQA, and SciDQA) validate its superiority over various neural and symbolic baselines.
🔍 Quick Start
- Create the conda environment and install dependencies:
  - Install `poppler` on your system
  - Follow the Official Guide to install MinerU based on your OS platform
  - Check our TroubleShooting tips to ensure the installation of MinerU is successful
  - Install other pip requirements

  ```sh
  conda create -n neusymrag python=3.10
  conda activate neusymrag
  # install MinerU
  pip install -U "magic-pdf[full]" --extra-index-url https://wheels.myhloli.com
  # install other dependencies
  pip install -r requirements.txt
  ```
- Prepare the following models for vector encoding:
  - `sentence-transformers/all-MiniLM-L6-v2`
  - `BAAI/bge-large-en-v1.5`
  - `openai/clip-vit-base-patch32`
  - For embedding model customization, refer to the vectorstore doc

  ```sh
  mkdir -p .cache/ && cd .cache/
  git clone https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
  git clone https://huggingface.co/BAAI/bge-large-en-v1.5
  git clone https://huggingface.co/openai/clip-vit-base-patch32
  ...  # download other vector encoding models if needed
  ```
- Download the dataset-related files into the folder `data/dataset` (👉🏻 HuggingFace):
  - AirQA-Real: this work, including the `metadata/`, `papers/`, and `processed_data/`
  - M3SciQA: including the `metadata/`, `papers/`, `images/`, and `processed_data/`
  - SciDQA: including the `metadata/`, `papers/`, and `processed_data/`

  Organize them into the following folder structure 👇🏻

  ```txt
  data/dataset/
  ├── airqa/
  │   ├── data_format.json.template
  │   ├── metadata/                # metadata for all PDFs
  │   │   ├── aa0e0451-f10a-539b-9c6c-0be53800b94f.json
  │   │   └── ...                  # more metadata for PDFs in ACL 2023
  │   ├── papers/
  │   │   ├── acl2023/
  │   │   │   ├── aa0e0451-f10a-539b-9c6c-0be53800b94f.pdf
  │   │   │   └── ...              # more PDFs in ACL 2023
  │   │   ├── iclr2024/
  │   │   │   ├── aa071344-e514-52f9-b9cf-9bea681a68c2.pdf
  │   │   │   └── ...              # more PDFs in ICLR 2024
  │   │   └── ...                  # more conference + year subfolders
  │   ├── processed_data/
  │   │   ├── aa0e0451-f10a-539b-9c6c-0be53800b94f.json
  │   │   └── ...                  # more processed data for PDFs in ACL 2023
  │   ├── test_data_553.jsonl      # one line for each example
  │   ├── test_data_ablation.jsonl
  │   └── uuids.json               # uuids for all PDFs
  ├── m3sciqa/
  │   ├── images/
  │   │   ├── 2310.04988/
  │   │   │   └── HVI_figure.png
  │   │   └── ...                  # more image subfolders
  │   ├── metadata/
  │   ├── papers/
  │   ├── processed_data/
  │   ├── test_data.jsonl
  │   ├── mappings.json
  │   └── uuids.json
  ├── scidqa/
  │   ├── metadata/
  │   ├── papers/
  │   ├── processed_data/
  │   ├── test_data.jsonl
  │   ├── test_data_775.jsonl
  │   ├── mappings.json
  │   └── uuids.json
  ├── test_pdf.pdf
  └── ccf_catalog.csv
  ```
- Download our constructed databases (`.duckdb`) and vectorstores (`.db` and `bm25.json`) into the folders `data/database/` and `data/vectorstore/`, respectively (👉🏻 HuggingFace 🔗). Otherwise, you can construct them by yourself (see PDF Parsing and Encoding).
  - The 3 dataset name to database / vectorstore name mappings are:

    | Dataset | Dataset Name | Database Name | Vectorstore Name |
    |---|---|---|---|
    | AirQA-Real | airqa | ai_research | ai_research |
    | M3SciQA | m3sciqa | emnlp_papers | emnlp_papers |
    | SciDQA | scidqa | openreview_papers | openreview_papers |

  Folder structures for databases and vectorstores 👇🏻

  ```txt
  data/
  ├── database/
  │   ├── ai_research/
  │   │   ├── ai_research.duckdb
  │   │   ├── ai_research.json
  │   │   └── ai_research.sql
  │   ├── emnlp_papers/
  │   │   ├── emnlp_papers.duckdb
  │   │   ├── emnlp_papers.json
  │   │   └── emnlp_papers.sql
  │   └── openreview_papers/
  │       ├── openreview_papers.duckdb
  │       ├── openreview_papers.json
  │       └── openreview_papers.sql
  └── vectorstore/
      ├── milvus/              # this universal folder is for Milvus launched via Docker containers
      │   └── standalone_embed.sh
      ├── ai_research/         # other folders are for Milvus launched standalone (xxx.db)
      │   ├── ai_research.db
      │   └── bm25.json
      ├── emnlp_papers/
      │   ├── emnlp_papers.db
      │   └── bm25.json
      ├── openreview_papers/
      │   ├── openreview_papers.db
      │   └── bm25.json
      ├── filter_rules.json
      ├── vectorstore_schema.json
      └── vectorstore_schema.json.template
  ```
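In your own scripts, the dataset-to-DB/VS mapping above can be kept as a small lookup table. The following is an illustrative helper of our own (the `DATASET_STORES` dict and `store_paths` function are not part of the codebase), assuming the default file locations shown in the folder structure:

```python
# dataset name -> (database name, vectorstore name), per the mapping table
DATASET_STORES = {
    "airqa":   ("ai_research", "ai_research"),
    "m3sciqa": ("emnlp_papers", "emnlp_papers"),
    "scidqa":  ("openreview_papers", "openreview_papers"),
}

def store_paths(dataset: str) -> tuple:
    """Return the default (.duckdb, .db) file paths for a dataset."""
    db, vs = DATASET_STORES[dataset]
    return (f"data/database/{db}/{db}.duckdb",
            f"data/vectorstore/{vs}/{vs}.db")

print(store_paths("airqa"))
# ('data/database/ai_research/ai_research.duckdb', 'data/vectorstore/ai_research/ai_research.db')
```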
- Run the following commands to compare the performance of our NeuSym-RAG framework with the Classic RAG approach (the evaluation is also included at the end):
  - Configure the `OPENAI_API_KEY` and `OPENAI_BASE_URL` (if needed)

  ```sh
  export OPENAI_API_KEY="sk-xxxxxxxxxxxxxxxxxxxx"
  export OPENAI_BASE_URL="https://api.openai.com/v1"

  # Classic RAG baseline
  $ python scripts/classic_rag_baseline.py --dataset airqa --test_data test_data_553.jsonl --vectorstore ai_research --agent_method classic_rag --llm gpt-4o-mini
  $ python scripts/classic_rag_baseline.py --dataset m3sciqa --test_data test_data.jsonl --vectorstore emnlp_papers --agent_method classic_rag --llm gpt-4o-mini
  $ python scripts/classic_rag_baseline.py --dataset scidqa --test_data test_data_775.jsonl --vectorstore openreview_papers --agent_method classic_rag --llm gpt-4o-mini

  # NeuSym-RAG framework
  $ python scripts/hybrid_neural_symbolic_rag.py --dataset airqa --test_data test_data_553.jsonl --database ai_research --agent_method neusym_rag --llm gpt-4o-mini
  $ python scripts/hybrid_neural_symbolic_rag.py --dataset m3sciqa --test_data test_data.jsonl --database emnlp_papers --agent_method neusym_rag --llm gpt-4o-mini
  $ python scripts/hybrid_neural_symbolic_rag.py --dataset scidqa --test_data test_data_775.jsonl --database openreview_papers --agent_method neusym_rag --llm gpt-4o-mini
  ```

  🚀 NOTE: For more agent baselines (e.g., `iterative_neu_rag` and `two_stage_hybrid_rag`) and variable parameters (e.g., using open-source LLMs like Qwen2.5-VL-Instruct), please refer to 📘 Agent Baselines.
📖 PDF Parsing and Encoding
We also provide the scripts to quickly parse and encode new paper PDFs into existing DB and VS. Take the dataset airqa (and DB / VS ai_research) as an example:
📌 NOTE:
- If the DB and VS do not exist, they will be created automatically
- Adding the argument `--from_scratch` to any script below will delete the existing ones first
PDF Parsing and Encoding for In-coming Requests
- **Multiview Document Parsing**: This step accepts various input types and stores the parsed PDF content into the DuckDB database.
  - The default DB is `data/database/${database}/${database}.duckdb` unless you specify args `--database_path /path/to/db.duckdb`
  - The config file `ai_research_config.json` defines the pipeline functions for parsing PDFs, which can be customized according to our pre-defined rules

  ```sh
  $ python utils/database_utils.py --database ai_research --config_path configs/ai_research_config.json --pdf_path ${pdf_to_parse}
  ```

  Valid input types of args `--pdf_path ${pdf_to_parse}` include:
  - PDF UUID: For example, `16142be2-ac28-58e5-9271-8af299b18d91`. In this case, the metadata of the PDF is pre-processed (that is, `metadata/${uuid}.json` already exists, see Metadata Format), and the raw PDF file has been downloaded as `${uuid}.pdf` into the `papers/` subfolder following the `pdf_path` field in the metadata.
  - Local PDF path to the file (if the PDF file basename is a valid UUID, it reduces to the former case), e.g., `~/Downloads/2005.14165.pdf` or `data/dataset/airqa/papers/iclr2024/aa071344-e514-52f9-b9cf-9bea681a68c2.pdf`
  - Web URL of the PDF file which is downloadable, e.g., `https://arxiv.org/abs/2005.14165`
  - Title or arxiv id of the paper, e.g., `Language Models are Few-Shot Learners` or `2005.14165`
  - A filepath (`.json` list or `.txt` per line) containing a list of any of the 4 types above, e.g., `pdfs_to_parse.json` or `pdfs_to_parse.txt`

  ```sh
  $ cat pdfs_to_parse.json
  [
      "16142be2-ac28-58e5-9271-8af299b18d91",
      "9c5c3a63-3042-582a-9358-d0c61de3330d",
      ...
  ]
  $ cat pdfs_to_parse.txt
  16142be2-ac28-58e5-9271-8af299b18d91
  9c5c3a63-3042-582a-9358-d0c61de3330d
  ...
  ```

  📌 NOTE: Sometimes, the function to obtain paper metadata via scholar APIs may fail (see Scholar APIs). For papers published in a conference or venue, we recommend centrally processing the metadata in advance and downloading the PDF files beforehand.
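Dispatching on the accepted `--pdf_path` input types can be sketched as below. This is a hypothetical illustration of the classification logic, not the project's actual `utils/database_utils.py` code; the function name and regex patterns are our own assumptions:

```python
import re

# hypothetical patterns; the real parsing utilities may differ
UUID_RE = re.compile(r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$")
ARXIV_RE = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")

def classify_pdf_input(value: str) -> str:
    """Guess which kind of --pdf_path input `value` is."""
    if UUID_RE.match(value):
        return "uuid"            # metadata and raw PDF assumed pre-processed
    if value.startswith(("http://", "https://")):
        return "url"             # downloadable web URL
    if value.endswith((".json", ".txt")):
        return "file_list"       # a file containing a list of any other type
    if value.endswith(".pdf"):
        return "local_path"      # local PDF file on disk
    if ARXIV_RE.match(value):
        return "arxiv_id"
    return "title"               # fall back to a paper-title lookup

print(classify_pdf_input("16142be2-ac28-58e5-9271-8af299b18d91"))  # uuid
```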
- **Multimodal Vector Encoding**: Before vector encoding, please ensure that the PDF content has already been parsed into the corresponding DB, and that the metadata `${uuid}.json` and raw file `${uuid}.pdf` already exist under the `metadata/` and `papers/` folders. Note that:
  - We only accept PDF UUIDs as the input PDF(s)
  - Please ensure the embedding models exist under `.cache/` and the corresponding collection name exactly follows our VS naming convention defined in the vectorstore schema
  - Please ensure that the `bm25.json` file exists under the path `data/vectorstore/${vectorstore}/bm25.json` if you want to use the BM25 collection. Otherwise, create the BM25 vocabulary first
  - The default VS is launched from `data/vectorstore/${vectorstore}/${vectorstore}.db` (standalone mode). This file path can be specified via args `--vectorstore_path /path/to/vs.db`
  - The default launch method for VS is `standalone` unless you specify args like `--launch_method docker` and `--docker_uri http://127.0.0.1:19530`
  - If your OS is Windows, please follow the guide on Run Milvus in Docker (Windows)

  ```sh
  # By default, using standalone mode (*.db)
  $ python utils/vectorstore_utils.py --vectorstore ai_research --pdf_path pdf_uuids_to_encode.json # --launch_method=standalone

  # Or, using Docker containers
  $ cd data/dataset/vectorstore/milvus && bash standalone_embed.sh start   # start Milvus containers
  $ cd -                                                                   # return to the project root
  $ python utils/vectorstore_utils.py --vectorstore ai_research --pdf_path pdf_uuids_to_encode.txt --launch_method docker --docker_uri http://127.0.0.1:19530
  $ cd data/dataset/vectorstore/milvus && bash standalone_embed.sh stop    # stop Milvus containers
  ```
- **The Complete Parsing and Encoding Pipeline**: If you want to parse and encode new PDFs in one step, use the following command:
  - Please ensure that the `database` and `vectorstore` names are the same
  - `pdf_path` and `config_path`: these arguments are the same as those in Multiview Document Parsing
  - If you want to launch the vectorstore via Docker containers, see Multimodal Vector Encoding

  ```sh
  python utils/data_population.py --database ai_research --vectorstore ai_research --pdf_path pdfs.json --config_path configs/ai_research_config.json
  ```
💡 TIP: If you want to accelerate the PDF parsing process given abundant papers, please refer to Database Population Acceleration.
📊 Experiment Results
We compare our NeuSym-RAG with Classic-RAG on 3 full-PDF-based academic research Q&A datasets using 5 LLMs/VLMs:
| Method | Model | AirQA-Real AVG | M3SciQA AVG | SciDQA AVG |
|---|---|---|---|---|
| Classic-RAG | GPT-4o-mini | 13.4 | 15.6 | 59.8 |
| | GPT-4V | 14.7 | 11.1 | 57.4 |
| | Llama-3.3-70B-Instruct | 10.0 | 11.3 | 58.0 |
| | Qwen2.5-VL-72B-Instruct | 10.5 | 11.6 | 56.2 |
| | DeepSeek-R1 | 13.9 | 11.2 | 62.4 |
| NeuSym-RAG | GPT-4o-mini | 30.7 | 18.0 | 63.0 |
| | GPT-4V | 37.3 | 13.6 | 63.1 |
| | Llama-3.3-70B-Instruct | 29.3 | 23.6 | 56.4 |
| | Qwen2.5-VL-72B-Instruct | 39.6 | 21.1 | 60.5 |
| | DeepSeek-R1 | 32.4 | 17.4 | 64.5 |
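As a quick sanity check, the per-model gain of NeuSym-RAG over Classic-RAG on the first score column (the AirQA-Real dataset) can be recomputed from the table above; the numbers below are copied verbatim from it:

```python
# first-column (AirQA-Real) AVG scores from the results table
classic = {"GPT-4o-mini": 13.4, "GPT-4V": 14.7, "Llama-3.3-70B-Instruct": 10.0,
           "Qwen2.5-VL-72B-Instruct": 10.5, "DeepSeek-R1": 13.9}
neusym = {"GPT-4o-mini": 30.7, "GPT-4V": 37.3, "Llama-3.3-70B-Instruct": 29.3,
          "Qwen2.5-VL-72B-Instruct": 39.6, "DeepSeek-R1": 32.4}

# absolute improvement per backbone model
gains = {m: round(neusym[m] - classic[m], 1) for m in classic}
print(gains)  # every backbone improves by double digits
```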
📈 Evaluation
The instance-specific evaluation metric is defined in the field `evaluator` for each data sample (see Data Format). For evaluation, create a `.jsonl` file in which each line stores the predicted string or object in the field `answer`, together with the `uuid` of the corresponding test sample, like:

```txt
{"uuid": "00608f20-e3f5-5fdc-8979-4efeb0756d8e", "answer": "True"}
{"uuid": "00b28687-3ea1-5974-a1ec-80d7f6cd3424", "answer": "3.14"}
...
```
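Producing such a prediction file from in-memory answers can be sketched as follows. This is a minimal illustration: the `predictions` dict and the output filename are our own assumptions (only the `uuid`/`answer` fields per line are required):

```python
import json

# hypothetical predicted answers keyed by example uuid
predictions = {
    "00608f20-e3f5-5fdc-8979-4efeb0756d8e": "True",
    "00b28687-3ea1-5974-a1ec-80d7f6cd3424": "3.14",
}

# write one JSON object per line, matching the expected .jsonl format
with open("test_data_553_pred.jsonl", "w", encoding="utf-8") as f:
    for uuid, answer in predictions.items():
        f.write(json.dumps({"uuid": uuid, "answer": answer}) + "\n")
```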
Then, you can run the following command to evaluate the performance (take dataset airqa as an example):

```sh
python utils/eval_utils.py --gold data/dataset/airqa/test_data_553.jsonl --pred test_data_553_pred.jsonl --dataset airqa --output evaluation.log
```
📚 Detailed Documents and Tutorials
Fine-grained documents for this project are provided in the folder documents/. Here is the checklist:
| Documents | Description |
|---|---|
| 📓 documents/dataset.md | Dataset folder structure, statistics, download links, and utility functions |
| 📔 documents/airqa_format.md | Data format and paper metadata format |
| 📕 documents/database.md | Database folder structure, database population framework, database schema format, and parallel processing tricks |
| 📗 documents/vectorstore.md | Vectorstore folder structure, JSON formats of inserted data entries and the vectorstore schema, vector encoding framework, and the complete data population process |
| 📘 documents/agent.md | Details on different agent methods, as well as the running scripts and arguments |
| 📙 documents/third_party_tools.md | Scholar APIs to get the paper metadata and the MinerU library for PDF parsing |
✍🏻 Citation
If you find this project useful, please cite our work:
```bib
@inproceedings{cao-etal-2025-neusym,
    title = "{N}eu{S}ym-{RAG}: Hybrid Neural Symbolic Retrieval with Multiview Structuring for {PDF} Question Answering",
    author = "Cao, Ruisheng and
      Zhang, Hanchong and
      Huang, Tiancheng and
      Kang, Zhangyi and
      Zhang, Yuxin and
      Sun, Liangtai and
      Li, Hanqi and
      Miao, Yuxun and
      Fan, Shuai and
      Chen, Lu and
      Yu, Kai",
    editor = "Che, Wanxiang and
      Nabende, Joyce and
      Shutova, Ekaterina and
      Pilehvar, Mohammad Taher",
    booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.acl-long.311/",
    pages = "6211--6239",
    ISBN = "979-8-89176-251-0"
}
```