dspy-arxiv
dspy-arxiv copied to clipboard
Explore the use of DSPy for extracting features from PDFs 🔎
dspy-arxiv
Explore the use of DSPy for extracting features from PDFs. This repository provides a simple example of how to use this framework to predict the sub-category of a Computer Science paper from arXiv.
Suggested Installation
- Clone this repository.
- Create a virtual environment.
- Install dependencies from requirements.txt.
- Install the virtual environment as a Jupyter kernel.
Build Dataset & Database
The dataset is a selection of 150 arXiv papers (metadata + pdf) from the computer science category.
To build the database:
- Download the JSON file from Kaggle into the
dspy-arxivdirectory. - Rename the file to
arxiv.json. - Run the notebook
data.ipynbfrom top to bottom.
At the end, you should have two directories:
- dspy-arxiv/database
- arxiv.json - the original JSON file with only the computer science category
- dspy-arxiv/dataset
- trainset - 50 JSON files with metadata + text used for "training"
- valset - 50 JSON files with metadata + text used for "validation"
- testset - 50 JSON files with metadata + text used for "testing"
If you want to add RAG to the pipeline, it's handy to have the data in a vector database for fast retrieval. Check out database.py for an example script to set up chromadb and populate it with arXiv metadata.
Features Extraction
The notebook features.ipynb can be seen as a simple tutorial on how to use DSPy to programmatically prompt LLM for feature extraction (in this case, predicting the sub-category of a Computer Science paper from arXiv).
You can also take a look at the slides generated from this notebook.