paper-qa
How to change the metadata provider (Crossref and Semantic Scholar)?
I don't have API keys for Crossref and Semantic Scholar, so I'd like to use a PubMed API key as an alternative. Could you please guide me on how to integrate the PubMed API into paperqa? Since paperqa is packaged as a Python library, I'm also wondering what modifications I need to make.
As a complete beginner, I would greatly appreciate it if you could outline a general workflow for these changes. Your help would mean a lot to me. Thank you!
Hi @Hamesen! I'm Dosu and I’m helping the paper-qa team.
To switch paper-qa’s metadata provider from Crossref/Semantic Scholar to PubMed, you’ll need to add a new provider class for PubMed and configure the library to use it. Here’s a beginner-friendly workflow:
- Create a PubMed provider: In paperqa/clients/, create a new file (e.g., pubmed.py). Copy the structure of crossref.py or semantic_scholar.py. Your class should inherit from DOIOrTitleBasedProvider and implement the _query and query_transformer methods. These methods will use the PubMed (Entrez E-utilities) API to fetch metadata by DOI or title, then parse the response into a DocDetails object. See how this is done for other providers for reference (example).
- Handle the PubMed API key: In your new provider, read the API key from an environment variable (e.g., os.environ.get("PUBMED_API_KEY")), just like other providers do for their keys (example).
- Register your provider: In paperqa/clients/__init__.py, add your new provider to the list of clients (replace or remove the defaults if you only want PubMed). You can do this by editing the DEFAULT_CLIENTS set or by passing your provider directly to DocMetadataClient when you use it (details).
- Update BibTeXSource: For provenance tracking, add a PUBMED value to the BibTeXSource enum (if you want to track where BibTeX entries come from, as recent code changes do for other providers) (reference).
- Set your API key: In your environment, set the PUBMED_API_KEY variable so your provider can use it.
- Test your integration: Use the patterns in tests/test_clients.py to test your provider. You can run queries by DOI or title and check that metadata comes from PubMed.
You don’t need to change anything outside the clients directory or the provider registration. The codebase is designed to make adding or swapping providers straightforward, even for beginners. If you need a template, look at the existing provider files for structure and logic (reference).
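To make the PubMed side of the steps above more concrete, here is a minimal standard-library sketch of the two Entrez E-utilities calls a provider's _query method could wrap: ESearch to find a PubMed ID from a DOI or title, then ESummary to fetch the record. It is not paper-qa code. The function names, the returned dictionary layout, and the assumed shape of the ESummary JSON are illustrative assumptions, and mapping the result onto DocDetails is left to your provider class.

```python
import json
import os
import urllib.parse
import urllib.request

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(doi=None, title=None, api_key=None):
    """Build an ESearch URL that looks up a PubMed ID by DOI or by title."""
    if doi:
        term = f"{doi}[AID]"  # [AID] is PubMed's article-identifier field (covers DOIs)
    elif title:
        term = f"{title}[Title]"
    else:
        raise ValueError("Provide a DOI or a title")
    params = {"db": "pubmed", "term": term, "retmode": "json", "retmax": "1"}
    if api_key:
        params["api_key"] = api_key
    return f"{EUTILS_BASE}/esearch.fcgi?{urllib.parse.urlencode(params)}"


def build_esummary_url(pmid, api_key=None):
    """Build an ESummary URL that fetches the summary record for one PubMed ID."""
    params = {"db": "pubmed", "id": pmid, "retmode": "json"}
    if api_key:
        params["api_key"] = api_key
    return f"{EUTILS_BASE}/esummary.fcgi?{urllib.parse.urlencode(params)}"


def parse_esummary(payload, pmid):
    """Extract the fields a provider would likely copy into its metadata object.

    Assumes the ESummary JSON layout {"result": {"<pmid>": {...}}}.
    """
    record = payload["result"][pmid]
    doi = next(
        (a.get("value") for a in record.get("articleids", []) if a.get("idtype") == "doi"),
        None,
    )
    return {
        "title": record.get("title"),
        "authors": [a.get("name") for a in record.get("authors", [])],
        "journal": record.get("fulljournalname"),
        "pubdate": record.get("pubdate"),
        "doi": doi,
        "pmid": pmid,
    }


def fetch_json(url):
    """Perform a GET request and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def lookup(doi=None, title=None):
    """Return a metadata dict for the first PubMed match, or None if nothing matches."""
    api_key = os.environ.get("PUBMED_API_KEY")  # optional; raises the rate limit when set
    search = fetch_json(build_esearch_url(doi=doi, title=title, api_key=api_key))
    ids = search["esearchresult"]["idlist"]
    if not ids:
        return None
    summary = fetch_json(build_esummary_url(ids[0], api_key=api_key))
    return parse_esummary(summary, ids[0])
```

Calling lookup(doi="...") or lookup(title="...") performs the network requests. The URL builders and the parser are kept separate so you can unit-test them without network access, which fits the testing step above.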
To reply, just mention @dosu.
When I don't provide papers to paperqa, it's unable to answer my questions. I'm wondering if paperqa supports keyword-based literature search functionality, which would allow it to proactively retrieve resources when no papers are provided.
Additionally, I'd like to clarify: after paperqa obtains metadata via the Semantic Scholar API, is the next step to download relevant papers using this metadata to answer questions? And what is the purpose of obtaining this metadata?
I would greatly appreciate it if you could provide answers to these questions.
Paper-qa's core system does not support proactive keyword-based literature search or automatic retrieval of resources when no papers are provided—it can only answer questions using documents you supply (like PDFs or URLs). However, there are optional modules for keyword-based search: the OpenReview integration lets you search and download relevant papers from selected conferences using a question or keywords, and the Zotero integration allows you to search your personal bibliography and add matching papers to paper-qa. Both require explicit setup and action from you—they are not automatic features of the main workflow. You can find more details and examples in the documentation for OpenReview and Zotero helpers.
After paper-qa obtains metadata via the Semantic Scholar API, it does not automatically download the full papers. The metadata is used for citation, provenance tracking, and sometimes to provide links to open access PDFs, but you need to manually add any papers you want to use for question answering. The main purpose of obtaining metadata is to enhance citation quality and keep track of document sources, not to fetch or download the actual papers automatically.
Since paperqa doesn’t proactively download relevant literature, I’m looking to develop a literature repository. How should I structure this repository to ensure it’s compatible with paperqa’s retrieval methods? Additionally, what interfaces (or APIs) should I provide to paperqa for accessing the literature?
I would greatly appreciate your guidance on these questions.
To make your literature repository compatible with paper-qa, organize your papers in a directory (the default is usually called papers/). Paper-qa will automatically index all supported files—PDF, TXT, HTML, and MD—in this directory and its subdirectories by default. File naming is not important as long as the extension is correct. You can customize which files are indexed using the files_filter setting if needed.
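If you want to check which files in your repository would qualify before indexing, a small standard-library sketch like the one below can help. It only mirrors the extension rule described above and is not paper-qa's own implementation. The function names are hypothetical, and the predicate is written in the shape a files_filter-style callable would take (a path in, a boolean out).

```python
from pathlib import Path

# Extensions named above as indexed by default.
SUPPORTED_SUFFIXES = {".pdf", ".txt", ".html", ".md"}


def has_supported_suffix(path: Path) -> bool:
    """Return True when the file extension is one of the supported types."""
    return path.suffix.lower() in SUPPORTED_SUFFIXES


def list_indexable(paper_dir):
    """Recursively list files under paper_dir that pass the suffix check."""
    root = Path(paper_dir)
    return sorted(p for p in root.rglob("*") if p.is_file() and has_supported_suffix(p))


if __name__ == "__main__":
    # "papers" is an assumed directory name; point this at your own repository.
    for path in list_indexable("papers"):
        print(path)
```

Running this against your directory gives a quick preview of what will be picked up, including files in subdirectories.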
For more precise metadata (like titles, DOIs, or custom citations), you can add a manifest CSV file with columns such as file_location, doi, title, key, docname, and citation. This manifest is optional but helps avoid ambiguity and improves citation quality, especially if you want to skip LLM-based metadata extraction. The manifest should live in the root of your repository or wherever you specify in your settings. See an example and more details here.
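As a rough illustration, a manifest with the three most common columns could be generated with Python's csv module as sketched below. The file names, DOI, and titles are placeholders, the helper names are hypothetical, and the assumption that file_location is written relative to the paper directory should be checked against your settings.

```python
import csv
import io

# A subset of the manifest columns mentioned above.
MANIFEST_COLUMNS = ["file_location", "doi", "title"]


def manifest_to_csv(rows):
    """Render manifest rows (a list of dicts) as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=MANIFEST_COLUMNS)
    writer.writeheader()
    for row in rows:
        # Missing values become empty cells rather than raising an error.
        writer.writerow({column: row.get(column, "") for column in MANIFEST_COLUMNS})
    return buffer.getvalue()


def save_manifest(path, rows):
    """Write the manifest to disk; newline='' keeps the csv line endings intact."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        handle.write(manifest_to_csv(rows))


if __name__ == "__main__":
    # Placeholder entries for illustration only.
    example_rows = [
        {"file_location": "smith2020.pdf", "doi": "10.1000/example", "title": "An Example Title"},
        {"file_location": "notes.md", "title": "Lab notes"},
    ]
    print(manifest_to_csv(example_rows))
```

Entries without a DOI simply leave that cell empty, so you can fill in identifiers gradually as you curate the repository.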
Paper-qa accesses your repository via filesystem paths—no custom API is needed. You can also add documents programmatically using the Docs class and its async methods (aadd, aadd_file, aadd_url), which accept file paths, file-like objects, or URLs. This means you can integrate custom downloaders or external sources by saving files to your repository or passing them directly to paper-qa in code. For advanced workflows, see how the Zotero and OpenReview integrations populate the Docs object.
To sum up: keep your repository as a directory of supported file types, optionally use a manifest for metadata, and let paper-qa access it via filesystem paths or by passing files/URLs directly in code. No extra API endpoints are required; just make sure paper-qa can read the files. For best results, set settings.agent.index.use_absolute_paper_directory to False to make your setup resilient to directory renames or moves (source).