paper-qa
Use pdf checksums instead of file paths for indexing?
At present, paper-qa seems to key its indices by file path.
For example, renaming the root document directory for papers to papers2 causes paper-qa to treat all files inside as "new".
Would it be possible to avoid using file paths and instead use something like md5 hashes of the PDFs?
This way the input papers can be moved / reorganized without having to recompute the indices.
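To make the suggestion concrete, here is a minimal sketch of a content-based key. The helper name `content_key` is hypothetical and not part of paper-qa; it only illustrates that a hash of the file's bytes does not change when the file or its parent directory is moved or renamed.

```python
import hashlib
from pathlib import Path


def content_key(pdf_path: Path, chunk_size: int = 1 << 20) -> str:
    """Hypothetical helper (not a paper-qa API): hex MD5 digest of a file's bytes."""
    digest = hashlib.md5()
    with pdf_path.open("rb") as f:
        # Read in chunks so large PDFs are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Because the key depends only on the bytes, two copies of the same PDF in different directories would also map to the same key, which may or may not be desirable for an index.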
I mostly follow what you're saying but want to confirm. Are you talking about:
- The key in the index: https://github.com/Future-House/paper-qa/blob/v5.20.0/paperqa/agents/search.py#L313
- The autogenerated index name: https://github.com/Future-House/paper-qa/blob/v5.20.0/paperqa/settings.py#L774-L779
One way to approach the first bullet is to use relative paths by setting settings.agent.index.use_absolute_paper_directory to False: https://github.com/Future-House/paper-qa/blob/v5.20.0/paperqa/agents/search.py#L509-L510
Actually this is what we do internally at FutureHouse.
This makes the stored paths relative, which should be resilient to root directory renames.
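As a rough illustration of why relative keys survive a rename, here is a sketch using only `pathlib`. The function `index_key` and its `use_absolute` flag are hypothetical stand-ins for the behavior described above, not paper-qa's actual implementation.

```python
from pathlib import PurePosixPath


def index_key(
    file_path: PurePosixPath,
    paper_directory: PurePosixPath,
    use_absolute: bool = False,
) -> str:
    """Hypothetical stand-in: build an index key from a file path."""
    if use_absolute:
        return str(file_path)
    # The key is relative to the root, so the root's own name is not part of it.
    return str(file_path.relative_to(paper_directory))


before = index_key(PurePosixPath("/data/papers/xx.pdf"), PurePosixPath("/data/papers"))
after = index_key(PurePosixPath("/data/papers2/xx.pdf"), PurePosixPath("/data/papers2"))
print(before, after)  # xx.pdf xx.pdf
```

With `use_absolute=True` the two keys would differ (`/data/papers/xx.pdf` vs. `/data/papers2/xx.pdf`), which is the situation a root directory rename would break.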
I'm thinking of the key in the index.
I noticed the settings.agent.index.use_absolute_paper_directory option, but from the README and code (https://github.com/Future-House/paper-qa/blob/29135d0a37b3f4f61116642f1bcddf402da5d3ed/paperqa/settings.py#L405), it looks like it defaults to False, which is what I want?
I've only just started experimenting with paper-qa, so it's also completely possible that I am just doing something untoward in the settings.
Steps to recreate (on my end):
- create a directory ("test1") with a single paper
- python test.py
- mv test1 test2
- python test.py
test.py
from paperqa import Settings, ask

local_llm_config = {
    "model_list": [
        {
            "model_name": "ollama/deepseek-r1:1.5b",
            "litellm_params": {
                "model": "ollama/deepseek-r1:1.5b",
                "api_base": "http://localhost:11434"
            }
        }
    ]
}

answer = ask(
    "What are the most important approaches used in drug discovery?",
    settings=Settings(
        paper_directory="/path/to/test1",
        llm="ollama/deepseek-r1:1.5b",
        embedding="ollama/nomic-embed-text",
        llm_config=local_llm_config,
        summary_llm="ollama/deepseek-r1:1.5b",
        summary_llm_config=local_llm_config
    )
)
When I run it the first time, the paper is indexed and queried as expected.
Running the script a second time without changing the directory name skips the indexing and goes straight to the query, as expected.
Once I rename the "test1" directory to something else, however, paper-qa treats it as new again:
[13:25:05] New file to index: xx.pdf...
Version: 17fb0a3d650db98b7c94fd378c76b90dc8fa8b4b (Mar 29, 2025)