
Use pdf checksums instead of file paths for indexing?

Open khughitt opened this issue 8 months ago • 2 comments

At present, paper-qa seems to key its index entries by file path.

For example, renaming the root document directory for papers to papers2 causes paper-qa to treat all files inside as "new".

Would it be possible to avoid using file paths and instead use something like md5 hashes of the PDFs?

This way the input papers can be moved / reorganized without having to recompute the indices.
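To make the idea concrete, here is a minimal sketch of a path-independent key, assuming the key is simply a digest of the file's bytes. The `content_key` helper is hypothetical and is not part of paper-qa; md5 is used only because it is the hash mentioned above.

```python
import hashlib
from pathlib import Path


def content_key(pdf_path: Path, chunk_size: int = 1 << 20) -> str:
    """Hypothetical helper: hex digest of the file's bytes, independent of its path."""
    digest = hashlib.md5()  # used as a fingerprint only, not for security
    with pdf_path.open("rb") as f:
        # Read in chunks so large PDFs are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

With a key like this, the same PDF under papers/ and papers2/ would map to the same entry, while any change to the file's bytes would produce a new key.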

khughitt avatar Mar 29 '25 19:03 khughitt

I mostly follow what you're saying but want to confirm. Are you talking about:

  • The key in the index: https://github.com/Future-House/paper-qa/blob/v5.20.0/paperqa/agents/search.py#L313
  • The autogenerated index name: https://github.com/Future-House/paper-qa/blob/v5.20.0/paperqa/settings.py#L774-L779

One way to approach the first bullet is to use relative paths by setting settings.agent.index.use_absolute_paper_directory to False: https://github.com/Future-House/paper-qa/blob/v5.20.0/paperqa/agents/search.py#L509-L510

Actually this is what we do internally at FutureHouse.

This makes the index keys relative paths, which should be resilient to root directory renames.
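As a rough illustration of why relative keys survive a rename, here is a sketch that uses only the standard library. The `index_key` helper is hypothetical and is not paper-qa's actual code; it only assumes that a relative key is the file path with the paper directory stripped off.

```python
from pathlib import PurePosixPath


def index_key(file_path: PurePosixPath, paper_directory: PurePosixPath,
              use_absolute: bool) -> str:
    """Hypothetical helper: the key under which a file would be stored."""
    if use_absolute:
        # The key embeds the root directory, so renaming the root changes it.
        return str(file_path)
    # The key is relative to the root, so renaming the root leaves it unchanged.
    return str(file_path.relative_to(paper_directory))


before = PurePosixPath("/data/papers")
after = PurePosixPath("/data/papers2")

# Relative keys match before and after the rename.
assert index_key(before / "sub/x.pdf", before, use_absolute=False) == \
    index_key(after / "sub/x.pdf", after, use_absolute=False)

# Absolute keys differ, so every file would look new.
assert index_key(before / "sub/x.pdf", before, use_absolute=True) != \
    index_key(after / "sub/x.pdf", after, use_absolute=True)
```

Note that a relative key still changes if a file is moved or renamed inside the paper directory.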

jamesbraza avatar Mar 29 '25 23:03 jamesbraza

I'm thinking of the key in the index.

I noticed the settings.agent.index.use_absolute_paper_directory option, but from the README and code (https://github.com/Future-House/paper-qa/blob/29135d0a37b3f4f61116642f1bcddf402da5d3ed/paperqa/settings.py#L405), it looks like it defaults to False, which is what I want?

I've only just started experimenting with paper-qa, so it's also completely possible that I am just doing something untoward in the settings.

Steps to reproduce (on my end):

  1. create a directory ("test1") with a single paper
  2. python test.py
  3. mv test1 test2
  4. python test.py

test.py

from paperqa import Settings, ask

# LiteLLM router config pointing at a local Ollama server
local_llm_config = {
    "model_list": [
        {
            "model_name": "ollama/deepseek-r1:1.5b",
            "litellm_params": {
                "model": "ollama/deepseek-r1:1.5b",
                "api_base": "http://localhost:11434"
            }
        }
    ]
}

answer = ask(
    "What are the most important approaches used in drug discovery?",
    settings=Settings(
        # Directory containing the single test paper; renamed in step 3
        paper_directory="/path/to/test1",
        llm="ollama/deepseek-r1:1.5b",
        embedding="ollama/nomic-embed-text",
        llm_config=local_llm_config,
        summary_llm="ollama/deepseek-r1:1.5b",
        summary_llm_config=local_llm_config
    )
)

When I run the script the first time, the paper is indexed and queried as expected.

Running the script a second time without changing the directory name skips the indexing and goes straight to the query, as expected.

Once I rename the "test1" directory to something else, however, paper-qa treats it as new again:

[13:25:05] New file to index: xx.pdf...

Version: 17fb0a3d650db98b7c94fd378c76b90dc8fa8b4b (Mar 29, 2025)

khughitt avatar Mar 30 '25 17:03 khughitt