
Cache embeddings and OCR results?

Open MilesCranmer opened this issue 2 years ago • 10 comments

Hey @whitead,

I think it would be really nice if I could spin up a paperqa instance without waiting for all papers to be OCR'd and embedded each time. Would it be okay to cache a few more things? Specifically:

  1. The embeddings used for FAISS, and
  2. The OCR of a PDF.

I think both of these could be stored in a separate cache file. The langchain.cache.SQLiteCache makes this pretty easy, as there are only two methods: lookup for getting a result, and update for updating the cache. Both use strings as keys and values, but you could serialize the metadata into a string.

Here's an example for the OCR. This would go into readers.py, with parse_pdf becoming a caching wrapper around the existing implementation (renamed _parse_pdf):

import langchain.cache

OCR_CACHE = langchain.cache.SQLiteCache(str(CACHE_PATH.parent / "ocr_cache.db"))

Then, parse_pdf would have[^1]:

def parse_pdf(...):
    # Key the cache on the PDF path (see footnote 1 about hashing contents instead).
    cache_key = dict(prompt=str(pdf_path), llm_string="")
    test_out = OCR_CACHE.lookup(**cache_key)

    # Cache miss: parse the PDF; cache hit: deserialize the stored result.
    out = _parse_pdf(...) if test_out is None else deserialize(test_out)

    if test_out is None:
        OCR_CACHE.update(**cache_key, return_val=serialize(out))

    return out

def _parse_pdf(...):
    # The current parse_pdf body, unchanged

What do you think? For serialization I would use json.dumps and json.loads[^2]. One might argue that it's better to use a custom database for this, but why not keep it simple, since you are already using the langchain database.
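
For example, serialize and deserialize could be as simple as this (just a sketch: the helper names are placeholders, the version check is the idea from footnote 2, and it assumes the parsed output is JSON-serializable):

import json

import paperqa

def serialize(out) -> str:
    # Store the parsed text together with the paperqa version that produced it.
    return json.dumps({"version": paperqa.__version__, "data": out})

def deserialize(raw: str):
    # Treat entries written by a different paperqa version as a cache miss.
    entry = json.loads(raw)
    if entry.get("version") != paperqa.__version__:
        return None
    return entry["data"]

(If deserialize returns None on a version mismatch, parse_pdf above would just need to treat that the same as a cache miss and re-parse.)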

Cheers, Miles

[^1]: You might want to hash the PDF file itself, or maybe the first N bytes, rather than the filename. But I think for starters the filename is simpler and fine.

[^2]: For posterity, it would be wise to also serialize paperqa.__version__ in each cache entry, and, if it's a different version, then ignore the cache and overwrite it.

MilesCranmer avatar Apr 07 '23 06:04 MilesCranmer

Great! Highly supportive - some of this work duplicates what chroma, weaviate, pinecone, etc. already do. But yes, I agree some slightly better caching would really help here.

Related to #38 - we should add an md5 hash on content when caching, instead of paths. Thanks for the contributions!

whitead avatar Apr 09 '23 19:04 whitead

Sounds good. We could also move that file hashing function into the utils.py module for use elsewhere.
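
Something along these lines, say (the function name and chunk size are arbitrary):

import hashlib
from pathlib import Path

def md5sum(path: Path, chunk_size: int = 1 << 20) -> str:
    # Hash the file contents in chunks so large PDFs don't need to fit in memory.
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

Then the cache_key in parse_pdf could use prompt=md5sum(pdf_path) instead of the raw path.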

Regarding the embeddings, is there a way for FAISS to cache them itself? It doesn't look like there is any exposed piece of code where caching could be set up on the paper-qa side.

e.g., this line: https://github.com/whitead/paper-qa/blob/861f8057f7974238d729d4c1e252b3d64f1dcd90/paperqa/docs.py#L178 looks to be where it both (1) computes the embeddings, and (2) adds them to the search index. But we would want a way to cache (1)...

MilesCranmer avatar Apr 10 '23 00:04 MilesCranmer

Or maybe the langchain caching already handles the embeddings?

MilesCranmer avatar Apr 10 '23 01:04 MilesCranmer

@MilesCranmer - pickling saves the FAISS index. What I usually do is just store my Docs object in .paperqa/{name}.pkl, since that directory is already built for the llm cache. Would that work for you?
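
i.e., something like this (a rough sketch; the directory, file name, and add call are just illustrative):

import pickle
from pathlib import Path

from paperqa import Docs

index_path = Path(".paperqa") / "my_library.pkl"

if index_path.exists():
    # Reuse the previously built Docs (the FAISS index is pickled along with it).
    with open(index_path, "rb") as f:
        docs = pickle.load(f)
else:
    docs = Docs()
    for pdf in Path("papers").glob("*.pdf"):
        docs.add(str(pdf))
    index_path.parent.mkdir(exist_ok=True)
    with open(index_path, "wb") as f:
        pickle.dump(docs, f)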

whitead avatar Apr 10 '23 04:04 whitead

Thanks, yes, that helps!

Followup question: what is the expected workflow here?

  1. Is it to add all papers into a single Docs() object, and make queries of that all-encompassing library?
  2. Or would a possible workflow be to have different Docs() objects for subcollections of papers/different projects?

In case 1, pickling a single Docs() and searching it works well. But in case 2, it might make sense to cache paper embeddings individually, in case I want the same paper to appear in multiple Docs() objects, and perhaps to quickly generate a Docs() object for, e.g., writing a literature review on a particular topic (using only pre-vetted papers). What do you think? I guess the workflow affects whether individual caching makes sense or not.

Cheers! Miles

MilesCranmer avatar Apr 10 '23 04:04 MilesCranmer

One other reason to use a cache is improved robustness against changing APIs. A cache could store simpler objects, like (paper_hash, embedding), which would be unaffected by new APIs. But a pickle file is a bit overkill and assumes the state is exactly the same, which might result in unexpected behavior.
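
For instance, a minimal store of that shape could be nothing more than this (names and storage format made up, just to illustrate the idea):

import json
import sqlite3

class EmbeddingCache:
    # Tiny (paper_hash -> embedding vectors) store, independent of any index library.
    def __init__(self, db_path: str = "embeddings.db"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS embeddings (paper_hash TEXT PRIMARY KEY, vectors TEXT)"
        )

    def lookup(self, paper_hash: str):
        row = self.conn.execute(
            "SELECT vectors FROM embeddings WHERE paper_hash = ?", (paper_hash,)
        ).fetchone()
        return None if row is None else json.loads(row[0])

    def update(self, paper_hash: str, vectors) -> None:
        self.conn.execute(
            "INSERT OR REPLACE INTO embeddings VALUES (?, ?)",
            (paper_hash, json.dumps(vectors)),
        )
        self.conn.commit()

The point being that re-adding cached vectors to a fresh index is cheap; it's the embedding API calls that are worth caching.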

MilesCranmer avatar Apr 10 '23 06:04 MilesCranmer

@MilesCranmer Not finished yet, but I'd like to transition to hybrid search so that we have keyword + vector search. One of those keywords can be attached to a subcollection.

I've tried to get that effect via the new doc_match function that filters the vector similarity search, but long-term there might be better ways, like what's done in llama-index.

whitead avatar Apr 10 '23 06:04 whitead

I see, sounds like a great idea!

MilesCranmer avatar Apr 10 '23 06:04 MilesCranmer

Do you know where I could add the embeddings cache? It does seem like the embeddings are regenerated every time (unless I am unpickling an existing Docs() object), even with the langchain cache set up.

MilesCranmer avatar Apr 13 '23 18:04 MilesCranmer

I prefer pickling Docs as the recommended method of caching, just to reduce complexity

whitead avatar Apr 21 '23 20:04 whitead

@MilesCranmer we just released version 5 today that:

  • Moves all LLM management to https://github.com/BerriAI/litellm
  • Drops both LangChain and faiss from dependencies
  • Centers on a directory of texts
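
Basic usage now looks roughly like this (a sketch based on the current README; exact Settings fields may differ):

from paperqa import Settings, ask

# Point paper-qa at a directory of papers; parsing, embedding, and indexing of
# that directory are handled internally in >=5.
answer = ask(
    "How does caching of parsed documents work?",
    settings=Settings(paper_directory="my_papers"),
)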

Reading up on this issue, it seems the constraints here have changed enough that perhaps the issue no longer applies.

Leaving this open; feel free to follow up on this with paper-qa>=5.

jamesbraza avatar Sep 11 '24 17:09 jamesbraza

Thanks! I’ll shelve this and let you know

MilesCranmer avatar Sep 11 '24 19:09 MilesCranmer