paper-qa
Cache embeddings and OCR results?
Hey @whitead,
I think it would be really nice if I could spin up a paperqa instance without waiting for all papers to be OCR'd and embedded each time. Would it be okay to cache a few more things? Specifically:
- The embeddings used for FAISS, and
- The OCR of a PDF.
I think both of these could be stored in a separate cache file. The langchain.cache.SQLiteCache makes this pretty easy, as there are only two methods: lookup for getting a result and update for updating the cache. Both use strings as keys and values, but you could serialize the metadata into a string.
Here's an example for the OCR. This would go into readers.py, with parse_pdf wrapping _parse_pdf:
from langchain.cache import SQLiteCache

OCR_CACHE = SQLiteCache(str(CACHE_PATH.parent / "ocr_cache.db"))
Then, parse_pdf would have[^1]:
def parse_pdf(...):
    cache_key = dict(prompt=str(pdf_path), llm_string="")
    test_out = OCR_CACHE.lookup(**cache_key)
    # Cache hit: reuse the stored OCR; cache miss: parse and store it.
    out = _parse_pdf(...) if test_out is None else deserialize(test_out)
    if test_out is None:
        OCR_CACHE.update(**cache_key, return_val=serialize(out))
    return out

def _parse_pdf(...):
    # Regular _parse_pdf (the current parse_pdf body)
    ...
What do you think? For serialization I would use json.dumps and json.loads[^2]. One might argue that it's better to use a custom database for this, but why not keep it simple, as you are already using the langchain database.
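For concreteness, the helpers could be as simple as this sketch (hypothetical names, with the version check from footnote 2 included):

```python
import json

import paperqa


def serialize(out) -> str:
    # Store the parsed result alongside the paperqa version that produced it.
    return json.dumps({"version": paperqa.__version__, "data": out})


def deserialize(raw: str):
    entry = json.loads(raw)
    if entry["version"] != paperqa.__version__:
        # Written by a different version: treat it as a cache miss, so the
        # caller re-parses and overwrites the entry.
        return None
    return entry["data"]
```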
Cheers, Miles
[^1]: You might want to hash the PDF file itself, or maybe the first N bytes, rather than the filename. But I think for starters the filename is simpler and fine.
[^2]: For posterity, it would be wise to also serialize paperqa.__version__ in each cache entry, and, if an entry is from a different version, ignore it and overwrite it.
Great! Highly supportive - some of this work is duplicated from chroma, weaviate, pinecone, etc. But yes, I agree some slightly better caching would really help here.
Related to #38 - we should add an md5 hash on content when caching, instead of paths. Thanks for the contributions!
Sounds good. We could also move that file hashing function into utils.py for use elsewhere.
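Something like this rough sketch could live in utils.py (the name and signature are just a suggestion):

```python
import hashlib


def md5_file(path, max_bytes=None) -> str:
    """md5 of a file's content (or only its first max_bytes, if hashing whole PDFs is too slow)."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        h.update(f.read() if max_bytes is None else f.read(max_bytes))
    return h.hexdigest()
```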
Regarding the embedding, is there a way for FAISS to cache itself? It looks like there isn't any exposed piece of the code where caching could be set up on the paper-qa side.
e.g., the line: https://github.com/whitead/paper-qa/blob/861f8057f7974238d729d4c1e252b3d64f1dcd90/paperqa/docs.py#L178 looks to be where it both (1) computes the embedding, and (2) adds it to the search tree. But we would want a way to cache (1)...
Or maybe the langchain caching already handles the embeddings?
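To illustrate (1), here's a rough sketch of the kind of per-text cache I have in mind; cached_embed and embed_fn are hypothetical stand-ins for whatever embedding call docs.py actually makes:

```python
import hashlib
import json
import sqlite3


def cached_embed(texts, embed_fn, db_path="embedding_cache.db"):
    """Embed texts, reusing previously computed vectors keyed by an md5 of the text."""
    db = sqlite3.connect(db_path)
    db.execute("CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, vec TEXT)")
    vectors = []
    for text in texts:
        key = hashlib.md5(text.encode()).hexdigest()
        row = db.execute("SELECT vec FROM cache WHERE key = ?", (key,)).fetchone()
        if row is not None:
            vectors.append(json.loads(row[0]))
        else:
            vec = embed_fn([text])[0]  # e.g. embeddings.embed_documents([text])[0]
            db.execute("INSERT INTO cache VALUES (?, ?)", (key, json.dumps(vec)))
            db.commit()
            vectors.append(vec)
    return vectors
```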
@MilesCranmer - the pickling saves the FAISS index. What I usually do is just store my docs object in .paperqa/{name}.pkl, since that directory is already created for the llm cache. Would that work for you?
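For anyone finding this later, a minimal sketch of that pickle-based workflow (file names are arbitrary):

```python
import pickle
from pathlib import Path

from paperqa import Docs

cache_file = Path(".paperqa") / "my_project.pkl"
if cache_file.exists():
    docs = pickle.loads(cache_file.read_bytes())  # skips re-OCR and re-embedding
else:
    docs = Docs()
    docs.add("my_paper.pdf")  # OCR + embedding happen here
    cache_file.parent.mkdir(exist_ok=True)
    cache_file.write_bytes(pickle.dumps(docs))
```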
Thanks, yes, that helps!
Followup question: what is the expected workflow here?
- Is it to add all papers into a single Docs() object, and make queries of that all-encompassing library?
- Or would a possible workflow be to have different Docs() objects for subcollections of papers/different projects?
In case 1, pickling a single Docs() and searching it works well. But in case 2, it might make sense to cache paper embeddings individually, in case I want the same paper to appear in multiple Docs() objects, and perhaps quickly generate a Docs() object for, e.g., writing a literature review on a particular topic (using only pre-vetted papers). What do you think? I guess the workflow affects whether individual caching makes sense or not.
Cheers! Miles
One other reason to use a cache is the improved robustness against changing APIs. A cache could store simpler objects, like (paper_hash, embedding) pairs, which would be unaffected by new APIs. But a pickle file is a bit overkill and assumes the state is exactly the same, which might result in unexpected behavior.
@MilesCranmer Not finished yet, but I'd like to transition to hybrid search so that we have keyword + vector search. One of those keywords can be attached to a subcollection.
I've tried to get that effect via the new doc_match function that filters the vector similarity search, but long-term there might be better ways, like what's done in llama-index.
I see, sounds like a great idea!
Do you know where I could add the embeddings cache? It does seem like the embeddings are regenerated every time (unless I am unpickling an existing Docs() object), despite the langchain cache being set up.
I prefer pickling Docs as the recommended method of caching, just to reduce complexity
@MilesCranmer we just released version 5 today that:
- Moves all LLM management to https://github.com/BerriAI/litellm
- Drops both LangChain and faiss from dependencies
- Centers on a directory of texts (see the sketch below)
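A minimal sketch of the directory-centered usage, assuming the Settings/ask entry points from the version 5 README (exact field names may differ):

```python
from paperqa import Settings, ask

# Point paper-qa at a directory of papers; indexing and embedding are handled internally.
answer = ask(
    "How can carbon nanotubes be manufactured at a large scale?",
    settings=Settings(paper_directory="my_papers"),
)
```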
Reading up on this issue, it seems like the constraints here have changed enough that the issue perhaps no longer persists.
Leaving this open - feel free to follow up on this with paper-qa>=5.
Thanks! I’ll shelve this and let you know