Using external vector DBs
One of my struggles when using PaperQA is the inference time every time I run a query. Is there a way to import external vector DBs rather than relying solely on NumpyVectorStore? Caching is not an option, since I need the embedded results on a scheduled basis and they are stored in the cloud to save memory.
The documentation is lacking when it comes to providing external embeddings; I would appreciate a working sample use case.
Hey @markokow, I'm here to help you out with any bugs or questions you have while waiting for a human maintainer. Let's squash those bugs together! 🤖
PaperQA provides an interface for plugging in custom vector stores, allowing you to use different vector stores and embedding models. By default, it uses NumpyVectorStore as a dense in-memory vector store, but you can specify a custom embedding model when adding new papers to the Docs object. Here's an example using a custom hybrid embedding model:
```python
from paperqa import Docs, HybridEmbeddingModel, SparseEmbeddingModel, LiteLLMEmbeddingModel

model = HybridEmbeddingModel(
    models=[LiteLLMEmbeddingModel(), SparseEmbeddingModel(ndim=1024)]
)
docs = Docs()
for doc in ("myfile.pdf", "myotherfile.pdf"):
    docs.add(doc, embedding_model=model)
```
This setup allows you to integrate different vector stores and embedding models as needed. However, specific support or plugins for external vector databases like Pinecone, Weaviate, or FAISS are not explicitly mentioned in the available documentation [1].
To continue talking to Dosu, mention @dosu.
@dosu Hi Dosu, I want you to show me how we can use FAISS or other vector stores to store our embeddings data. The example you mentioned does not contain any code showing explicitly how to change the vector store from NumpyVectorStore to another VectorStore.
Hello @markokow and @Radvian.
We currently have NumpyVectorStore and QdrantVectorStore implemented; both inherit from VectorStore.
If you're operating Docs directly, you can implement your own VectorStore and pass it to Docs using the texts_index argument: https://github.com/Future-House/paper-qa/blob/main/paperqa/docs.py#L76
Does that answer your question?
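To make the `texts_index` wiring concrete, here is a hypothetical, dependency-free sketch of the injection pattern it uses. The class names mimic paper-qa's but are stand-ins, not the real API — they only show how the constructor argument swaps the storage backend:

```python
# Hypothetical stand-ins for paper-qa's VectorStore / Docs -- NOT the real
# classes, just an illustration of constructor injection via texts_index.
class VectorStore:
    """Minimal stand-in for paperqa.llms.VectorStore."""
    def add_texts(self, texts):
        raise NotImplementedError

class ListVectorStore(VectorStore):
    """Toy backend that just keeps chunks in a Python list."""
    def __init__(self):
        self.texts = []
    def add_texts(self, texts):
        self.texts.extend(texts)

class Docs:
    """Stand-in for paperqa.Docs: the storage backend is injected."""
    def __init__(self, texts_index=None):
        self.texts_index = texts_index or ListVectorStore()
    def add(self, chunks):
        self.texts_index.add_texts(chunks)

# Swapping the backend is just passing a different instance:
docs = Docs(texts_index=ListVectorStore())
docs.add(["chunk one", "chunk two"])
```

In the real library you would pass your `VectorStore` subclass (e.g. a FAISS- or Qdrant-backed one) as `texts_index` in exactly the same way.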
To use FAISS or other vector stores instead of NumpyVectorStore in PaperQA, you need to implement a new class that inherits from the VectorStore abstract base class. Here's a basic outline of how you can implement a vector store using FAISS:
```python
from collections.abc import Iterable, Sequence

import faiss  # pip install faiss-cpu (or faiss-gpu)
import numpy as np

from paperqa.llms import EmbeddingModel, VectorStore
from paperqa.types import Embeddable


class FaissVectorStore(VectorStore):
    def __init__(self, dimension: int):
        super().__init__()
        # Flat (exact) index using L2 distance; swap in another FAISS
        # index type if you need approximate search.
        self.index = faiss.IndexFlatL2(dimension)
        self.texts: list[Embeddable] = []
        self.texts_hashes: set[int] = set()

    async def add_texts_and_embeddings(self, texts: Iterable[Embeddable]) -> None:
        texts = list(texts)
        # FAISS expects a contiguous float32 matrix of shape (n, dimension)
        embeddings = np.array([t.embedding for t in texts], dtype=np.float32)
        self.index.add(embeddings)
        self.texts.extend(texts)
        self.texts_hashes.update(hash(t) for t in texts)

    async def similarity_search(
        self, query: str, k: int, embedding_model: EmbeddingModel
    ) -> tuple[Sequence[Embeddable], list[float]]:
        if not self.texts:
            return [], []
        k = min(k, len(self.texts))  # avoid FAISS padding results with -1 indices
        np_query = np.array(
            (await embedding_model.embed_documents([query]))[0], dtype=np.float32
        )
        distances, indices = self.index.search(np_query.reshape(1, -1), k)
        # Note: IndexFlatL2 returns squared L2 distances (lower is better).
        return (
            [self.texts[i] for i in indices[0]],
            distances[0].tolist(),
        )

    def clear(self) -> None:
        self.index.reset()
        self.texts = []
        self.texts_hashes = set()
```
This example uses a flat FAISS index with L2 distance. Adjust the index type and distance metric to your requirements; for example, IndexFlatIP over normalized vectors gives cosine similarity, where higher scores (rather than lower distances) are better. Ensure that the FAISS library is installed in your environment [1].
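For intuition (and to sanity-check a custom store without installing FAISS), `IndexFlatL2.search` for a single query is equivalent to this brute-force NumPy computation — a sketch for illustration, not part of paper-qa:

```python
import numpy as np

def brute_force_l2_search(index_vectors: np.ndarray, query: np.ndarray, k: int):
    """Return (distances, indices) like faiss.IndexFlatL2.search for one query.

    Like FAISS's IndexFlatL2, this returns *squared* L2 distances,
    sorted so the nearest vector comes first.
    """
    diffs = index_vectors - query          # broadcasts over rows
    dists = (diffs ** 2).sum(axis=1)       # squared L2 distance per row
    order = np.argsort(dists)[:k]          # indices of the k smallest
    return dists[order], order

vectors = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]])
query = np.array([1.0, 0.0])
dists, idx = brute_force_l2_search(vectors, query, k=2)
# nearest is [1, 0] itself (distance 0), then [0, 0] (distance 1)
```

A flat FAISS index performs exactly this exhaustive scan (in optimized C++); approximate index types like IVF or HNSW trade a little recall for much faster search on large collections.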