
Hoping for confirmation of a few high-level ideas?

Open plbremer opened this issue 8 months ago • 0 comments

Hi,

Thanks for putting together this very compelling tooling. I was hoping to ask a few specific questions about what is going on to make sure that everything is working as we expect before trying to productionize :)

  1. We can/should build a classic document retrieval index with Tantivy up-front when we have between 10,000 and 100,000 documents. This index does not involve a vector store at all.
  2. In the publication's Figure 1a, the Tantivy document store is the tool that the Paper Search agent is interacting with.
  3. Any vectorization that occurs happens on-the-fly with the Gather Evidence Agent. Where is this vectorization stored? Is it possible to slowly accumulate vectors somewhere? I recognize that we can save a Docs object; however, every query will probably retrieve a unique set of documents, so it is not clear if we can meaningfully aggregate previous vectorizations. (Obviously the system works even if we can't accumulate these meaningfully.)
  4. The README mentions options for larger-than-memory vector stores. Is this relevant for anything other than opting for a tremendously large k? Can we parametrically avoid this?
  5. If you have custom citations, or no citations, will the Citation Traversal agent simply not operate? Where does the citation graph come from? If I have internal documents, can I provide my own?
  6. It looks like my answers triggered the creation of an index. Is there any documentation around interacting with that SearchIndex?
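
To make question 3 concrete, here is a minimal sketch of the kind of accumulation being asked about: embeddings keyed by a hash of the chunk text, so a chunk retrieved again by a later query is not re-embedded. Everything in it is hypothetical and for illustration only; `EmbeddingCache` and `toy_embed` are made-up names, not paper-qa APIs, and the in-memory dict stands in for whatever persistent store would actually be used.

```python
import hashlib
from typing import Callable, Dict, List

Vector = List[float]


class EmbeddingCache:
    """Hypothetical cache mapping a hash of chunk text to its embedding."""

    def __init__(self, embed_fn: Callable[[str], Vector]) -> None:
        self._embed_fn = embed_fn
        self._store: Dict[str, Vector] = {}  # stand-in for a persistent store
        self.misses = 0  # number of times the embedding function was called

    @staticmethod
    def _key(text: str) -> str:
        # Content hash, so identical chunks share one entry across queries.
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get(self, text: str) -> Vector:
        key = self._key(text)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]


def toy_embed(text: str) -> Vector:
    # Placeholder for a real embedding model call (assumption, not a real API).
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


cache = EmbeddingCache(toy_embed)
for chunk in ["chunk A", "chunk B", "chunk A"]:  # "chunk A" is seen twice
    cache.get(chunk)
print(cache.misses)  # only unseen chunks trigger an embedding call
```

Under this scheme, overlapping document sets across queries would share vectors even though no two queries retrieve exactly the same set.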

Thanks for your time.

plbremer, Mar 27 '25 21:03