paper-qa: How to add papers to an index manually

Hi,

PaperQA's documentation provides a code snippet for creating a reusable document index:

import os

from paperqa import Settings
from paperqa.agents.main import agent_query
from paperqa.agents.models import QueryRequest
from paperqa.agents.search import get_directory_index


async def amain(folder_of_papers: str | os.PathLike) -> None:
    settings = Settings(paper_directory=folder_of_papers)

    # 1. Build the index. Note an index name is autogenerated when unspecified
    built_index = await get_directory_index(settings=settings)
    print(settings.get_index_name())  # Display the autogenerated index name
    print(await built_index.index_files)  # Display the index contents

    # 2. Use the settings as many times as you want with ask
    answer_response_1 = await agent_query(
        query=QueryRequest(
            query="What is the best way to make a vaccine?", settings=settings
        )
    )
    answer_response_2 = await agent_query(
        query=QueryRequest(
            query="What manufacturing challenges are unique to bispecific antibodies?",
            settings=settings,
        )
    )

This builds an index by adding papers automatically, but it does not seem possible to set "citation" and "docname" for each paper, as can be done when adding papers to a Docs() object. See the following:

for doc in tqdm(file_list):
    try:
        docs.add(str("./Papers/ToAdd/"+str(doc)),
                     citation="File " + doc,
                     docname=doc,
                     settings=settings,
                     embedding_model=hybrid_model)
        shutil.move(str("./Papers/ToAdd/"+str(doc)),str("./Papers/Added/"+str(doc)))
        added_doc += 1
    except Exception as e:
        # sometimes this happens if PDFs aren't downloaded or readable
        print("Could not read", doc, e)
        shutil.move(str("./Papers/ToAdd/"+str(doc)),str("./Papers/Failed/"+str(doc)))
        failed_doc += 1
        continue
                 
with open("./Docs/my_docs3.pkl", "wb") as f:
    compress_pickle.dump(docs, f, compression="lzma")

So I wonder whether there is a way to add papers manually to an empty SearchIndex() object, iteratively, as can be done with a Docs() object.

Also, the documentation provides a snippet to fill a Docs() object with papers from ZoteroDB:

from paperqa import Docs
from paperqa.contrib import ZoteroDB

docs = Docs()
zotero = ZoteroDB(library_type="user")  # "group" if group library

for item in zotero.iterate(limit=20):
    if item.num_pages > 30:
        continue  # skip long papers
    docs.add(item.pdf, docname=item.key)

Is there a way to create a reusable index from ZoteroDB?

Best regards.

@Dosu.

Jan 03 '25 15:01 Snikch63200

To manually add papers to an index in PaperQA, you can use the add_document method of the SearchIndex class. This allows you to add documents with metadata such as 'citation' and 'docname'. Here's a general approach:

1. Create an index_doc dictionary: Include metadata like 'title', 'year', 'file_location', and 'body' of the document.
2. Prepare the document: If you have a document object, prepare it for storage.
3. Use the add_document method: Add the document to the index by passing the index_doc dictionary and the document object.

Example:

await search_index.add_document(
    {
        "title": "Document Title",
        "year": "2023",
        "file_location": "/path/to/document",
        "body": "The full text of the document goes here.",
    },
    document=your_document_object,  # Optional
)

This method handles the addition of the document to the index, including writing the document to storage if needed ^[1].

For creating a reusable index from ZoteroDB, you can iterate over items in the Zotero library using the ZoteroDB class and add them to a Docs object. This process involves fetching metadata and PDFs from Zotero and adding them to the Docs object, which can then be used to build an index. Here's a snippet for adding documents from Zotero:

from paperqa import Docs
from paperqa.contrib import ZoteroDB

docs = Docs()
zotero = ZoteroDB(library_type="user")  # "group" if group library

for item in zotero.iterate(limit=20):
    if item.num_pages > 30:
        continue  # skip long papers
    docs.add(item.pdf, docname=item.key)

This method handles metadata by using the Zotero library to fetch and manage document details ^[2].

To continue talking to Dosu, mention @dosu.

Jan 03 '25 15:01 dosubot[bot]

I've noticed that agent queries on a SearchIndex() object and the query method of a Docs() object (see above) give very different results.

What are the fundamental algorithmic differences between them?

Best regards.

Jan 20 '25 10:01 Snikch63200

In the example you gave:

await search_index.add_document(
    {
        "title": "Document Title",
        "year": "2023",
        "file_location": "/path/to/document",
        "body": "The full text of the document goes here.",
    },
    document=your_document_object,  # Optional
)

How can I get the document body (which function)?
How should I create the document object for each document?

Best regards.

@Dosu.

Feb 12 '25 08:02 Snikch63200

The fundamental algorithmic difference between querying a SearchIndex object and a Docs object in PaperQA lies in the underlying search algorithms they use. The Docs object uses the Maximal Marginal Relevance (MMR) search algorithm to query documents, which retrieves texts that are relevant to the input query while limiting redundancy among the selected texts ^[1]. Unfortunately, I couldn't find specific details about the algorithm used by the SearchIndex object.
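For readers unfamiliar with MMR, here is a minimal, generic sketch of the selection rule on plain embedding vectors. It is not PaperQA's implementation: the function names `cosine` and `mmr_select` and the `lambda_` trade-off parameter are hypothetical and used only for illustration.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(
    query: list[float],
    candidates: list[list[float]],
    k: int,
    lambda_: float = 0.5,
) -> list[int]:
    """Return the indices of up to k candidates chosen greedily by MMR.

    lambda_ = 1.0 ranks by relevance only; lower values penalize
    candidates that are similar to ones already selected.
    """
    selected: list[int] = []
    remaining = list(range(len(candidates)))

    def score(i: int) -> float:
        relevance = cosine(query, candidates[i])
        # Similarity to the closest already-selected candidate (0 if none yet).
        redundancy = max(
            (cosine(candidates[i], candidates[j]) for j in selected),
            default=0.0,
        )
        return lambda_ * relevance - (1 - lambda_) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With a low `lambda_`, a near-duplicate of an already selected text is skipped in favor of a less similar one; with `lambda_=1.0` the selection reduces to plain similarity ranking.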

Regarding your second question:

Getting the Document Body: Unfortunately, I couldn't find a specific function or utility in PaperQA for extracting the full text (body) of a document from a file. You might need to implement a custom solution to read the document content, depending on the file format.
Creating a Document Object: The process for creating a document object for use with the add_document method wasn't explicitly detailed in the available information. However, you would typically need to prepare the document's metadata and content in a structured format, such as a dictionary, before adding it to the index. This involves gathering details like the document's title, year, file location, and body text.

Feb 12 '25 08:02 dosubot[bot]

Is there a function to get the full text body of a PDF document?

@Dosu.

Feb 12 '25 09:02 Snikch63200

Yes, PaperQA provides a function called parse_pdf_to_pages for extracting the full text from a PDF document. This function takes a file path to a PDF document and optionally a page size limit, and it returns a ParsedText object containing the text extracted from each page of the PDF ^[1].

Feb 12 '25 09:02 dosubot[bot]

In this function call:

await search_index.add_document(
    {
        "title": "Document Title",
        "year": "2023",
        "file_location": "/path/to/document",
        "body": "The full text of the document goes here.",
    },
    document=your_document_object,  # Optional
)

Should I pass the ParsedText object as the "body" entry, or should I concatenate the pages first?

@Dosu.

Feb 12 '25 09:02 Snikch63200

The body entry in the add_document method should be a string. Therefore, you should concatenate the text from the ParsedText object into a single string before using it as the body entry ^[1].

Feb 12 '25 09:02 dosubot[bot]