How to add papers to an index manually
Hi,
The PaperQA documentation provides a code snippet for creating a reusable document index:
import os

from paperqa import Settings
from paperqa.agents.main import agent_query
from paperqa.agents.models import QueryRequest
from paperqa.agents.search import get_directory_index


async def amain(folder_of_papers: str | os.PathLike) -> None:
    settings = Settings(paper_directory=folder_of_papers)

    # 1. Build the index. Note an index name is autogenerated when unspecified
    built_index = await get_directory_index(settings=settings)
    print(settings.get_index_name())  # Display the autogenerated index name
    print(await built_index.index_files)  # Display the index contents

    # 2. Use the settings as many times as you want with ask
    answer_response_1 = await agent_query(
        query=QueryRequest(
            query="What is the best way to make a vaccine?", settings=settings
        )
    )
    answer_response_2 = await agent_query(
        query=QueryRequest(
            query="What manufacturing challenges are unique to bispecific antibodies?",
            settings=settings,
        )
    )
This way, the index is built automatically from the papers in a directory, but it does not seem possible to set a "citation" and "docname" for each paper, as can be done when adding papers to a Docs object. See the following:
for doc in tqdm(file_list):
    try:
        docs.add(
            str("./Papers/ToAdd/" + str(doc)),
            citation="File " + doc,
            docname=doc,
            settings=settings,
            embedding_model=hybrid_model,
        )
        shutil.move(str("./Papers/ToAdd/" + str(doc)), str("./Papers/Added/" + str(doc)))
        added_doc += 1
    except Exception as e:
        # sometimes this happens if PDFs aren't downloaded or readable
        print("Could not read", doc, e)
        shutil.move(str("./Papers/ToAdd/" + str(doc)), str("./Papers/Failed/" + str(doc)))
        failed_doc += 1
        continue

with open("./Docs/my_docs3.pkl", "wb") as f:
    compress_pickle.dump(docs, f, compression="lzma")
So I wonder whether there is a way to add papers manually to an empty SearchIndex object, one at a time, as can be done with a Docs object.
The documentation also provides a snippet for filling a Docs object with papers from ZoteroDB:
from paperqa import Docs
from paperqa.contrib import ZoteroDB

docs = Docs()
zotero = ZoteroDB(library_type="user")  # "group" if group library

for item in zotero.iterate(limit=20):
    if item.num_pages > 30:
        continue  # skip long papers
    docs.add(item.pdf, docname=item.key)
Is there a way to create a reusable index from ZoteroDB?
Best regards.
@Dosu.
To manually add papers to a SearchIndex object in PaperQA, you can use the add_document method of the SearchIndex class. This allows you to add documents with metadata such as 'citation' and 'docname'. Here's a general approach:
- Create an index_doc dictionary: Include metadata like 'title', 'year', 'file_location', and 'body' of the document.
- Prepare the document: If you have a document object, prepare it for storage.
- Use the add_document method: Add the document to the index by passing the index_doc dictionary and the document object.
Example:
await search_index.add_document(
    {
        "title": "Document Title",
        "year": "2023",
        "file_location": "/path/to/document",
        "body": "The full text of the document goes here.",
    },
    document=your_document_object,  # Optional
)
This method handles the addition of the document to the index, including writing the document to storage if needed [1].
For creating a reusable index from ZoteroDB, you can iterate over items in the Zotero library using the ZoteroDB class and add them to a Docs object. This process involves fetching metadata and PDFs from Zotero and adding them to the Docs object, which can then be used to build an index. Here's a snippet for adding documents from Zotero:
from paperqa import Docs
from paperqa.contrib import ZoteroDB

docs = Docs()
zotero = ZoteroDB(library_type="user")  # "group" if group library

for item in zotero.iterate(limit=20):
    if item.num_pages > 30:
        continue  # skip long papers
    docs.add(item.pdf, docname=item.key)
This method handles metadata by using the Zotero library to fetch and manage document details [2].
I've noticed that agent queries on a SearchIndex object and the query method of a Docs object (see above) give very different results.
What are the fundamental algorithmic differences between them?
Best regards.
In the example you gave:
await search_index.add_document(
    {
        "title": "Document Title",
        "year": "2023",
        "file_location": "/path/to/document",
        "body": "The full text of the document goes here.",
    },
    document=your_document_object,  # Optional
)
- How can I get the document body (which function)?
- How should I create the document object for each document?
Best regards.
@Dosu.
The fundamental algorithmic difference between querying with a SearchIndex object and a Docs object in PaperQA lies in the underlying search algorithms. The Docs object uses Maximal Marginal Relevance (MMR) search to retrieve texts, which balances relevance to the query against redundancy among the texts already selected [1]. Unfortunately, I couldn't find specific details about the algorithm used by the SearchIndex object.
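For readers unfamiliar with MMR, here is a minimal pure-Python sketch of the general technique. It is not PaperQA's actual implementation: the function names `cosine` and `mmr_select`, the `lambda_` trade-off parameter, and the use of plain lists as embeddings are all assumptions made for illustration.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(
    query: list[float],
    candidates: list[list[float]],
    k: int,
    lambda_: float = 0.5,
) -> list[int]:
    """Return indices of up to k candidates chosen greedily by MMR.

    lambda_ = 1.0 ranks purely by relevance to the query; lower values
    penalize candidates that are similar to ones already selected.
    """
    selected: list[int] = []
    remaining = list(range(len(candidates)))

    while remaining and len(selected) < k:

        def score(i: int) -> float:
            relevance = cosine(query, candidates[i])
            # Similarity to the closest already-selected candidate
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_ * relevance - (1 - lambda_) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)

    return selected
```

With two identical candidates and one distinct candidate, a `lambda_` of 0.5 skips the duplicate in favor of the distinct one, while a `lambda_` of 1.0 returns both copies because only relevance counts.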
Regarding your second question:
- Getting the document body: Unfortunately, I couldn't find a specific function or utility in PaperQA for extracting the full text (body) of a document from a file. You might need to implement a custom solution to read the document content, depending on the file format.
- Creating a document object: The process for creating a document object for use with the add_document method wasn't explicitly detailed in the available information. However, you would typically need to prepare the document's metadata and content in a structured format, such as a dictionary, before adding it to the index. This involves gathering details like the document's title, year, file location, and body text.
Is there a function to get the full text body of a PDF document?
@Dosu.
Yes, PaperQA provides a function called parse_pdf_to_pages for extracting the full text from a PDF document. This function takes a file path to a PDF document and optionally a page size limit, and it returns a ParsedText object containing the text extracted from each page of the PDF [1].
In this function call:
await search_index.add_document(
    {
        "title": "Document Title",
        "year": "2023",
        "file_location": "/path/to/document",
        "body": "The full text of the document goes here.",
    },
    document=your_document_object,  # Optional
)
Should I use the ParsedText object as the "body" entry, or should I concatenate the pages first?
@Dosu.
The body entry in the add_document method should be a string. Therefore, you should concatenate the text from the ParsedText object into a single string before using it as the body entry [1].