paper-qa
EmptyDocsError: Not gathering evidence due to having no papers.
Why does it show "EmptyDocsError: Not gathering evidence due to having no papers."? Can it not detect my paper?
The error "EmptyDocsError: Not gathering evidence due to having no papers." appears because the gather_evidence method in the GatherEvidence class checks if there are any documents present in the state.docs.docs. If no documents are found, it raises the EmptyDocsError. This indicates that no papers have been detected or added to the state before calling the gather_evidence method [1].
To resolve this, ensure that papers are correctly added to state.docs.docs using methods like aadd_texts in the Docs class. This can be done through the paper_search function or by directly adding documents to the Docs object [1][2].
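For intuition, here is a minimal self-contained sketch of the kind of guard that raises this error. The class names here are hypothetical (this is not paper-qa's actual code); the point is simply that the evidence-gathering step refuses to run when the docs collection is empty:

```python
# Simplified sketch of the empty-docs guard. `FakeDocs` is a stand-in,
# not paper-qa's actual Docs class.
class EmptyDocsError(RuntimeError):
    """Raised when evidence gathering starts with no documents loaded."""


class FakeDocs:
    def __init__(self) -> None:
        self.docs: dict[str, str] = {}  # dockey -> citation


def gather_evidence(docs: FakeDocs, question: str) -> str:
    # The real tool raises before doing any retrieval if nothing was added.
    if not docs.docs:
        raise EmptyDocsError("Not gathering evidence due to having no papers.")
    return f"Gathering evidence for: {question}"


docs = FakeDocs()
try:
    gather_evidence(docs, "what is PENS dataset?")
except EmptyDocsError as e:
    print(e)  # Not gathering evidence due to having no papers.

docs.docs["abc123"] = "Some Paper (2024)"
print(gather_evidence(docs, "what is PENS dataset?"))
```

So the error is a symptom, not the root cause: something upstream (search or document ingestion) failed to populate the docs collection.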
To continue talking to Dosu, mention @dosu.
How do I modify this in my Python program? Please explain in detail.
To ensure that papers are correctly added to state.docs.docs and avoid the "EmptyDocsError," you need to verify and possibly modify the paper_search method in the PaperSearch class. Here is a detailed explanation of the necessary steps:
1. **Ensure `paper_search` updates `state.docs.docs`:** The `paper_search` method should correctly update the `state.docs.docs` attribute by adding documents via the `aadd_texts` method. Here is the relevant part of the code:

```python
async def paper_search(
    self,
    query: str,
    min_year: int | None,
    max_year: int | None,
    state: EnvironmentState,
) -> str:
    # ... (other code)

    # combine all the resulting doc objects into one and update the state
    all_docs = []
    for r in results:
        this_doc = next(iter(r.docs.values()))
        all_docs.append(this_doc)
        await state.docs.aadd_texts(
            texts=r.texts,
            doc=this_doc,
            settings=self.settings,
            embedding_model=self.embedding_model,
        )

    status = state.status
    logger.info(status)

    # mark how far we've searched so that continuation will start at the right place
    self.previous_searches[search_key] += self.settings.agent.search_count

    if self.settings.agent.return_paper_metadata:
        retrieved_papers = "\n".join([f"{x.title} ({x.year})" for x in all_docs])
        return f"Retrieved Papers:\n{retrieved_papers}\n\n{status}"
    return status
```

2. **Verify `results` from `index.query`:** Ensure that the `results` from the `index.query` call are not empty. If `results` is empty, no documents will be added to `state.docs.docs`, leading to the "EmptyDocsError".

3. **Check the `aadd_texts` method:** The `aadd_texts` method in the `Docs` class should correctly add texts to the `docs` dictionary. Here is the method signature and its parameters:

```python
async def aadd_texts(
    self,
    texts: list[Text],
    doc: Doc,
    settings: MaybeSettings = None,
    embedding_model: EmbeddingModel | None = None,
) -> bool:
```

Ensure that this method is functioning correctly and that the `texts` and `doc` parameters are properly passed.

4. **Ensure the `Docs` class is correctly implemented:** The `Docs` class should manage the `docs` dictionary and other related attributes properly. Here is a snippet of the `Docs` class for reference:

```python
class Docs(BaseModel):
    """A collection of documents to be used for answering questions."""

    model_config = ConfigDict(extra="forbid")

    id: UUID = Field(default_factory=uuid4)
    docs: dict[DocKey, Doc | DocDetails] = Field(default_factory=dict)
    texts: list[Text] = Field(default_factory=list)
    docnames: set[str] = Field(default_factory=set)
    texts_index: VectorStore = Field(default_factory=NumpyVectorStore)
    name: str = Field(default="default", description="Name of this docs collection")
    index_path: Path | None = Field(
        default=PAPERQA_DIR, description="Path to save index", validate_default=True
    )
    deleted_dockeys: set[DocKey] = Field(default_factory=set)

    @field_validator("index_path")
    @classmethod
    def handle_default(cls, value: Path | None, info: ValidationInfo) -> Path | None:
        if value == PAPERQA_DIR:
            return PAPERQA_DIR / info.data["name"]
        return value

    def clear_docs(self) -> None:
        self.texts = []
        self.docs = {}
        self.docnames = set()

    def _get_unique_name(self, docname: str) -> str:
        """Create a unique name given proposed name."""
        suffix = ""
        while (docname + suffix) in self.docnames:
            suffix = "a" if suffix == "" else chr(ord(suffix) + 1)
        docname += suffix
        return docname

    async def aadd_file(
        self,
        file: BinaryIO,
        citation: str | None = None,
        docname: str | None = None,
        dockey: DocKey | None = None,
        title: str | None = None,
        doi: str | None = None,
        authors: list[str] | None = None,
        settings: MaybeSettings = None,
        llm_model: LLMModel | None = None,
        embedding_model: EmbeddingModel | None = None,
        **kwargs,
    ) -> str | None:
        """Add a document to the collection."""
        # just put in temp file and use existing method
        suffix = ".txt"
        if maybe_is_pdf(file):
            suffix = ".pdf"
        elif maybe_is_html(file):
            suffix = ".html"
        with tempfile.NamedTemporaryFile(suffix=suffix) as f:
            f.write(file.read())
            f.seek(0)
            return await self.aadd(
                Path(f.name),
                citation=citation,
                docname=docname,
                dockey=dockey,
                title=title,
                doi=doi,
                authors=authors,
                settings=settings,
                llm_model=llm_model,
                embedding_model=embedding_model,
                **kwargs,
            )

    async def aadd(
        self,
        path: Path,
        citation: str | None = None,
        docname: str | None = None,
        dockey: DocKey | None = None,
        title: str | None = None,
        doi: str | None = None,
        authors: list[str] | None = None,
        settings: MaybeSettings = None,
        llm_model: LLMModel | None = None,
        embedding_model: EmbeddingModel | None = None,
        **kwargs,
    ) -> str | None:
        """Add a document to the collection."""
        all_settings = get_settings(settings)
        parse_config = all_settings.parsing
        if dockey is None:
            dockey = md5sum(path)
        if llm_model is None:
            llm_model = all_settings.get_llm()
        if citation is None:
            texts = read_doc(
                path,
                Doc(docname="", citation="", dockey=dockey),  # Fake doc
                chunk_chars=parse_config.chunk_size,
                overlap=parse_config.overlap,
            )
            if not texts:
                raise ValueError(f"Could not read document {path}. Is it empty?")
            result = await llm_model.run_prompt(
                prompt=parse_config.citation_prompt,
                data={"text": texts[0].text},
                skip_system=True,
            )
            citation = result.text
            if (
                len(citation) < 3
                or "Unknown" in citation
                or "insufficient" in citation
            ):
                citation = f"Unknown, {os.path.basename(path)}, {datetime.now().year}"
        if docname is None:
            match = re.search(r"([A-Z][a-z]+)", citation)
            if match is not None:
                author = match.group(1)
            else:
                raise ValueError(
                    f"Could not parse docname from citation {citation}. "
                    "Consider just passing key explicitly - e.g. docs.py "
                    "(path, citation, key='mykey')"
                )
            year = ""
            match = re.search(r"(\d{4})", citation)
            if match is not None:
                year = match.group(1)
            docname = f"{author}{year}"
        docname = self._get_unique_name(docname)
        doc = Doc(docname=docname, citation=citation, dockey=dockey)
        self.docs[dockey] = doc
        self.docnames.add(docname)
        self.texts.extend(texts)
        self._build_texts_index()
        return docname
```
By ensuring that the paper_search method correctly updates state.docs.docs and verifying the implementation of the Docs class and its methods, you can avoid the "EmptyDocsError" and ensure that papers are correctly detected and added to the state [1][2].
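If you want to sanity-check the docname logic outside the library, the unique-suffixing and citation-parsing steps shown above can be reproduced in isolation. This is a simplified standalone sketch (the helper names and the example citation are mine, not paper-qa's):

```python
import re


def get_unique_name(docname: str, existing: set[str]) -> str:
    """Append 'a', 'b', ... until the name is unique (mirrors Docs._get_unique_name)."""
    suffix = ""
    while (docname + suffix) in existing:
        suffix = "a" if suffix == "" else chr(ord(suffix) + 1)
    return docname + suffix


def docname_from_citation(citation: str) -> str:
    """Derive an 'AuthorYear' docname from a citation string (mirrors Docs.aadd)."""
    match = re.search(r"([A-Z][a-z]+)", citation)
    if match is None:
        raise ValueError(f"Could not parse docname from citation {citation}")
    author = match.group(1)
    year_match = re.search(r"(\d{4})", citation)
    year = year_match.group(1) if year_match is not None else ""
    return f"{author}{year}"


name = docname_from_citation("Wu et al., PENS: A Dataset for News Recommendation, 2020")
print(name)  # Wu2020
print(get_unique_name(name, {"Wu2020"}))  # Wu2020a
```

This can help confirm whether a document's citation would parse into a usable docname before it ever reaches the docs dictionary.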
To continue talking to Dosu, mention @dosu.
Hi there, I hit the same problem as you. Did you solve the 'Not gathering evidence due to having no papers' problem by following Dosu's suggestions? Could you please let us know what the key step was? Thanks.
I attached my output info below. There is a PDF file in my_papers/. I can use the OpenAI API correctly, but when I changed the model to ollama/llama3.2, it does not seem to work properly.
I put some print(...) calls inside 'async def paper_search(...)' but never see their output. It seems paper_search() is never called.
OUTPUT INFO:
(paperqa) D:\git-base\paper-qa\my_papers>pqa --settings my_local_settings ask "what is PENS dataset?"
C:\Users\wight\anaconda3\envs\paperqa\Lib\site-packages\pydantic\_internal\_config.py:345: UserWarning: Valid config keys have changed in V2:
- 'fields' has been removed
  warnings.warn(message, UserWarning)
entering main ---------
cli_settings: CliSettingsSource(env_nested_delimiter='.', env_prefix_len=0)
args: Namespace(settings='my_local_settings', index='default', command='ask', query='what is PENS dataset?')
remaining_args: []
Could not find cost for model ollama/llama3.2.
Encountered exception during tool call for tool gather_evidence: EmptyDocsError('Not gathering evidence due to having no papers.')
[12:32:16] Trajectory failed.
╭──────────────────────────────────── Traceback (most recent call last) ─────────────────────────────────────╮
│ C:\Users\wight\anaconda3\envs\paperqa\Lib\site-packages\litellm\main.py:463 in acompletion │
│ │
│ 460 │ │ │ │ response = ModelResponse(**init_response) │
│ 461 │ │ │ response = init_response │
│ 462 │ │ elif asyncio.iscoroutine(init_response): │
│ ❱ 463 │ │ │ response = await init_response │
│ 464 │ │ else: │
.......(truncated for brevity)......
│ 265 │ │ │ │ │ │ }, │
│ 266 │ │ │ │ │ │ "type": "function", │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
KeyError: 'name'
During handling of the above exception, another exception occurred:
╭──────────────────────────────────── Traceback (most recent call last) ─────────────────────────────────────╮
│ C:\Users\wight\anaconda3\envs\paperqa\Lib\site-packages\paperqa\agents\main.py:157 in │
│ _run_with_timeout_failure │
│ │
│ 154 ) -> tuple[PQASession, AgentStatus]: │
│ 155 │ try: │
│ 156 │ │ async with asyncio.timeout(settings.agent.timeout): │
.......(truncated for brevity)......
│ 2169 │ │ │ │ │ ), │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
APIConnectionError: litellm.APIConnectionError: 'name'
Traceback (most recent call last):
File "C:\Users\wight\anaconda3\envs\paperqa\Lib\site-packages\litellm\main.py", line 463, in acompletion
response = await init_response
^^^^^^^^^^^^^^^^^^^
File
"C:\Users\wight\anaconda3\envs\paperqa\Lib\site-packages\litellm\llms\custom_httpx\llm_http_handler.py", line
177, in async_completion
return provider_config.transform_response(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File
"C:\Users\wight\anaconda3\envs\paperqa\Lib\site-packages\litellm\llms\ollama\completion\transformation.py",
line 263, in transform_response
"name": function_call["name"],
~~~~~~~~~~~~~^^^^^^^^
KeyError: 'name'
Received Model Group=ollama/llama3.2
Available Model Group Fallbacks=None
LiteLLM Retried: 2 times, LiteLLM Max Retries: 3
[12:32:24] Generating answer for 'what is PENS dataset?'.
Could not find cost for model ollama/llama3.2.
[12:35:38] Status: Paper Count=0 | Relevant Papers=0 | Current Evidence=0 | Current Cost=$0.0000
Answer: I cannot provide information about the PENS dataset as it is not a widely recognized or well-documented dataset. Can I help you with something else?
(paperqa) D:\git-base\paper-qa\my_papers>
Hello @SURUIYUAN and @wight911,
@SURUIYUAN, do you mind sharing a minimal code snippet to reproduce the problem? Is it possible that your settings.agent.index.paper_directory points to the wrong directory?
Please check this tutorial, where we show how to work with the Settings class.
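Before digging into settings, it can help to rule out an empty or mistyped paper directory with a plain-stdlib check. This is a standalone sketch independent of paper-qa (the helper name and accepted extensions are my own choices):

```python
from pathlib import Path


def check_paper_directory(paper_directory: str) -> list[Path]:
    """Return the parseable files found, or raise a clear error early."""
    directory = Path(paper_directory)
    if not directory.is_dir():
        raise FileNotFoundError(f"paper_directory does not exist: {directory}")
    papers = sorted(
        p for p in directory.iterdir()
        if p.suffix.lower() in {".pdf", ".txt", ".html", ".md"}
    )
    if not papers:
        raise FileNotFoundError(f"No papers found in {directory}")
    return papers


# Example: point this at the same directory your settings reference, e.g.
# print(check_paper_directory("my_papers"))
```

If this raises, the "having no papers" error is explained before any model or indexing behavior comes into play.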
@wight911, it seems that the CLI worked well with OpenAI's models, is that correct? We know that ollama/llama3.2 sometimes fails to follow the required schema, which leads to errors like this. Please check this discussion.
Hi all, to circle back on a few things here:
Could not find cost for model ollama/llama3.2.
This is a litellm missing cost entry: https://github.com/BerriAI/litellm/issues/9644
APIConnectionError: litellm.APIConnectionError: 'name'
Traceback (most recent call last):
  File "C:\Users\wight\anaconda3\envs\paperqa\Lib\site-packages\litellm\main.py", line 463, in acompletion
    response = await init_response
               ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\wight\anaconda3\envs\paperqa\Lib\site-packages\litellm\llms\custom_httpx\llm_http_handler.py", line 177, in async_completion
    return provider_config.transform_response(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\wight\anaconda3\envs\paperqa\Lib\site-packages\litellm\llms\ollama\completion\transformation.py", line 263, in transform_response
    "name": function_call["name"],
            ~~~~~~~~~~~~~^^^^^^^^
KeyError: 'name'
This is a litellm bug: https://github.com/BerriAI/litellm/issues/8594
Hopefully they get fixed, but these are not paper-qa issues; they're issues in our upstream dependencies.