
EmptyDocsError: Not gathering evidence due to having no papers.

Open SURUIYUAN opened this issue 1 year ago • 3 comments

Why does it show "EmptyDocsError: Not gathering evidence due to having no papers."? Can it not detect my paper?

SURUIYUAN avatar Sep 24 '24 02:09 SURUIYUAN

The error "EmptyDocsError: Not gathering evidence due to having no papers." appears because the gather_evidence method in the GatherEvidence class checks whether any documents are present in state.docs.docs. If none are found, it raises EmptyDocsError. This means that no papers were detected or added to the state before gather_evidence was called [1].

To resolve this, ensure that papers are correctly added to state.docs.docs using methods like aadd_texts in the Docs class. This can be done through the paper_search function or by directly adding documents to the Docs object [1][2].
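As a self-contained illustration of the check described above, the sketch below uses simplified stand-in classes (the names mirror the snippet quoted here, but these stubs are not paper-qa's real implementation):

```python
# Stub illustration of the EmptyDocsError check. These classes are
# simplified stand-ins, not paper-qa's real GatherEvidence/EnvironmentState.
from dataclasses import dataclass, field


class EmptyDocsError(RuntimeError):
    """Raised when evidence gathering starts with no papers loaded."""


@dataclass
class Docs:
    docs: dict = field(default_factory=dict)  # maps dockey -> Doc


@dataclass
class EnvironmentState:
    docs: Docs = field(default_factory=Docs)


def gather_evidence(state: EnvironmentState) -> str:
    # The check that produces the reported error: nothing was ever
    # added to state.docs.docs before evidence gathering began.
    if not state.docs.docs:
        raise EmptyDocsError("Not gathering evidence due to having no papers.")
    return f"Gathering evidence over {len(state.docs.docs)} paper(s)."


state = EnvironmentState()
try:
    gather_evidence(state)  # raises, mirroring the reported behavior
except EmptyDocsError as err:
    print(err)

state.docs.docs["dockey-1"] = "fake-doc"  # simulate a successfully added paper
print(gather_evidence(state))  # no longer raises
```

In other words: the error is not about parsing failures inside a paper, it fires whenever the docs mapping is still empty at the moment gather_evidence runs.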

To continue talking to Dosu, mention @dosu.

dosubot[bot] avatar Sep 24 '24 02:09 dosubot[bot]

How do I modify this in a Python program? Please explain in detail.

SURUIYUAN avatar Sep 24 '24 02:09 SURUIYUAN

To ensure that papers are correctly added to state.docs.docs and avoid the "EmptyDocsError," you need to verify and possibly modify the paper_search method in the PaperSearch class. Here is a detailed explanation of the necessary steps:

  1. Ensure paper_search Method Updates state.docs.docs: The paper_search method should correctly update the state.docs.docs attribute by adding documents using the aadd_texts method. Here is the relevant part of the code:

    async def paper_search(
        self,
        query: str,
        min_year: int | None,
        max_year: int | None,
        state: EnvironmentState,
    ) -> str:
        # ... (other code)
    
        # combine all the resulting doc objects into one and update the state
        all_docs = []
        for r in results:
            this_doc = next(iter(r.docs.values()))
            all_docs.append(this_doc)
            await state.docs.aadd_texts(
                texts=r.texts,
                doc=this_doc,
                settings=self.settings,
                embedding_model=self.embedding_model,
            )
    
        status = state.status
        logger.info(status)
        # mark how far we've searched so that continuation will start at the right place
        self.previous_searches[search_key] += self.settings.agent.search_count
        if self.settings.agent.return_paper_metadata:
            retrieved_papers = "\n".join([f"{x.title} ({x.year})" for x in all_docs])
            return f"Retrieved Papers:\n{retrieved_papers}\n\n{status}"
        return status
    
  2. Verify results from index.query: Ensure that the results from the index.query call are not empty. If results is empty, no documents will be added to state.docs.docs, leading to the "EmptyDocsError".
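A way to make this failure mode visible earlier is to fail fast when the search comes back empty. The helper below is illustrative only; `results` is a plain list standing in for whatever `index.query` returns:

```python
def require_search_results(results: list, query: str) -> list:
    """Raise a descriptive error instead of a later, vaguer EmptyDocsError."""
    if not results:
        raise RuntimeError(
            f"paper_search found nothing for {query!r}; no documents will be "
            "added to state.docs.docs, so gather_evidence would fail with "
            "EmptyDocsError. Check the index and the paper directory."
        )
    return results


# Example: an empty search is caught immediately, with context.
try:
    require_search_results([], "PENS dataset")
except RuntimeError as err:
    print(err)

print(len(require_search_results(["doc-a", "doc-b"], "PENS dataset")))  # 2
```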

  3. Check aadd_texts Method: The aadd_texts method in the Docs class should correctly add texts to the docs dictionary. Here is the method signature and its parameters:

    async def aadd_texts(
        self,
        texts: list[Text],
        doc: Doc,
        settings: MaybeSettings = None,
        embedding_model: EmbeddingModel | None = None,
    ) -> bool:
    

    Ensure that this method is functioning correctly and that the texts and doc parameters are properly passed.

  4. Ensure Docs Class is Correctly Implemented: The Docs class should manage the docs dictionary and other related attributes properly. Here is a snippet of the Docs class for reference:

    class Docs(BaseModel):
        """A collection of documents to be used for answering questions."""
    
        model_config = ConfigDict(extra="forbid")
    
        id: UUID = Field(default_factory=uuid4)
        docs: dict[DocKey, Doc | DocDetails] = Field(default_factory=dict)
        texts: list[Text] = Field(default_factory=list)
        docnames: set[str] = Field(default_factory=set)
        texts_index: VectorStore = Field(default_factory=NumpyVectorStore)
        name: str = Field(default="default", description="Name of this docs collection")
        index_path: Path | None = Field(
            default=PAPERQA_DIR, description="Path to save index", validate_default=True
        )
        deleted_dockeys: set[DocKey] = Field(default_factory=set)
    
        @field_validator("index_path")
        @classmethod
        def handle_default(cls, value: Path | None, info: ValidationInfo) -> Path | None:
            if value == PAPERQA_DIR:
                return PAPERQA_DIR / info.data["name"]
            return value
    
        def clear_docs(self) -> None:
            self.texts = []
            self.docs = {}
            self.docnames = set()
    
        def _get_unique_name(self, docname: str) -> str:
            """Create a unique name given proposed name."""
            suffix = ""
            while (docname + suffix) in self.docnames:
                suffix = "a" if suffix == "" else chr(ord(suffix) + 1)
            docname += suffix
            return docname
    
        async def aadd_file(
            self,
            file: BinaryIO,
            citation: str | None = None,
            docname: str | None = None,
            dockey: DocKey | None = None,
            title: str | None = None,
            doi: str | None = None,
            authors: list[str] | None = None,
            settings: MaybeSettings = None,
            llm_model: LLMModel | None = None,
            embedding_model: EmbeddingModel | None = None,
            **kwargs,
        ) -> str | None:
            """Add a document to the collection."""
            # just put in temp file and use existing method
            suffix = ".txt"
            if maybe_is_pdf(file):
                suffix = ".pdf"
            elif maybe_is_html(file):
                suffix = ".html"
    
            with tempfile.NamedTemporaryFile(suffix=suffix) as f:
                f.write(file.read())
                f.seek(0)
                return await self.aadd(
                    Path(f.name),
                    citation=citation,
                    docname=docname,
                    dockey=dockey,
                    title=title,
                    doi=doi,
                    authors=authors,
                    settings=settings,
                    llm_model=llm_model,
                    embedding_model=embedding_model,
                    **kwargs,
                )
    
        async def aadd(
            self,
            path: Path,
            citation: str | None = None,
            docname: str | None = None,
            dockey: DocKey | None = None,
            title: str | None = None,
            doi: str | None = None,
            authors: list[str] | None = None,
            settings: MaybeSettings = None,
            llm_model: LLMModel | None = None,
            embedding_model: EmbeddingModel | None = None,
            **kwargs,
        ) -> str | None:
            """Add a document to the collection."""
            all_settings = get_settings(settings)
            parse_config = all_settings.parsing
            if dockey is None:
                dockey = md5sum(path)
            if llm_model is None:
                llm_model = all_settings.get_llm()
            if citation is None:
                texts = read_doc(
                    path,
                    Doc(docname="", citation="", dockey=dockey),  # Fake doc
                    chunk_chars=parse_config.chunk_size,
                    overlap=parse_config.overlap,
                )
                if not texts:
                    raise ValueError(f"Could not read document {path}. Is it empty?")
                result = await llm_model.run_prompt(
                    prompt=parse_config.citation_prompt,
                    data={"text": texts[0].text},
                    skip_system=True,
                )
                citation = result.text
                if (
                    len(citation) < 3
                    or "Unknown" in citation
                    or "insufficient" in citation
                ):
                    citation = f"Unknown, {os.path.basename(path)}, {datetime.now().year}"
            if docname is None:
                match = re.search(r"([A-Z][a-z]+)", citation)
                if match is not None:
                    author = match.group(1)
                else:
                    raise ValueError(
                        f"Could not parse docname from citation {citation}. "
                        "Consider just passing key explicitly - e.g. docs.py "
                        "(path, citation, key='mykey')"
                    )
                year = ""
                match = re.search(r"(\d{4})", citation)
                if match is not None:
                    year = match.group(1)
                docname = f"{author}{year}"
            docname = self._get_unique_name(docname)
    
            doc = Doc(docname=docname, citation=citation, dockey=dockey)
            self.docs[dockey] = doc
            self.docnames.add(docname)
            self.texts.extend(texts)
            self._build_texts_index()
            return docname
    

By ensuring that the paper_search method correctly updates state.docs.docs and verifying the implementation of the Docs class and its methods, you can avoid the "EmptyDocsError" and ensure that papers are correctly detected and added to the state [1][2].
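The docname logic quoted above is easy to sanity-check in isolation. The sketch below reproduces the `_get_unique_name` suffixing and the author/year regexes from `aadd` as standalone functions (same regexes, simplified plumbing):

```python
import re


def get_unique_name(docname: str, existing: set[str]) -> str:
    """Mirrors Docs._get_unique_name: append a, b, c, ... until unique."""
    suffix = ""
    while (docname + suffix) in existing:
        suffix = "a" if suffix == "" else chr(ord(suffix) + 1)
    return docname + suffix


def docname_from_citation(citation: str) -> str:
    """Mirrors the docname parsing in aadd: first capitalized word plus year."""
    match = re.search(r"([A-Z][a-z]+)", citation)
    if match is None:
        raise ValueError(f"Could not parse docname from citation {citation}")
    author = match.group(1)
    year_match = re.search(r"(\d{4})", citation)
    year = year_match.group(1) if year_match is not None else ""
    return f"{author}{year}"


print(docname_from_citation("Smith et al., A Study of Things, 2023"))  # Smith2023
print(get_unique_name("Smith2023", {"Smith2023"}))  # Smith2023a
```

This also shows why a citation with no capitalized word (or one the LLM returned as "Unknown, ...") can derail docname generation well before evidence gathering runs.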

To continue talking to Dosu, mention @dosu.

dosubot[bot] avatar Sep 24 '24 02:09 dosubot[bot]

Hi there, I hit the same problem. Did you solve the 'Not gathering evidence due to having no papers' problem by following Dosu's suggestions? Could you please let us know what the key was? Thanks.

wight911 avatar Feb 03 '25 03:02 wight911

I attached my output info below. There is a PDF file in my_papers/. The OpenAI API works correctly, but when I changed the model to ollama/llama3.2, it stopped working properly.

I put some print(...) calls inside 'async def paper_search(...)', but their output never appears. It seems paper_search() is never called.

OUTPUT INFO:

(paperqa) D:\git-base\paper-qa\my_papers>pqa --settings my_local_settings ask "what is PENS dataset?"

    C:\Users\wight\anaconda3\envs\paperqa\Lib\site-packages\pydantic\_internal\_config.py:345: UserWarning: Valid config keys have changed in V2:
      * 'fields' has been removed
      warnings.warn(message, UserWarning)
    entering main ---------
    cli_settings: CliSettingsSource(env_nested_delimiter='.', env_prefix_len=0)
    args: Namespace(settings='my_local_settings', index='default', command='ask', query='what is PENS dataset?')
    remaining_args: []
    Could not find cost for model ollama/llama3.2.
    Encountered exception during tool call for tool gather_evidence: EmptyDocsError('Not gathering evidence due to having no papers.')
    [12:32:16] Trajectory failed.
    ╭─ Traceback (most recent call last) ─╮
    │ C:\Users\wight\anaconda3\envs\paperqa\Lib\site-packages\litellm\main.py:463 in acompletion
    │   460 │   │   │   │   response = ModelResponse(**init_response)
    │   461 │   │   │   response = init_response
    │   462 │   │   elif asyncio.iscoroutine(init_response):
    │ ❱ 463 │   │   │   response = await init_response
    │   464 │   │   else:

.......(truncated for brevity) ......

    │   265 │   │   │   │   │   │   },
    │   266 │   │   │   │   │   │   "type": "function",
    ╰─────────────────────────────────╯

KeyError: 'name'

During handling of the above exception, another exception occurred:

    ╭─ Traceback (most recent call last) ─╮
    │ C:\Users\wight\anaconda3\envs\paperqa\Lib\site-packages\paperqa\agents\main.py:157 in _run_with_timeout_failure
    │   154 ) -> tuple[PQASession, AgentStatus]:
    │   155 │   try:
    │   156 │   │   async with asyncio.timeout(settings.agent.timeout):

.......(truncated for brevity) ......

    │ 2169 │   │   │   │   │   ),
    ╰─────────────────────────────────╯
APIConnectionError: litellm.APIConnectionError: 'name'
Traceback (most recent call last):
  File "C:\Users\wight\anaconda3\envs\paperqa\Lib\site-packages\litellm\main.py", line 463, in acompletion
    response = await init_response
               ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\wight\anaconda3\envs\paperqa\Lib\site-packages\litellm\llms\custom_httpx\llm_http_handler.py", line 177, in async_completion
    return provider_config.transform_response(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\wight\anaconda3\envs\paperqa\Lib\site-packages\litellm\llms\ollama\completion\transformation.py", line 263, in transform_response
    "name": function_call["name"],
            ~~~~~~~~~~~~~^^^^^^^^
KeyError: 'name'

       Received Model Group=ollama/llama3.2
       Available Model Group Fallbacks=None LiteLLM Retried: 2 times, LiteLLM Max Retries: 3

[12:32:24] Generating answer for 'what is PENS dataset?'.
Could not find cost for model ollama/llama3.2.
[12:35:38] Status: Paper Count=0 | Relevant Papers=0 | Current Evidence=0 | Current Cost=$0.0000
Answer: I cannot provide information about the PENS dataset as it is not a widely recognized or well-documented dataset. Can I help you with something else?

(paperqa) D:\git-base\paper-qa\my_papers>

wight911 avatar Feb 03 '25 04:02 wight911

Hello @SURUIYUAN and @wight911,

@SURUIYUAN, do you mind sharing minimal code to reproduce the problem? It is possible that your settings.agent.index.paper_directory points to the wrong directory. Please check this tutorial, where we show how to work with the Settings class.
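For reference, a minimal sketch of pointing paper-qa at a local papers folder (the exact field names can vary across paper-qa versions, so treat this as an assumption to verify against your installed release; the path shown is the directory from the log above):

```python
# Hypothetical configuration sketch -- verify field names against your
# installed paper-qa version before relying on them.
from paperqa import Settings, ask

settings = Settings(paper_directory="D:/git-base/paper-qa/my_papers")
# In some versions the same path is nested under the agent settings:
#   settings.agent.index.paper_directory
answer = ask("what is PENS dataset?", settings=settings)
```

If the directory is wrong or empty, no papers are indexed and gather_evidence will raise EmptyDocsError, which matches the symptoms reported here.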

@wight911 It seems that the CLI worked well with OpenAI's models, is that correct? We know that ollama/llama3.2 sometimes fails to follow the required schema, which leads to errors like this. Please check this discussion.

maykcaldas avatar Mar 13 '25 17:03 maykcaldas

Hi all to circle back on a few things here:

Could not find cost for model ollama/llama3.2.

This is a litellm missing cost entry: https://github.com/BerriAI/litellm/issues/9644

APIConnectionError: litellm.APIConnectionError: 'name'
Traceback (most recent call last):
  File "C:\Users\wight\anaconda3\envs\paperqa\Lib\site-packages\litellm\main.py", line 463, in acompletion
    response = await init_response
               ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\wight\anaconda3\envs\paperqa\Lib\site-packages\litellm\llms\custom_httpx\llm_http_handler.py", line 177, in async_completion
    return provider_config.transform_response(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\wight\anaconda3\envs\paperqa\Lib\site-packages\litellm\llms\ollama\completion\transformation.py", line 263, in transform_response
    "name": function_call["name"],
            ~~~~~~~~~~~~~^^^^^^^^
KeyError: 'name'

This is a litellm bug: https://github.com/BerriAI/litellm/issues/8594

Hopefully they get fixed, but these are not paper-qa issues; they're issues in our upstream dependencies.

jamesbraza avatar Mar 30 '25 01:03 jamesbraza