
[Bug]: The gleaning is *not* including the original input

Open ksachdeva opened this issue 1 year ago • 0 comments

Describe the bug


If you look at the function below:

https://github.com/microsoft/graphrag/blob/309abc982f158c38099c6098d30b35a20972d258/graphrag/index/graph/extractors/graph/graph_extractor.py#L148C5-L182C23

async def _process_document(
        self, text: str, prompt_variables: dict[str, str]
    ) -> str:
        response = await self._llm(
            self._extraction_prompt,
            variables={
                **prompt_variables,
                self._input_text_key: text,
            },
        )
        results = response.output or ""

        # Repeat to ensure we maximize entity count
        for i in range(self._max_gleanings):
            glean_response = await self._llm(
                CONTINUE_PROMPT,
                name=f"extract-continuation-{i}",
                history=response.history or [],
            )
            results += glean_response.output or ""

            # if this is the final glean, don't bother updating the continuation flag
            if i >= self._max_gleanings - 1:
                break

            continuation = await self._llm(
                LOOP_PROMPT,
                name=f"extract-loopcheck-{i}",
                history=glean_response.history or [],
                model_parameters=self._loop_args,
            )
            if continuation.output != "YES":
                break

        return results

The gleaning call does not include the original input:

glean_response = await self._llm(
    CONTINUE_PROMPT,
    name=f"extract-continuation-{i}",
    history=response.history or [],
)

response.history only includes the last output from the LLM, i.e. it is missing the GRAPH_EXTRACTION_PROMPT (and therefore the original text chunk).

I have verified this by looking at the exchange in the debugger as well.

Is this the expected implementation? I would have thought that, at a minimum, the original text (chunk) would be required in order to glean. As it stands, the gleaning only seems to use the last response.
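For reference, here is a minimal sketch of the kind of history I would expect the gleaning call to receive. The prompt strings and the `build_glean_history` helper below are hypothetical illustrations, not code from the graphrag repository; the point is only that the first user turn should still carry the original chunk:

```python
# Hypothetical sketch: construct a chat history for the continuation
# (gleaning) turn that keeps the original extraction prompt, and hence
# the source chunk, visible to the model. Prompt texts are illustrative.

GRAPH_EXTRACTION_PROMPT = "Extract entities from the following text:\n{input_text}"
CONTINUE_PROMPT = "MANY entities were missed in the last extraction. Add them below."


def build_glean_history(chunk: str, first_output: str) -> list[dict[str, str]]:
    """History for a gleaning turn that still contains the original chunk."""
    return [
        # Original extraction request, with the source chunk inlined.
        {"role": "user", "content": GRAPH_EXTRACTION_PROMPT.format(input_text=chunk)},
        # The model's first extraction pass.
        {"role": "assistant", "content": first_output},
    ]


history = build_glean_history("Alice met Bob in Paris.", '("entity"|ALICE|PERSON)')
# The continuation prompt is then appended on top of that full history:
messages = history + [{"role": "user", "content": CONTINUE_PROMPT}]
```

With a history shaped like this, the continuation call would see both the chunk and the prior output, instead of only the last response.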

Please guide. Thanks.

Steps to reproduce

No response

Expected Behavior

No response

GraphRAG Config Used

No response

Logs and screenshots

No response

Additional Information

  • GraphRAG Version:
  • Operating System:
  • Python Version:
  • Related Issues:

ksachdeva · Jul 18 '24 18:07