
[Bug]: GraphRAG fails due to "1 validation error for BaseModelOutput"

Open · R-Fischer47 opened this issue 8 months ago · 6 comments

Do you need to file an issue?

  • [x] I have searched the existing issues and this bug is not already filed.
  • [x] My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
  • [x] I believe this is a legitimate bug, not just a question. If this is a question, please use the Discussions area.

Describe the bug

When running GraphRAG on 30 documents (and burning 6.6M tokens), the process failed with "❌ Errors occurred during the pipeline run, see logs for more details."

In the error logs, besides some 429s from OpenAI, I found a pydantic validation error that likely crashed the pipeline:

{ "type": "error", "data": "Error running pipeline!", "stack": "Traceback (most recent call last):\n File "/Users/richardfischer/Projects/NIZO/solution/NIZO-KnowledgeRetrieval/.venv/lib/python3.11/site-packages/graphrag/index/run/run_pipeline.py", line 143, in _run_pipeline\n result = await workflow_function(config, context)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/richardfischer/Projects/NIZO/solution/NIZO-KnowledgeRetrieval/.venv/lib/python3.11/site-packages/graphrag/index/workflows/extract_graph.py", line 46, in run_workflow\n entities, relationships = await extract_graph(\n ^^^^^^^^^^^^^^^^^^^^\n File "/Users/richardfischer/Projects/NIZO/solution/NIZO-KnowledgeRetrieval/.venv/lib/python3.11/site-packages/graphrag/index/workflows/extract_graph.py", line 106, in extract_graph\n entities, relationships = await get_summarized_entities_relationships(\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/richardfischer/Projects/NIZO/solution/NIZO-KnowledgeRetrieval/.venv/lib/python3.11/site-packages/graphrag/index/workflows/extract_graph.py", line 127, in get_summarized_entities_relationships\n entity_summaries, relationship_summaries = await summarize_descriptions(\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/richardfischer/Projects/NIZO/solution/NIZO-KnowledgeRetrieval/.venv/lib/python3.11/site-packages/graphrag/index/operations/summarize_descriptions/summarize_descriptions.py", line 150, in summarize_descriptions\n return await get_summarized(entities_df, relationships_df, semaphore)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/richardfischer/Projects/NIZO/solution/NIZO-KnowledgeRetrieval/.venv/lib/python3.11/site-packages/graphrag/index/operations/summarize_descriptions/summarize_descriptions.py", line 120, in get_summarized\n edge_results = await asyncio.gather(*edge_futures)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/richardfischer/Projects/NIZO/solution/NIZO-KnowledgeRetrieval/.venv/lib/python3.11/site-packages/graphrag/index/operations/summarize_descriptions/summarize_descriptions.py", line 142, in do_summarize_descriptions\n results = await strategy_exec(\n ^^^^^^^^^^^^^^^^^^^^\n File "/Users/richardfischer/Projects/NIZO/solution/NIZO-KnowledgeRetrieval/.venv/lib/python3.11/site-packages/graphrag/index/operations/summarize_descriptions/graph_intelligence_strategy.py", line 37, in run_graph_intelligence\n return await run_summarize_descriptions(llm, id, descriptions, callbacks, args)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/richardfischer/Projects/NIZO/solution/NIZO-KnowledgeRetrieval/.venv/lib/python3.11/site-packages/graphrag/index/operations/summarize_descriptions/graph_intelligence_strategy.py", line 68, in run_summarize_descriptions\n result = await extractor(id=id, descriptions=descriptions)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/richardfischer/Projects/NIZO/solution/NIZO-KnowledgeRetrieval/.venv/lib/python3.11/site-packages/graphrag/index/operations/summarize_descriptions/description_summary_extractor.py", line 72, in call\n result = await self._summarize_descriptions(id, descriptions)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/richardfischer/Projects/NIZO/solution/NIZO-KnowledgeRetrieval/.venv/lib/python3.11/site-packages/graphrag/index/operations/summarize_descriptions/description_summary_extractor.py", line 109, in _summarize_descriptions\n result = await self._summarize_descriptions_with_llm(\n 
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/richardfischer/Projects/NIZO/solution/NIZO-KnowledgeRetrieval/.venv/lib/python3.11/site-packages/graphrag/index/operations/summarize_descriptions/description_summary_extractor.py", line 128, in _summarize_descriptions_with_llm\n response = await self._model.achat(\n ^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/richardfischer/Projects/NIZO/solution/NIZO-KnowledgeRetrieval/.venv/lib/python3.11/site-packages/graphrag/language_model/providers/fnllm/models.py", line 282, in achat\n output=BaseModelOutput(content=response.output.content),\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/richardfischer/Projects/NIZO/solution/NIZO-KnowledgeRetrieval/.venv/lib/python3.11/site-packages/pydantic/main.py", line 253, in init\n validated_self = self.pydantic_validator.validate_python(data, self_instance=self)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\npydantic_core._pydantic_core.ValidationError: 1 validation error for BaseModelOutput\ncontent\n Input should be a valid string [type=string_type, input_value=None, input_type=NoneType]\n For further information visit https://errors.pydantic.dev/2.11/v/string_type\n", "source": "1 validation error for BaseModelOutput\ncontent\n Input should be a valid string [type=string_type, input_value=None, input_type=NoneType]\n For further information visit https://errors.pydantic.dev/2.11/v/string_type", "details": null }
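
The traceback shows the crash happens where the fnllm provider wrapper builds BaseModelOutput(content=response.output.content) and the returned content is None. As a rough sketch of the kind of guard that would avoid the raw ValidationError (editorial illustration, not GraphRAG's actual code; build_output and EmptyModelOutputError are hypothetical names), the construction could reject None content explicitly:

```python
# Hypothetical guard around the failing construction in
# graphrag/language_model/providers/fnllm/models.py (sketch, not the actual fix).
from pydantic import BaseModel


class EmptyModelOutputError(RuntimeError):
    """Raised when the provider returns a response with no text content."""


class BaseModelOutput(BaseModel):
    # Simplified stand-in for graphrag's BaseModelOutput; the real class has
    # more fields, but `content` is the one failing validation here.
    content: str


def build_output(raw_content: str | None) -> BaseModelOutput:
    # The provider can hand back None content (e.g. a refusal or a
    # content-filtered completion), so check before pydantic validation and
    # raise a descriptive, catchable error instead of a bare ValidationError.
    if raw_content is None:
        raise EmptyModelOutputError(
            "LLM returned no content; the response may have been refused or filtered."
        )
    return BaseModelOutput(content=raw_content)
```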

Basically, the model (gpt-4o-mini) did not properly fill in the JSON format. This happens occasionally, even with models trained for structured output.

Would it be possible for you to implement retry when such an error is encountered?

Otherwise this library becomes completely worthless to our customer: the upfront investment is really high, and indexing their whole knowledge base will cost thousands of euros. If I cannot say with absolute certainty that the indexing run will complete successfully, I cannot recommend this approach.
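
For concreteness, the kind of retry being asked for could look like the sketch below: treat None or empty content the same way as a transient API error and retry with backoff before giving up. Everything here is an assumption based on the traceback above (the achat callable and response shape are illustrative, not GraphRAG APIs):

```python
# Sketch of the requested retry behaviour; `achat` is any async chat callable
# returning an object with `.output.content` (an assumption, not a GraphRAG API).
import asyncio
import random


async def achat_with_retry(achat, prompt: str, *, attempts: int = 5, base_delay: float = 2.0) -> str:
    last_error: Exception | None = None
    for attempt in range(attempts):
        try:
            response = await achat(prompt)
            content = getattr(response.output, "content", None)
            if content:
                return content  # success: non-empty string content
            last_error = ValueError("model returned empty content")
        except Exception as err:  # e.g. 429s, timeouts, transient API errors
            last_error = err
        # Exponential backoff with jitter before the next attempt.
        await asyncio.sleep(base_delay * (2**attempt) + random.random())
    raise RuntimeError(f"LLM call failed after {attempts} attempts") from last_error
```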

After this validation error, there are another 25 errors like the following, which I believe are follow-on errors caused by the first one:

{ "type": "error", "data": "Error Invoking LLM", "stack": "Traceback (most recent call last):\n File "/Users/richardfischer/Projects/NIZO/solution/NIZO-KnowledgeRetrieval/.venv/lib/python3.11/site-packages/fnllm/base/base_llm.py", line 144, in call\n return await self._decorated_target(prompt, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/richardfischer/Projects/NIZO/solution/NIZO-KnowledgeRetrieval/.venv/lib/python3.11/site-packages/fnllm/base/services/json.py", line 78, in invoke\n return await delegate(prompt, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/richardfischer/Projects/NIZO/solution/NIZO-KnowledgeRetrieval/.venv/lib/python3.11/site-packages/fnllm/base/services/cached.py", line 128, in invoke\n cached = await self._cache.get(key)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/richardfischer/Projects/NIZO/solution/NIZO-KnowledgeRetrieval/.venv/lib/python3.11/site-packages/graphrag/language_model/providers/fnllm/cache.py", line 25, in get\n return await self._cache.get(key)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/richardfischer/Projects/NIZO/solution/NIZO-KnowledgeRetrieval/.venv/lib/python3.11/site-packages/graphrag/cache/json_pipeline_cache.py", line 26, in get\n if await self.has(key):\n ^^^^^^^^^^^^^^^^^^^\n File "/Users/richardfischer/Projects/NIZO/solution/NIZO-KnowledgeRetrieval/.venv/lib/python3.11/site-packages/graphrag/cache/json_pipeline_cache.py", line 52, in has\n return await self._storage.has(key)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/richardfischer/Projects/NIZO/solution/NIZO-KnowledgeRetrieval/.venv/lib/python3.11/site-packages/graphrag/storage/file_pipeline_storage.py", line 129, in has\n return await exists(join_path(self._root_dir, key))\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/richardfischer/Projects/NIZO/solution/NIZO-KnowledgeRetrieval/.venv/lib/python3.11/site-packages/aiofiles/ospath.py", line 14, in run\n return await loop.run_in_executor(executor, pfunc)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\nasyncio.exceptions.CancelledError\n", "source": "", "details": { "prompt": "\nYou are a helpful assistant responsible for generating a comprehensive summary of the data provided below.\nGiven one or two entities, and a list of descriptions, all related to the same entity or group of entities.\nPlease concatenate all of these into a single, comprehensive description. Make sure to include information collected from all the descriptions.\nIf the provided descriptions are contradictory, please resolve the contradictions and provide a single, coherent summary.\nMake sure it is written in third person, and include the entity names so we have the full context.\n\n#######\n-Data-\nEntities: ["NIZO FOOD RESEARCH B.V.", "DANONE VITAPOLE"]\nDescription List: ["NIZO Food Research B.V. conducted a study involving diets formulated by Danone Vitapole for rats", "NIZO Food Research B.V. conducted research using samples provided by Danone Vitapole", "NIZO food research B.V. collaborates with Danone Vitapole to discuss probiotic strains and their health effects"]\n#######\nOutput:\n", "kwargs": { "name": "summarize", "model_parameters": { "max_tokens": 500 } } } }

I am looking forward to hearing your opinion and what we can do to make GraphRAG more resilient!

Steps to reproduce

  • 4o-mini
  • api version: 2024-10-21
  • settings yaml:

(see the full settings.yaml posted under "GraphRAG Config Used" below)

Expected Behavior

I expect GraphRAG to handle errors like this gracefully and not waste the roughly 6M tokens I just spent building the graph from a small subset of the whole dataset.

GraphRAG Config Used

### This config file contains required core defaults that must be set, along with a handful of common optional settings.
### For a full list of available settings, see https://microsoft.github.io/graphrag/config/yaml/

### LLM settings ###
## There are a number of settings to tune the threading and token limits for LLM calls - check the docs.

models:
  default_chat_model:
    type: azure_openai_chat # or azure_openai_chat
    api_base: {set_but_removed}
    api_version: "2024-10-21"
    auth_type: api_key # or azure_managed_identity
    api_key: ${GRAPHRAG_API_KEY} # set this in the generated .env file
    # audience: "https://cognitiveservices.azure.com/.default"
    # organization: <organization_id>
    model: gpt-4o-mini
    deployment_name: gpt-4o-mini
    # encoding_model: cl100k_base # automatically set by tiktoken if left undefined
    model_supports_json: true # recommended if this is available for your model.
    concurrent_requests: 25 # max number of simultaneous LLM requests allowed
    async_mode: threaded # or asyncio
    retry_strategy: native
    max_retries: -1 # set to -1 for dynamic retry logic (most optimal setting based on server response)
    tokens_per_minute: 0 # set to 0 to disable rate limiting
    requests_per_minute: 0 # set to 0 to disable rate limiting
  default_embedding_model:
    type: azure_openai_embedding # or azure_openai_embedding
    api_base: {set_but_removed}
    api_version: "2024-10-21"
    auth_type: api_key # or azure_managed_identity
    api_key: ${GRAPHRAG_API_KEY}
    # audience: "https://cognitiveservices.azure.com/.default"
    # organization: <organization_id>
    model: text-embedding-3-large
    deployment_name: text-embedding-3-large
    # encoding_model: cl100k_base # automatically set by tiktoken if left undefined
    model_supports_json: true # recommended if this is available for your model.
    concurrent_requests: 25 # max number of simultaneous LLM requests allowed
    async_mode: threaded # or asyncio
    retry_strategy: native
    max_retries: -1 # set to -1 for dynamic retry logic (most optimal setting based on server response)
    tokens_per_minute: 0 # set to 0 to disable rate limiting
    requests_per_minute: 0 # set to 0 to disable rate limiting

vector_store:
  default_vector_store:
    type: lancedb
    db_uri: output/lancedb
    container_name: default
    overwrite: True

embed_text:
  model_id: default_embedding_model
  vector_store_id: default_vector_store

### Input settings ###

input:
  type: file # or blob
  file_type: text # [csv, text, json]
  base_dir: "input"

chunks:
  size: 1200
  overlap: 100
  group_by_columns: [id]

### Output settings ###
## If blob storage is specified in the following four sections,
## connection_string and container_name must be provided

cache:
  type: file # [file, blob, cosmosdb]
  base_dir: "cache"

reporting:
  type: file # [file, blob, cosmosdb]
  base_dir: "logs"

output:
  type: file # [file, blob, cosmosdb]
  base_dir: "output"

### Workflow settings ###

extract_graph:
  model_id: default_chat_model
  prompt: "prompts/extract_graph.txt"
  entity_types: [organization, person, geo, event]
  max_gleanings: 1

summarize_descriptions:
  model_id: default_chat_model
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

extract_graph_nlp:
  text_analyzer:
    extractor_type: regex_english # [regex_english, syntactic_parser, cfg]

extract_claims:
  enabled: false
  model_id: default_chat_model
  prompt: "prompts/extract_claims.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 1

community_reports:
  model_id: default_chat_model
  graph_prompt: "prompts/community_report_graph.txt"
  text_prompt: "prompts/community_report_text.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes (embed_graph must also be enabled)

snapshots:
  graphml: true
  embeddings: false

### Query settings ###
## The prompt locations are required here, but each search method has a number of optional knobs that can be tuned.
## See the config docs: https://microsoft.github.io/graphrag/config/yaml/#query

local_search:
  chat_model_id: default_chat_model
  embedding_model_id: default_embedding_model
  prompt: "prompts/local_search_system_prompt.txt"

global_search:
  chat_model_id: default_chat_model
  map_prompt: "prompts/global_search_map_system_prompt.txt"
  reduce_prompt: "prompts/global_search_reduce_system_prompt.txt"
  knowledge_prompt: "prompts/global_search_knowledge_system_prompt.txt"

drift_search:
  chat_model_id: default_chat_model
  embedding_model_id: default_embedding_model
  prompt: "prompts/drift_search_system_prompt.txt"
  reduce_prompt: "prompts/drift_search_reduce_prompt.txt"

basic_search:
  chat_model_id: default_chat_model
  embedding_model_id: default_embedding_model
  prompt: "prompts/basic_search_system_prompt.txt"

Logs and screenshots

see above

Additional Information

  • GraphRAG Version: 2.1.0
  • Operating System: MacOS
  • Python Version: 3.11.11
  • Related Issues:

R-Fischer47 · Apr 07 '25

If you set model_supports_json: true in your model config, it should enforce JSON via the OpenAI API call (which should be compatible with 4o-mini). If you don't set that, we do have fallback checks for malformed JSON, but they can't catch everything.
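
For context, JSON enforcement at the API level typically looks like the sketch below (plain OpenAI SDK against Azure, shown for illustration only; this is not GraphRAG's internal call, and the endpoint is a placeholder). Note that even with JSON mode the returned content can still be None:

```python
# Illustration only: roughly what JSON mode looks like against Azure OpenAI
# with the official SDK. This is not GraphRAG's internal code.
from openai import AsyncAzureOpenAI

client = AsyncAzureOpenAI(
    api_version="2024-10-21",
    azure_endpoint="https://<your-resource>.openai.azure.com",  # placeholder
    # api_key is read from the AZURE_OPENAI_API_KEY environment variable if omitted
)


async def chat_json(prompt: str) -> str:
    response = await client.chat.completions.create(
        model="gpt-4o-mini",  # the Azure deployment name
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # ask the model to emit valid JSON
        max_tokens=500,
    )
    # Even with JSON mode, `content` can come back as None (refusal or content
    # filter), so the caller still has to handle the empty case.
    return response.choices[0].message.content or ""
```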

Re: burned tokens: we cache LLM calls aggressively, so if you re-run indexing it should skip the calls that already completed and continue from where it failed.

natoverse · Apr 08 '25

As you can see in the config file that I posted, model_supports_json was indeed set to true. Do you have the fallbacks in place even if model_supports_json is set to true?

Do you retry when unparsable structured output is returned? And is that even the problem here? The response content came back as None, so there might be something else going on too.

I will rerun it and see how many tokens it will burn this time around.

R-Fischer47 · Apr 09 '25

We supply a pydantic model for JSON requests, which OpenAI guarantees will validate. That may not equate to a valid response if None is returned, though. Based on their post, I wonder if this may be a request refusal. Will dig in a bit more.
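
If it is a refusal, recent OpenAI SDK versions expose it on the message itself, so a check along these lines (an editorial sketch, not part of fnllm or GraphRAG; whether the refusal field is populated depends on the SDK and API version) could separate a refusal from a genuinely broken response:

```python
# Sketch: surface an explicit refusal instead of a generic validation error.
# The `refusal` field exists on chat completion messages in recent openai SDK
# versions; whether it is populated depends on the model and API version.
def check_for_refusal(response) -> None:
    message = response.choices[0].message
    refusal = getattr(message, "refusal", None)
    if refusal:
        raise RuntimeError(f"Model refused the request: {refusal}")
```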

natoverse · Apr 09 '25

I was wrong earlier: the version of fnllm we are on does not yet supply the Pydantic model to OpenAI; we try to parse the response and fit it into the model ourselves. The code here indicates it should throw an exception on parse errors, but it will return None if None is sent.
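
In other words, the behaviour described above is roughly the following (a sketch under the assumption that the JSON service parses the text and fits it into the model itself; parse_into_model is a hypothetical name, not fnllm's actual code):

```python
# Sketch of the parse-then-validate behaviour described above; an assumption
# about what the JSON handling does, not fnllm's actual implementation.
import json

from pydantic import BaseModel, ValidationError


def parse_into_model(raw: str | None, model_cls: type[BaseModel]) -> BaseModel | None:
    if raw is None:
        # Nothing to parse: the None slips through and only blows up later,
        # when downstream code assumes a string was returned.
        return None
    try:
        return model_cls.model_validate(json.loads(raw))
    except (json.JSONDecodeError, ValidationError) as err:
        # Malformed JSON is caught and surfaced here, unlike the None case.
        raise ValueError(f"model returned unparsable JSON: {err}") from err
```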

natoverse · Apr 10 '25

This issue has been marked stale due to inactivity after repo maintainer or community member responses that request more information or suggest a solution. It will be closed after five additional days.

github-actions[bot] · Apr 17 '25

Hi community members and maintainers. Thank you for your amazing work!

I put a breakpoint here and observed that response.output.raw_model.choices[0].model_extra["content_filter_results"]["sexual"] was unexpectedly set to {'filtered': True, 'severity': 'medium'}.

I believe the above is just one example of an unexpected response; we should never assume that the fnllm response's response.output.content is always a valid string.
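
Based on that observation, a pipeline-side guard could look roughly like this sketch (content_filter_hits is a hypothetical helper; the attribute path mirrors the breakpoint observation above and may differ across API versions). If any category reports filtered: True, the call could be logged and retried or skipped instead of crashing on validation:

```python
# Sketch: inspect Azure's content-filter annotations on the raw completion
# before assuming `content` is a usable string. The attribute path mirrors the
# breakpoint observation above; the exact shape may vary by API version.
def content_filter_hits(raw_completion) -> dict:
    choice = raw_completion.choices[0]
    results = (choice.model_extra or {}).get("content_filter_results", {})
    # Keep only the categories the filter actually triggered on.
    return {
        category: detail
        for category, detail in results.items()
        if isinstance(detail, dict) and detail.get("filtered")
    }
```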

I hope this problem is solved soon, as a single failed LLM call ruins the whole indexing pipeline.

yotaro-shimose · Apr 24 '25