
[Issue]: KeyError: "title" when generating community reports using build_index()

Open droideronline opened this issue 8 months ago • 4 comments

Do you need to file an issue?

  • [x] I have searched the existing issues and this bug is not already filed.
  • [x] My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
  • [x] I believe this is a legitimate bug, not just a question. If this is a question, please use the Discussions area.

Describe the issue

When running build_index() via the Python API (GraphRAG v2.2.0, Python 3.11.0), the process fails during community report extraction due to a missing "title" key in the prompt template formatting.
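
For context: GraphRAG renders its prompt templates with Python's str.format(), so any literal brace in a prompt must be doubled as {{ / }}. The snippet below is a minimal illustration with a made-up prompt line (not the actual template) of how an unescaped {title} reproduces this error:

    # str.format() treats every single-braced token as a placeholder to fill,
    # so a stray {title} in the prompt text raises KeyError: 'title'
    broken = "Each report needs a title line: {title}\n\nText: {input_text}"
    try:
        broken.format(input_text="...")
    except KeyError as e:
        print("KeyError:", e)  # KeyError: 'title'

    # Doubling the literal braces fixes it
    fixed = "Each report needs a title line: {{title}}\n\nText: {input_text}"
    fixed.format(input_text="...")  # renders "{title}" literally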

Steps to reproduce

  1. Set up Graphrag v2.2.0 with Python 3.11.
  2. Run the build_index() function via the Python API (see the sketch after this list).
  3. Observe error when creating community reports.
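
A minimal repro sketch, assuming the v2.x Python API surface (graphrag.api.build_index plus graphrag.config.load_config.load_config) and a default graphrag init project layout; the ./ragtest root is a placeholder:

    import asyncio
    from pathlib import Path

    from graphrag.api import build_index
    from graphrag.config.load_config import load_config

    async def main() -> None:
        # Load settings.yaml (the config quoted below) from the project root
        config = load_config(Path("./ragtest"))
        # Fails while generating community reports with KeyError: 'title'
        results = await build_index(config=config)
        for result in results:
            print(result.workflow, result.errors or "ok")

    asyncio.run(main())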

GraphRAG Config Used

### This config file contains required core defaults that must be set, along with a handful of common optional settings.
### For a full list of available settings, see https://microsoft.github.io/graphrag/config/yaml/

### LLM settings ###
## There are a number of settings to tune the threading and token limits for LLM calls - check the docs.

models:
  default_chat_model:
    type: azure_openai_chat  # or openai_chat
    api_base: https://xxxxx.com
    api_version: 2023-07-01-preview
    auth_type: api_key # or azure_managed_identity
    api_key: ${GRAPHRAG_API_KEY} # set this in the generated .env file
    # audience: "https://cognitiveservices.azure.com/.default"
    # organization: <organization_id>
    model: gpt-4-turbo-preview
    deployment_name: gpt-4o-mini-2024-07-18 # deployment name is the model name
    # encoding_model: cl100k_base # automatically set by tiktoken if left undefined
    model_supports_json: true # recommended if this is available for your model.
    concurrent_requests: 25 # max number of simultaneous LLM requests allowed
    async_mode: threaded # or asyncio
    retry_strategy: native
    max_retries: -1                   # set to -1 for dynamic retry logic (most optimal setting based on server response)
    tokens_per_minute: 0              # set to 0 to disable rate limiting
    requests_per_minute: 0            # set to 0 to disable rate limiting
  default_embedding_model:
    type: azure_openai_embedding # or openai_embedding
    api_base: https://xxxxx.com
    api_version: 2023-07-01-preview
    auth_type: api_key # or azure_managed_identity
    api_key: ${GRAPHRAG_API_KEY}
    # audience: "https://cognitiveservices.azure.com/.default"
    # organization: <organization_id>
    model: text-embedding-3-small
    deployment_name: text-embedding-ada-002
    # encoding_model: cl100k_base # automatically set by tiktoken if left undefined
    model_supports_json: true # recommended if this is available for your model.
    concurrent_requests: 25 # max number of simultaneous LLM requests allowed
    async_mode: threaded # or asyncio
    retry_strategy: native
    max_retries: -1                   # set to -1 for dynamic retry logic (most optimal setting based on server response)
    tokens_per_minute: 0              # set to 0 to disable rate limiting
    requests_per_minute: 0            # set to 0 to disable rate limiting

vector_store:
  default_vector_store:
    type: lancedb
    db_uri: output\lancedb
    container_name: default
    overwrite: True

embed_text:
  model_id: default_embedding_model
  vector_store_id: default_vector_store

### Input settings ###

input:
  type: file # or blob
  file_type: text # [csv, text, json]
  base_dir: "input"

chunks:
  size: 1200
  overlap: 100
  group_by_columns: [id]

### Output settings ###
## If blob storage is specified in the following four sections,
## connection_string and container_name must be provided

cache:
  type: file # [file, blob, cosmosdb]
  base_dir: "cache"

reporting:
  type: file # [file, blob, cosmosdb]
  base_dir: "logs"

output:
  type: file # [file, blob, cosmosdb]
  base_dir: "output"

### Workflow settings ###

extract_graph:
  model_id: default_chat_model
  prompt: "prompts/extract_graph.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 1

summarize_descriptions:
  model_id: default_chat_model
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

extract_graph_nlp:
  text_analyzer:
    extractor_type: regex_english # [regex_english, syntactic_parser, cfg]

extract_claims:
  enabled: false
  model_id: default_chat_model
  prompt: "prompts/extract_claims.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 1

community_reports:
  model_id: default_chat_model
  graph_prompt: "prompts/community_report_graph.txt"
  text_prompt: "prompts/community_report_text.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes (embed_graph must also be enabled)

snapshots:
  graphml: true
  embeddings: false

### Query settings ###
## The prompt locations are required here, but each search method has a number of optional knobs that can be tuned.
## See the config docs: https://microsoft.github.io/graphrag/config/yaml/#query

local_search:
  chat_model_id: default_chat_model
  embedding_model_id: default_embedding_model
  prompt: "prompts/local_search_system_prompt.txt"

global_search:
  chat_model_id: default_chat_model
  map_prompt: "prompts/global_search_map_system_prompt.txt"
  reduce_prompt: "prompts/global_search_reduce_system_prompt.txt"
  knowledge_prompt: "prompts/global_search_knowledge_system_prompt.txt"

drift_search:
  chat_model_id: default_chat_model
  embedding_model_id: default_embedding_model
  prompt: "prompts/drift_search_system_prompt.txt"
  reduce_prompt: "prompts/drift_search_reduce_prompt.txt"

basic_search:
  chat_model_id: default_chat_model
  embedding_model_id: default_embedding_model
  prompt: "prompts/basic_search_system_prompt.txt"


Logs and screenshots

[Screenshot: KeyError: "title" traceback during community report generation]

Additional Information

  • GraphRAG Version: v2.2.0
  • Operating System: Windows
  • Python Version: 3.11.0
  • Related Issues:

droideronline avatar Apr 26 '25 19:04 droideronline

Any resolution for this?

rhushikesh avatar May 13 '25 11:05 rhushikesh

I have seen it failing in "prompts/community_report_graph.txt".

rhushikesh avatar May 13 '25 11:05 rhushikesh

Count the number of { and } in extract_graph.txt to see if they match, because the {variable} format is required.
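
A quick way to automate that check (a sketch using only the standard library; the path is illustrative):

    from string import Formatter

    def check_template(path: str) -> None:
        text = open(path, encoding="utf-8").read()
        # A mismatched brace count is the first red flag
        print("{ count:", text.count("{"), "  } count:", text.count("}"))
        try:
            # Formatter().parse raises ValueError on stray single braces
            fields = {f for _, f, _, _ in Formatter().parse(text) if f}
            print("placeholders:", sorted(fields))
        except ValueError as e:
            print("format error:", e)

    check_template("prompts/extract_graph.txt")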

wangsiyu666 avatar Jun 27 '25 02:06 wangsiyu666

We currently run a sanitisation step on prompts after prompt tuning, as a stopgap before storing them for indexing. These errors are more common when working with smaller models like mini, which tend to introduce formatting errors. The most common issues we've seen are:

  1. Malformed placeholder names – e.g. {tuple_del}, {tuple_delordinal} instead of {tuple_delimiter}.
  2. Broken entity/relationship tuple endings – lines like ("entity"{tuple_delimiter} or ("relationship"{tuple_delimiter} often close incorrectly with }), }, or no ) at all.
  3. Unintended braces around non-allowed variables – e.g. {class} inside entity/relationship tuples, which breaks .format().

Here is the sanitisation function we use; we haven't seen any new errors since:

    import re

    def sanitize_prompt(template: str) -> str:
        """Fix common issues with .format()-based prompt templates."""
        ALLOWED_KEYS = {
            "input_text",
            "entity_types",
            "tuple_delimiter",
            "record_delimiter",
            "completion_delimiter",
        }

        # 1. Repair malformed placeholders like {tuple_del} or {tuple_delordinal}
        template = re.sub(r"\{tuple_de\w*\}?", "{tuple_delimiter}", template)
        template = re.sub(r"\{record_de\w*\}?", "{record_delimiter}", template)
        template = re.sub(r"\{completion_de\w*\}?", "{completion_delimiter}", template)

        def fix_entity_line(line: str) -> str:
            # 2. Repair entity/relationship tuple lines that close incorrectly
            entity_prefixes = ('("entity"{tuple_delimiter}', '("relationship"{tuple_delimiter}')
            if any(line.strip().startswith(prefix) for prefix in entity_prefixes):
                line = line.rstrip()
                if line.endswith('})'):
                    line = line[:-2] + ')'
                elif line.endswith('}'):
                    line = line[:-1] + ')'
                elif not line.endswith(')'):
                    line += ')'

                # 3. Strip braces around non-allowed variables, e.g. {class}
                allowed_pattern = r"\{(" + "|".join(ALLOWED_KEYS) + r")\}"

                # Step 1: protect allowed keys with a marker
                def protect(match: re.Match) -> str:
                    return f"__KEEP_{match.group(1)}__"

                line = re.sub(allowed_pattern, protect, line)

                # Step 2: strip all remaining { and }
                line = line.replace("{", "").replace("}", "")

                # Step 3: restore allowed keys back into {}
                line = re.sub(
                    r"__KEEP_(" + "|".join(re.escape(k) for k in ALLOWED_KEYS) + r")__",
                    lambda m: "{" + m.group(1) + "}",
                    line,
                )
            return line

        return "\n".join(fix_entity_line(line) for line in template.splitlines())

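For reference, we apply it to every tuned prompt in place before indexing (the prompts/ directory is assumed to match the config above):

    from pathlib import Path

    for prompt_path in Path("prompts").glob("*.txt"):
        text = prompt_path.read_text(encoding="utf-8")
        prompt_path.write_text(sanitize_prompt(text), encoding="utf-8")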

@natoverse — would appreciate your thoughts or a more robust fix here.

gona-sreelatha avatar Sep 25 '25 13:09 gona-sreelatha