[Issue]: KeyError: "title" when generating community reports using build_index()
Do you need to file an issue?
- [x] I have searched the existing issues and this bug is not already filed.
- [x] My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
- [x] I believe this is a legitimate bug, not just a question. If this is a question, please use the Discussions area.
Describe the issue
When running `build_index()` via the Python API (GraphRAG v2.2.0, Python 3.11.0), the process fails while generating community reports with a KeyError on the "title" key during prompt template formatting.
Steps to reproduce
- Set up GraphRAG v2.2.0 with Python 3.11.
- Run the `build_index()` function via the Python API (see the sketch after this list).
- Observe the error when community reports are being created.
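For reference, this is roughly how we invoke it (a minimal sketch of our setup; `load_config`, the project root path, and the result handling may differ slightly between versions):

```python
import asyncio
from pathlib import Path

from graphrag.api import build_index
from graphrag.config.load_config import load_config

async def main() -> None:
    # Load settings.yaml from the project root (the config shown below).
    config = load_config(Path("./my_project"))
    results = await build_index(config=config)
    for result in results:
        print(result.workflow, result.errors)

asyncio.run(main())
```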
GraphRAG Config Used
```yaml
### This config file contains required core defaults that must be set, along with a handful of common optional settings.
### For a full list of available settings, see https://microsoft.github.io/graphrag/config/yaml/

### LLM settings ###
## There are a number of settings to tune the threading and token limits for LLM calls - check the docs.

models:
  default_chat_model:
    type: azure_openai_chat # or openai_chat
    api_base: https://xxxxx.com
    api_version: 2023-07-01-preview
    auth_type: api_key # or azure_managed_identity
    api_key: ${GRAPHRAG_API_KEY} # set this in the generated .env file
    # audience: "https://cognitiveservices.azure.com/.default"
    # organization: <organization_id>
    model: gpt-4-turbo-preview
    deployment_name: gpt-4o-mini-2024-07-18 # deployment name is modal name
    # encoding_model: cl100k_base # automatically set by tiktoken if left undefined
    model_supports_json: true # recommended if this is available for your model.
    concurrent_requests: 25 # max number of simultaneous LLM requests allowed
    async_mode: threaded # or asyncio
    retry_strategy: native
    max_retries: -1 # set to -1 for dynamic retry logic (most optimal setting based on server response)
    tokens_per_minute: 0 # set to 0 to disable rate limiting
    requests_per_minute: 0 # set to 0 to disable rate limiting
  default_embedding_model:
    type: azure_openai_embedding # or openai_embedding
    api_base: https://xxxxx.com
    api_version: 2023-07-01-preview
    auth_type: api_key # or azure_managed_identity
    api_key: ${GRAPHRAG_API_KEY}
    # audience: "https://cognitiveservices.azure.com/.default"
    # organization: <organization_id>
    model: text-embedding-3-small
    deployment_name: text-embedding-ada-002
    # encoding_model: cl100k_base # automatically set by tiktoken if left undefined
    model_supports_json: true # recommended if this is available for your model.
    concurrent_requests: 25 # max number of simultaneous LLM requests allowed
    async_mode: threaded # or asyncio
    retry_strategy: native
    max_retries: -1 # set to -1 for dynamic retry logic (most optimal setting based on server response)
    tokens_per_minute: 0 # set to 0 to disable rate limiting
    requests_per_minute: 0 # set to 0 to disable rate limiting

vector_store:
  default_vector_store:
    type: lancedb
    db_uri: output\lancedb
    container_name: default
    overwrite: True

embed_text:
  model_id: default_embedding_model
  vector_store_id: default_vector_store

### Input settings ###

input:
  type: file # or blob
  file_type: text # [csv, text, json]
  base_dir: "input"

chunks:
  size: 1200
  overlap: 100
  group_by_columns: [id]

### Output settings ###
## If blob storage is specified in the following four sections,
## connection_string and container_name must be provided

cache:
  type: file # [file, blob, cosmosdb]
  base_dir: "cache"

reporting:
  type: file # [file, blob, cosmosdb]
  base_dir: "logs"

output:
  type: file # [file, blob, cosmosdb]
  base_dir: "output"

### Workflow settings ###

extract_graph:
  model_id: default_chat_model
  prompt: "prompts/extract_graph.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 1

summarize_descriptions:
  model_id: default_chat_model
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

extract_graph_nlp:
  text_analyzer:
    extractor_type: regex_english # [regex_english, syntactic_parser, cfg]

extract_claims:
  enabled: false
  model_id: default_chat_model
  prompt: "prompts/extract_claims.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 1

community_reports:
  model_id: default_chat_model
  graph_prompt: "prompts/community_report_graph.txt"
  text_prompt: "prompts/community_report_text.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes (embed_graph must also be enabled)

snapshots:
  graphml: true
  embeddings: false

### Query settings ###
## The prompt locations are required here, but each search method has a number of optional knobs that can be tuned.
## See the config docs: https://microsoft.github.io/graphrag/config/yaml/#query

local_search:
  chat_model_id: default_chat_model
  embedding_model_id: default_embedding_model
  prompt: "prompts/local_search_system_prompt.txt"

global_search:
  chat_model_id: default_chat_model
  map_prompt: "prompts/global_search_map_system_prompt.txt"
  reduce_prompt: "prompts/global_search_reduce_system_prompt.txt"
  knowledge_prompt: "prompts/global_search_knowledge_system_prompt.txt"

drift_search:
  chat_model_id: default_chat_model
  embedding_model_id: default_embedding_model
  prompt: "prompts/drift_search_system_prompt.txt"
  reduce_prompt: "prompts/drift_search_reduce_prompt.txt"

basic_search:
  chat_model_id: default_chat_model
  embedding_model_id: default_embedding_model
  prompt: "prompts/basic_search_system_prompt.txt"
```
Logs and screenshots
Additional Information
- GraphRAG Version: v2.2.0
- Operating System: Windows
- Python Version: 3.11.0
- Related Issues:
Any resolution for this?
I have seen it failing in "prompts/community_report_graph.txt".
Count the number of `{` and `}` in `extract_graph.txt` (or whichever prompt is failing) to see whether they match, because `.format()` requires every `{variable}` placeholder to be well-formed.
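A quick way to do that check with just the standard library (`list_placeholders` is a hypothetical helper name, and the file path is whichever prompt is failing):

```python
from string import Formatter

def list_placeholders(path: str) -> set[str]:
    """Return every placeholder .format() will try to fill in a prompt file."""
    with open(path, encoding="utf-8") as f:
        template = f.read()
    # Formatter().parse yields (literal_text, field_name, format_spec, conversion);
    # field_name is None for literal-only chunks. Note that unbalanced braces will
    # raise a ValueError here, which is the same class of problem.
    return {field for _, field, _, _ in Formatter().parse(template) if field is not None}

# Anything beyond the variables GraphRAG injects (e.g. input_text) will cause a KeyError.
print(list_placeholders("prompts/community_report_graph.txt"))
```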
We currently run a sanitisation step on prompts after prompt tuning, as a stop-gap arrangement, before storing them for indexing. These errors are more common when working with smaller models like mini, which tend to introduce formatting errors. The most common issues we've seen are:
- Malformed placeholder names – e.g. {tuple_del}, {tuple_delordinal} instead of {tuple_delimiter}.
- Broken entity/relationship tuple endings – lines like ("entity"{tuple_delimiter} or ("relationship"{tuple_delimiter} often close incorrectly with }), }, or no ) at all.
- Unintended braces around non-allowed variables – e.g. {class} inside entity/relationship tuples, which breaks .format() (a short demonstration of this failure mode follows this list).
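To make the failure mode concrete, here is a minimal illustration (not the actual GraphRAG template, just the mechanism): any single-brace fragment in a tuned prompt is treated as a placeholder by `.format()`, and an unrecognised one raises a KeyError like the one in this issue.

```python
# Illustrative only: a tuned community report prompt where the JSON output example
# lost its escaped braces ({{ ... }} became { ... }).
template = (
    'Return the report as JSON: {"title": <report title>, "summary": <summary>}\n\n'
    "Text:\n{input_text}"
)

# GraphRAG only fills the documented variables (here, input_text), so formatting fails:
template.format(input_text="...")  # KeyError: '"title"'
```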
Here is the sanitisation function we use; we haven't seen any new errors since adding it:

```python
import re  # required by the substitutions below

# In our pipeline this is a method on a prompt-tuning helper class, hence `self`.
def sanitize_prompt(self, template: str) -> str:
    """
    Fixes common issues with .format()-based prompt templates
    """
    ALLOWED_KEYS = {"input_text", "entity_types", "tuple_delimiter", "record_delimiter", "completion_delimiter"}

    # 1. Fix malformed placeholders like {tuple_delordinal}
    template = re.sub(r"\{tuple_de\w*\}?", "{tuple_delimiter}", template)
    template = re.sub(r"\{record_de\w*\}?", "{record_delimiter}", template)
    template = re.sub(r"\{completion_de\w*\}?", "{completion_delimiter}", template)

    def fix_entity_line(line: str) -> str:
        # 2. Repair broken entity/relationship tuple endings (`})`, `}`, or a missing `)`)
        entity_prefixes = ('("entity"{tuple_delimiter}', '("relationship"{tuple_delimiter}')
        if any(line.strip().startswith(prefix) for prefix in entity_prefixes):
            line = line.rstrip()
            if line.endswith('})'):
                line = line[:-2] + ')'
            elif line.endswith('}'):
                line = line[:-1] + ')'
            elif not line.endswith(')'):
                line += ')'

        # Allowed placeholders pattern
        allowed_pattern = r"\{(" + "|".join(ALLOWED_KEYS) + r")\}"

        # Step 1: protect allowed keys with a marker
        def protect(match):
            return f"__KEEP_{match.group(1)}__"

        line = re.sub(allowed_pattern, protect, line)

        # Step 2: strip all remaining { and }
        line = line.replace("{", "").replace("}", "")

        # Step 3: restore allowed keys back into {}
        line = re.sub(
            r"__KEEP_(" + "|".join(re.escape(k) for k in ALLOWED_KEYS) + r")__",
            lambda m: "{" + m.group(1) + "}",
            line,
        )
        return line

    lines = template.splitlines()
    lines = [fix_entity_line(line) for line in lines]
    template = "\n".join(lines)
    return template
```
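For completeness, this is roughly how we apply it before indexing (a sketch; the prompt directory reflects our layout, and it assumes `sanitize_prompt` has been lifted out of its class, i.e. the `self` parameter dropped):

```python
from pathlib import Path

def sanitize_prompts(prompt_dir: str = "prompts") -> None:
    # Stop-gap: rewrite each tuned prompt in place before calling build_index().
    for prompt_file in Path(prompt_dir).glob("*.txt"):
        template = prompt_file.read_text(encoding="utf-8")
        prompt_file.write_text(sanitize_prompt(template), encoding="utf-8")
```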
@natoverse — would appreciate your thoughts or a more robust fix here.