[Bug]: Auto prompt tuning - ValueError: Single '}' encountered in format string
Do you need to file an issue?
- [x] I have searched the existing issues and this bug is not already filed.
- [x] My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
- [x] I believe this is a legitimate bug, not just a question. If this is a question, please use the Discussions area.
Describe the bug
Hello,
During auto prompt tuning, GraphRAG generates a knowledge graph output that is not valid: it contains more } characters than { characters.
Steps to reproduce
- Initialize GraphRAG
- Provide some paragraphs from this PDF: https://kpmg.com/kpmg-us/content/dam/kpmg/frv/pdf/2024/handbook-revenue-recognition-1224.pdf
- Run prompt tuning
You will see this error:
Traceback (most recent call last):
File ".../pypoetry/virtualenvs/service-vector-embedding-6NKDQ0ig-py3.11/lib/python3.11/site-packages/graphrag/index/operations/extract_graph/graph_extractor.py", line 127, in __call__
result = await self._process_document(text, prompt_variables)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".../pypoetry/virtualenvs/service-vector-embedding-6NKDQ0ig-py3.11/lib/python3.11/site-packages/graphrag/index/operations/extract_graph/graph_extractor.py", line 156, in _process_document
self._extraction_prompt.format(**{
ValueError: Single '}' encountered in format string
When I look at extract_graph.txt, I can see the issue. In the excerpt below there are 15 { but 19 }; note the extra } at the end of each entity record, for example in advance}).
("entity"{tuple_delimiter}HOSTING SERVICE FEES{tuple_delimiter}cost types{tuple_delimiter}Fees for hosting services, charged at $100 per month, paid in advance})
{record_delimiter}
("entity"{tuple_delimiter}REMAINING TERM OF THE HOSTING ARRANGEMENT{tuple_delimiter}lease arrangements{tuple_delimiter}The duration left on the hosting arrangement from the go-live date, which is 5 years})
{record_delimiter}
("entity"{tuple_delimiter}GO-LIVE DATE{tuple_delimiter}implementation details{tuple_delimiter}The date when the cloud-based solution became operational, which is January 1, Year 3})
{record_delimiter}
("entity"{tuple_delimiter}CAPITALIZED IMPLEMENTATION COSTS – PAYROLL MODULE{tuple_delimiter}cost types{tuple_delimiter}The costs incurred to implement the payroll processing module, amounting to $400, which are capitalized})
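The error can be reproduced outside GraphRAG. Below is a minimal sketch, assuming the prompt is filled in with Python's str.format as the traceback suggests; the shortened record text and the "<|>" delimiter value are illustrative and not taken from the actual run.

```python
# A shortened record in the style of the excerpt above; note the stray "}" before ")".
broken_record = (
    '("entity"{tuple_delimiter}GO-LIVE DATE{tuple_delimiter}'
    'implementation details{tuple_delimiter}operational since Year 3})'
)

# The stray brace makes the counts differ.
open_count = broken_record.count("{")
close_count = broken_record.count("}")

try:
    # "<|>" is an assumed delimiter value, used only for illustration.
    broken_record.format(tuple_delimiter="<|>")
    error_message = None
except ValueError as err:
    error_message = str(err)

# Removing the stray brace lets the same call succeed.
fixed_record = broken_record.replace("})", ")")
formatted = fixed_record.format(tuple_delimiter="<|>")

print(open_count, close_count)
print(error_message)
print(formatted)
```

Any unescaped } that does not close a {placeholder} triggers this error, which is why a single stray brace in the generated examples is enough to stop indexing.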
Expected Behavior
extract_graph.txt should contain an equal number of { and } characters and be free of errors.
GraphRAG Config Used
models:
  default_chat_model:
    type: openai_chat
    auth_type: api_key
    api_key: ${GRAPHRAG_API_KEY}
    model: gpt-4-turbo-preview
    model_supports_json: true
    concurrent_requests: 25
    async_mode: threaded
    retry_strategy: native
    max_retries: -1
    tokens_per_minute: 0
    requests_per_minute: 0
  default_embedding_model:
    type: openai_embedding
    auth_type: api_key
    api_key: ${GRAPHRAG_API_KEY}
    model: text-embedding-3-small
    model_supports_json: true
    concurrent_requests: 25
    async_mode: threaded
    retry_strategy: native
    max_retries: -1
    tokens_per_minute: 0
    requests_per_minute: 0
vector_store:
  default_vector_store:
    type: lancedb
    db_uri: output/lancedb
    container_name: default
    overwrite: true
embed_text:
  model_id: default_embedding_model
  vector_store_id: default_vector_store
input:
  type: file
  file_type: json
  base_dir: input
  text_column: page_content
  title_column: title
  metadata:
    - page
    - data_type
    - figures
chunks:
  size: 1200
  overlap: 100
  group_by_columns:
    - id
cache:
  type: file
  base_dir: cache
reporting:
  type: file
  base_dir: logs
output:
  type: file
  base_dir: output
extract_graph:
  model_id: default_chat_model
  prompt: prompts/extract_graph.txt
  entity_types:
    - organization
    - trademark
    - publication
    - standard
  max_gleanings: 1
summarize_descriptions:
  model_id: default_chat_model
  prompt: prompts/summarize_descriptions.txt
  max_length: 500
extract_graph_nlp:
  text_analyzer:
    extractor_type: regex_english
extract_claims:
  enabled: false
  model_id: default_chat_model
  prompt: prompts/extract_claims.txt
  description: Any claims or facts that could be relevant to information discovery.
  max_gleanings: 1
community_reports:
  model_id: default_chat_model
  graph_prompt: prompts/community_report_graph.txt
  text_prompt: prompts/community_report_text.txt
  max_length: 2000
  max_input_length: 8000
cluster_graph:
  max_cluster_size: 10
embed_graph:
  enabled: false
umap:
  enabled: false
snapshots:
  graphml: false
  embeddings: false
local_search:
  chat_model_id: default_chat_model
  embedding_model_id: default_embedding_model
  prompt: prompts/local_search_system_prompt.txt
global_search:
  chat_model_id: default_chat_model
  map_prompt: prompts/global_search_map_system_prompt.txt
  reduce_prompt: prompts/global_search_reduce_system_prompt.txt
  knowledge_prompt: prompts/global_search_knowledge_system_prompt.txt
drift_search:
  chat_model_id: default_chat_model
  embedding_model_id: default_embedding_model
  prompt: prompts/drift_search_system_prompt.txt
  reduce_prompt: prompts/drift_search_reduce_prompt.txt
basic_search:
  chat_model_id: default_chat_model
  embedding_model_id: default_embedding_model
  prompt: prompts/basic_search_system_prompt.txt
Logs and screenshots
Additional Information
- GraphRAG Version: 2.1.0
- Operating System: Linux
- Python Version: 3.11.2
- Related Issues:
I confirm this is a bug: after I removed the extra } characters from extract_graph.txt, indexing proceeded without errors.
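For anyone applying the same manual fix, here is a rough sketch of a checker that lists the lines str.format cannot parse. The helper name find_unparsable_lines and the sample text are hypothetical, and the line-by-line scan assumes no placeholder spans more than one line.

```python
from string import Formatter


def find_unparsable_lines(template: str) -> list[tuple[int, str]]:
    """Return (line_number, error_message) for lines that str.format cannot parse."""
    problems = []
    for number, line in enumerate(template.splitlines(), start=1):
        try:
            # Formatter().parse is lazy, so consume it to surface brace errors.
            list(Formatter().parse(line))
        except ValueError as err:
            problems.append((number, str(err)))
    return problems


# Hypothetical sample in the style of the generated prompt; line 1 has a stray "}".
sample_prompt = (
    '("entity"{tuple_delimiter}A{tuple_delimiter}type{tuple_delimiter}first description})\n'
    '{record_delimiter}\n'
    '("entity"{tuple_delimiter}B{tuple_delimiter}type{tuple_delimiter}second description)\n'
)

problems = find_unparsable_lines(sample_prompt)
for number, message in problems:
    print(f"line {number}: {message}")
```

To check a real prompt, the file contents could be read with pathlib.Path("prompts/extract_graph.txt").read_text() and passed to the helper.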
Please try again with version 2.2.1, which updates the prompt template to fix the format call removing too many braces.
This issue has been marked stale due to inactivity after repo maintainer or community member responses that request more information or suggest a solution. It will be closed after five additional days.
I am facing another issue upon updating:
Traceback (most recent call last):
File "/home/myuser/.cache/pypoetry/virtualenvs/service-vector-embedding-6NKDQ0ig-py3.11/lib/python3.11/site-packages/graphrag/index/operations/summarize_communities/community_reports_extractor.py", line 76, in __call__
prompt = self._extraction_prompt.format(**{
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyError: '\n "title"'
17:53:13,739 graphrag.callbacks.file_workflow_callbacks INFO Community Report Extraction Error details=None
17:53:13,739 graphrag.index.operations.summarize_communities.strategies WARNING No report found for community: 260.0
17:53:13,739 graphrag.index.operations.summarize_communities.community_reports_extractor ERROR error generating community report
Traceback (most recent call last):
File "/home/myuser/.cache/pypoetry/virtualenvs/service-vector-embedding-6NKDQ0ig-py3.11/lib/python3.11/site-packages/graphrag/index/operations/summarize_communities/community_reports_extractor.py", line 76, in __call__
prompt = self._extraction_prompt.format(**{
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyError: '\n "title"'
17:53:13,740 graphrag.callbacks.file_workflow_callbacks INFO Community Report Extraction Error details=None
17:53:13,740 graphrag.index.operations.summarize_communities.strategies WARNING No report found for community: 261.0
17:53:13,740 graphrag.index.operations.summarize_communities.community_reports_extractor ERROR error generating community report
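A KeyError with a key like '\n "title"' is what str.format raises when a template contains a literal JSON object written with single braces: everything between { and the first : is treated as a field name. The sketch below illustrates that behaviour. It assumes this is what happens in the format call in the traceback; the thread does not confirm it, and the prompt fragments are hypothetical.

```python
# Hypothetical prompt fragments; only the brace style matters here.
single_braces = (
    'Return JSON:\n'
    '{\n'
    '    "title": <report_title>,\n'
    '    "summary": <executive_summary>\n'
    '}\n'
    'Text: {input_text}'
)
double_braces = (
    'Return JSON:\n'
    '{{\n'
    '    "title": "<report_title>",\n'
    '    "summary": "<executive_summary>"\n'
    '}}\n'
    'Text: {input_text}'
)

try:
    single_braces.format(input_text="some text")
    missing_key = None
except KeyError as err:
    # The text between "{" and the first ":" is looked up as a field name.
    missing_key = err.args[0]

# Doubled braces are emitted as literal "{" and "}", so only {input_text} is substituted.
rendered = double_braces.format(input_text="some text")

print(repr(missing_key))
print(rendered)
```

Under that assumption, a prompt file whose JSON example uses { and } would fail in this way, while one that uses {{ and }} would format cleanly.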
The contents of the related prompt files (I show only part of each):
community_report_graph.txt
parsed by json.loads.
{
"title": <report_title>,
"summary": <executive_summary>,
"rating": <impact_severity_rating>,
"rating_explanation": <rating_explanation>,
"findings": [
community_report_text.txt
parsed by json.loads.
{{
"title": "<report_title>",
"summary": "<executive_summary>",
"rating": <importance_rating>,
"rating_explanation": "<rating_explanation>",
"findings": [{{"summary":"<insight_1_summary>", "explanation":
This issue has been closed after being marked as stale for five days. Please reopen if needed.