
[Bug]: json.decoder.JSONDecodeError when generating Community reports

Open Amitabh-Priyadarshi-Bayer opened this issue 1 year ago • 0 comments

Describe the bug

Community report generation fails with a json.decoder.JSONDecodeError because the LLM output starts with doubled braces, for example json={{ "title": "Product Team: Mansz and Jrman" {{. I tried to fix the system message for the community report, but the error persists, and when I looked into the report, it shows that the community_report prompt is null.

In settings.yaml, the prompt filename for the community report is "prompts/community_report.txt". I changed the double braces '{{' to single '{' in community_report.txt, but the generated JSON still contains double '{{'.

community_report:
  prompt: "prompts/community_report.txt"
  max_length: 4000
  max_input_length: 12000
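The parse failure itself can be reproduced outside GraphRAG. The following is a minimal sketch, assuming the doubled braces in the prompt file are `str.format`-style escapes; the issue does not confirm how GraphRAG substitutes values into the template. `template` is a hypothetical, shortened stand-in for the real prompt text.

```python
import json

# Hypothetical, shortened stand-in for the example-output part of the prompt.
template = '{{ "title": "{title}", "rating": {rating} }}'

# str.format() collapses the escaped braces, so the result is valid JSON.
formatted = template.format(title="Product Team", rating=7.0)
report = json.loads(formatted)

# The same text with the doubled braces left in place is not valid JSON.
doubled = '{{ "title": "Product Team", "rating": 7.0 }}'
try:
    json.loads(doubled)
    error = None
except json.JSONDecodeError as exc:
    error = exc

print(formatted)
print(report["rating"])
print(type(error).__name__, "at char", error.pos)
```

The decoder stops at character index 1, the second opening brace, which matches the "char 1" position reported in the traceback in the logs.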

Also, in indexing-engine.log, the "community_reports" section shows "prompt": null instead of the filename 'prompts/community_report.txt' that is set in settings.yaml:

"community_reports": 
      "async_mode": "threaded",
      "prompt": null,
      "max_length": 2000,
      "max_input_length": 8000,
      "strategy": null
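Comparing the two excerpts, the section in settings.yaml is named `community_report`, while the log prints it as `community_reports`, and the logged limits (2000 and 8000) differ from the configured ones (4000 and 12000). That would be consistent with the section not being read at all, although this is an inference from the excerpts and not something confirmed in this issue. A minimal sketch of a check for near-miss section names follows; `KNOWN_SECTIONS` is copied from the top-level keys of the logged configuration, and `find_suspect_keys` is a hypothetical helper, not part of GraphRAG.

```python
import difflib

# Top-level section names as they appear in the logged configuration.
KNOWN_SECTIONS = [
    "llm", "parallelization", "async_mode", "root_dir", "reporting",
    "storage", "cache", "input", "embed_graph", "embeddings", "chunks",
    "snapshots", "entity_extraction", "summarize_descriptions",
    "community_reports", "claim_extraction", "cluster_graph", "umap",
    "local_search", "global_search", "encoding_model", "skip_workflows",
]


def find_suspect_keys(user_keys, known=KNOWN_SECTIONS):
    """Map each unrecognised key to the closest known section name, if any."""
    suspects = {}
    for key in user_keys:
        if key in known:
            continue
        close = difflib.get_close_matches(key, known, n=1, cutoff=0.8)
        suspects[key] = close[0] if close else None
    return suspects


# A few of the section names used in the settings.yaml from this issue.
user_keys = ["llm", "chunks", "community_report", "claim_extraction"]
suspects = find_suspect_keys(user_keys)
print(suspects)
```

Here the only flagged key is `community_report`, with `community_reports` as the suggested spelling.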

Steps to reproduce

No response

Expected Behavior

No response

GraphRAG Config Used

encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: azure_openai_chat # or azure_openai_chat
  model: gpt-4-32k (0613)
  model_supports_json: false
  # max_tokens: 4000
  # request_timeout: 180.0
  api_base: -removed because of security purpose
  api_version: '2023-05-15'
  # organization: <organization_id>
  deployment_name: gpt-4-32k
  # tokens_per_minute: 150_000 # set a leaky bucket throttle
  # requests_per_minute: 10_000 # set a leaky bucket throttle
  # max_retries: 10
  # max_retry_wait: 10.0
  # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  # concurrent_requests: 25 # the number of parallel inflight requests that may be made

parallelization:
  stagger: 0.3
  # num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  # parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: azure_openai_embedding
    model: text-embedding-ada-002
    api_base: removed because of security purpose
    api_version: '2023-05-15'
    # organization: <organization_id>
    deployment_name: embedding
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    # max_retries: 10
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made
    # batch_size: 16 # the number of documents to send in a single request
    # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
    # target: required # or optional

chunks:
  size: 300
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\.txt$"

cache:
  type: file # or blob
  base_dir: "cache"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/reports"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

entity_extraction:
  # llm: override the global llm settings for this task
  # parallelization: override the global parallelization settings for this task
  # async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 0

summarize_descriptions:
  # llm: override the global llm settings for this task
  # parallelization: override the global parallelization settings for this task
  # async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  # llm: override the global llm settings for this task
  # parallelization: override the global parallelization settings for this task
  # async_mode: override the global async_mode settings for this task
  enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 0

community_report:
  # llm: override the global llm settings for this task
  # parallelization: override the global parallelization settings for this task
  # async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 4000
  max_input_length: 12000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
  # num_walks: 10
  # walk_length: 40
  # window_size: 2
  # iterations: 3
  # random_seed: 597832

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  raw_entities: false
  top_level_nodes: false

local_search:
  # text_unit_prop: 0.5
  # community_prop: 0.1
  # conversation_history_max_turns: 5
  # top_k_mapped_entities: 10
  # top_k_relationships: 10
  # max_tokens: 12000

global_search:
  # max_tokens: 12000
  # data_max_tokens: 12000
  # map_max_tokens: 1000
  # reduce_max_tokens: 2000
  # concurrency: 32

Logs and screenshots

20:19:16,982 graphrag.config.read_dotenv INFO Loading pipeline .env file 20:19:16,988 graphrag.index.cli INFO using default configuration: { "llm": { "api_key": "REDACTED, length 32", "type": "azure_openai_chat", "model": "gpt-4-32k (0613)", "max_tokens": 4000, "request_timeout": 180.0, "api_base": "removed because of security purpose", "api_version": "2023-05-15", "proxy": null, "cognitive_services_endpoint": null, "deployment_name": "gpt-4-32k", "model_supports_json": false, "tokens_per_minute": 0, "requests_per_minute": 0, "max_retries": 10, "max_retry_wait": 10.0, "sleep_on_rate_limit_recommendation": true, "concurrent_requests": 25 }, "parallelization": { "stagger": 0.3, "num_threads": 50 }, "async_mode": "threaded", "root_dir": "GraphRAG/", "reporting": { "type": "file", "base_dir": "output/${timestamp}/reports", "storage_account_blob_url": null }, "storage": { "type": "file", "base_dir": "output/${timestamp}/artifacts", "storage_account_blob_url": null }, "cache": { "type": "file", "base_dir": "cache", "storage_account_blob_url": null }, "input": { "type": "file", "file_type": "text", "base_dir": "input", "storage_account_blob_url": null, "encoding": "utf-8", "file_pattern": ".*\.txt$", "file_filter": null, "source_column": null, "timestamp_column": null, "timestamp_format": null, "text_column": "text", "title_column": null, "document_attribute_columns": [] }, "embed_graph": { "enabled": false, "num_walks": 10, "walk_length": 40, "window_size": 2, "iterations": 3, "random_seed": 597832, "strategy": null }, "embeddings": { "llm": { "api_key": "REDACTED, length 32", "type": "azure_openai_embedding", "model": "text-embedding-ada-002", "max_tokens": 4000, "request_timeout": 180.0, "api_base": "removed because of security purpose", "api_version": "2023-05-15", "proxy": null, "cognitive_services_endpoint": null, "deployment_name": "embedding", "model_supports_json": null, "tokens_per_minute": 0, "requests_per_minute": 0, "max_retries": 10, "max_retry_wait": 10.0, 
"sleep_on_rate_limit_recommendation": true, "concurrent_requests": 25 }, "parallelization": { "stagger": 0.3, "num_threads": 50 }, "async_mode": "threaded", "batch_size": 16, "batch_max_tokens": 8191, "target": "required", "skip": [], "vector_store": null, "strategy": null }, "chunks": { "size": 300, "overlap": 100, "group_by_columns": [ "id" ], "strategy": null }, "snapshots": { "graphml": false, "raw_entities": false, "top_level_nodes": false }, "entity_extraction": { "llm": { "api_key": "REDACTED, length 32", "type": "azure_openai_chat", "model": "gpt-4-32k (0613)", "max_tokens": 4000, "request_timeout": 180.0, "api_base": "removed because of security purpose", "api_version": "2023-05-15", "proxy": null, "cognitive_services_endpoint": null, "deployment_name": "gpt-4-32k", "model_supports_json": false, "tokens_per_minute": 0, "requests_per_minute": 0, "max_retries": 10, "max_retry_wait": 10.0, "sleep_on_rate_limit_recommendation": true, "concurrent_requests": 25 }, "parallelization": { "stagger": 0.3, "num_threads": 50 }, "async_mode": "threaded", "prompt": "prompts/entity_extraction.txt", "entity_types": [ "organization", "person", "geo", "event" ], "max_gleanings": 0, "strategy": null }, "summarize_descriptions": { "llm": { "api_key": "REDACTED, length 32", "type": "azure_openai_chat", "model": "gpt-4-32k (0613)", "max_tokens": 4000, "request_timeout": 180.0, "api_base": "removed because of security purpose", "api_version": "2023-05-15", "proxy": null, "cognitive_services_endpoint": null, "deployment_name": "gpt-4-32k", "model_supports_json": false, "tokens_per_minute": 0, "requests_per_minute": 0, "max_retries": 10, "max_retry_wait": 10.0, "sleep_on_rate_limit_recommendation": true, "concurrent_requests": 25 }, "parallelization": { "stagger": 0.3, "num_threads": 50 }, "async_mode": "threaded", "prompt": "prompts/summarize_descriptions.txt", "max_length": 500, "strategy": null }, "community_reports": { "llm": { "api_key": "REDACTED, length 32", "type": 
"azure_openai_chat", "model": "gpt-4-32k (0613)", "max_tokens": 4000, "request_timeout": 180.0, "api_base": "removed because of security purpose", "api_version": "2023-05-15", "proxy": null, "cognitive_services_endpoint": null, "deployment_name": "gpt-4-32k", "model_supports_json": false, "tokens_per_minute": 0, "requests_per_minute": 0, "max_retries": 10, "max_retry_wait": 10.0, "sleep_on_rate_limit_recommendation": true, "concurrent_requests": 25 }, "parallelization": { "stagger": 0.3, "num_threads": 50 }, "async_mode": "threaded", "prompt": null, "max_length": 2000, "max_input_length": 8000, "strategy": null }, "claim_extraction": { "llm": { "api_key": "REDACTED, length 32", "type": "azure_openai_chat", "model": "gpt-4-32k (0613)", "max_tokens": 4000, "request_timeout": 180.0, "api_base": "removed because of security purpose", "api_version": "2023-05-15", "proxy": null, "cognitive_services_endpoint": null, "deployment_name": "gpt-4-32k", "model_supports_json": false, "tokens_per_minute": 0, "requests_per_minute": 0, "max_retries": 10, "max_retry_wait": 10.0, "sleep_on_rate_limit_recommendation": true, "concurrent_requests": 25 }, "parallelization": { "stagger": 0.3, "num_threads": 50 }, "async_mode": "threaded", "enabled": true, "prompt": "prompts/claim_extraction.txt", "description": "Any claims or facts that could be relevant to information discovery.", "max_gleanings": 0, "strategy": null }, "cluster_graph": { "max_cluster_size": 10, "strategy": null }, "umap": { "enabled": false }, "local_search": { "text_unit_prop": 0.5, "community_prop": 0.1, "conversation_history_max_turns": 5, "top_k_entities": 10, "top_k_relationships": 10, "max_tokens": 12000, "llm_max_tokens": 2000 }, "global_search": { "max_tokens": 12000, "data_max_tokens": 12000, "map_max_tokens": 1000, "reduce_max_tokens": 2000, "concurrency": 32 }, "encoding_model": "cl100k_base", "skip_workflows": [] }

20:20:39,273 graphrag.index.reporting.file_workflow_callbacks INFO Community Report Extraction Error details=None
20:20:39,273 graphrag.index.verbs.graph.report.strategies.graph_intelligence.run_graph_intelligence WARNING No report found for community: 0
20:20:39,346 httpx INFO HTTP Request: POST --"
20:20:39,347 graphrag.llm.openai.utils ERROR error loading json, json={{ "title": "Application Support Team and Controlled Environment", "summary": "The community revolves around the Application Support Team, which provides assistance to users experiencing problems with the application. The team interacts with various features of the application, including the Controlled Environment, Admin Tab, In-app Support Ticket System, Statuses File, and Summary View.", "rating": 7.0, "rating_explanation": "The impact severity rating is high due to the critical role of the Application Support Team in ensuring smooth operation of the application.", "findings": [ {{ "summary": "Functionality of the Summary View", "explanation": "The Summary View is a customizable section of the application where users can adjust the display of information. The Application Support Team can provide assistance for customizing the Summary View, indicating its complexity and potential for user customization.
[Data: Entities (26), Relationships (37)]" }} ]}}
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/graphrag/llm/openai/utils.py", line 93, in try_parse_json_object
    result = json.loads(input)
  File "/opt/conda/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/opt/conda/lib/python3.10/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/opt/conda/lib/python3.10/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)
20:20:39,349 graphrag.llm.openai.openai_chat_llm WARNING error parsing llm json, retrying
20:20:39,978 httpx INFO HTTP Request: POST https://agvisorapimtest.azure-api.net/openapi-test/openai/deployments/gpt-4-32k/chat/completions?api-version=2023-05-15 "HTTP/1.1 200 OK"
20:20:39,980 graphrag.llm.openai.utils ERROR error loading json, json={output_text}
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/graphrag/llm/openai/openai_chat_llm.py", line 124, in _manual_json
    json_output = try_parse_json_object(output)
  File "/opt/conda/lib/python3.10/site-packages/graphrag/llm/openai/utils.py", line 93, in try_parse_json_object
    result = json.loads(input)
  File "/opt/conda/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/opt/conda/lib/python3.10/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/opt/conda/lib/python3.10/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)

During handling of the above exception, another exception occurred:
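As a stopgap while the prompt problem is open, the doubled braces in the model output could be collapsed before parsing. This is only a sketch of one possible workaround: `parse_json_lenient` is a hypothetical helper and is not how GraphRAG's `try_parse_json_object` behaves. It parses strictly first, because valid nested JSON can legitimately end in `}}`.

```python
import json


def parse_json_lenient(text):
    """Parse strictly first; if that fails, retry with doubled braces collapsed."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        collapsed = text.replace("{{", "{").replace("}}", "}")
        # Still raises JSONDecodeError if the collapsed text is not valid JSON.
        return json.loads(collapsed)


# Shortened sample in the same shape as the logged LLM output.
sample = '{{ "title": "T", "rating": 7.0, "findings": [ {{ "summary": "S" }} ]}}'
parsed = parse_json_lenient(sample)
print(parsed["findings"][0]["summary"])

# Valid nested JSON is returned by the strict path and left untouched.
nested = parse_json_lenient('{"a": {"b": 1}}')
print(nested)
```

The fallback also rewrites any `{{` or `}}` that appears inside string values, so it is only reasonable when the report text is not expected to contain literal doubled braces.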

Additional Information

  • GraphRAG Version: 0.1.1
  • Operating System: AWS sagemaker distribution 1.9
  • Python Version: 3.10.14
  • Related Issues: