[Bug]: `KeyError: 'reports'` when doing Local Search
### Do you need to file an issue?
- [X] I have searched the existing issues and this bug is not already filed.
- [ ] My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
- [X] I believe this is a legitimate bug, not just a question. If this is a question, please use the Discussions area.
### Describe the bug

I am indexing and querying with the graphrag library against an OpenAI-compatible vLLM server, with tools, functions, and embeddings. Everything works as expected when running the local and global search notebooks against the constructed index. However, when trying to get the report in the local search notebook, I get the following error:
```
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[17], line 1
----> 1 result.context_data["reports"].head()

KeyError: 'reports'
```
### Steps to reproduce

No response
### Expected Behavior

To get a report similar to the one I get in the global notebook.
### GraphRAG Config Used

```yaml
encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat # or azure_openai_chat
  model: text-assistant
  model_supports_json: true # recommended if this is available for your model.
  # max_tokens: 4000
  # request_timeout: 180.0
  # api_base: https://<instance>.openai.azure.com
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>
  # tokens_per_minute: 150_000 # set a leaky bucket throttle
  # requests_per_minute: 10_000 # set a leaky bucket throttle
  # max_retries: 10
  # max_retry_wait: 10.0
  # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  # concurrent_requests: 25 # the number of parallel inflight requests that may be made
  # temperature: 0 # temperature for sampling
  # top_p: 1 # top-p sampling
  # n: 1 # Number of completions to generate

parallelization:
  stagger: 0.3
  # num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: text-embedding-3-small
    # api_base: https://<instance>.openai.azure.com
    # api_version: 2024-02-15-preview
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    # max_retries: 10
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made
    # batch_size: 16 # the number of documents to send in a single request
    # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
    # target: required # or optional

chunks:
  size: 1200
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

cache:
  type: file # or blob
  base_dir: "cache"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/reports"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

entity_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization, person, geo, event]
  max_gleanings: 1

summarize_descriptions:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  # enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 1

community_reports:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
  # num_walks: 10
  # walk_length: 40
  # window_size: 2
  # iterations: 3
  # random_seed: 597832

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  raw_entities: false
  top_level_nodes: false

local_search:
  # text_unit_prop: 0.5
  # community_prop: 0.1
  # conversation_history_max_turns: 5
  # top_k_mapped_entities: 10
  # top_k_relationships: 10
  # llm_temperature: 0 # temperature for sampling
  # llm_top_p: 1 # top-p sampling
  # llm_n: 1 # Number of completions to generate
  # max_tokens: 12000

global_search:
  # llm_temperature: 0 # temperature for sampling
  # llm_top_p: 1 # top-p sampling
  # llm_n: 1 # Number of completions to generate
  # max_tokens: 12000
  # data_max_tokens: 12000
  # map_max_tokens: 1000
  # reduce_max_tokens: 2000
  # concurrency: 32
```
### Logs and screenshots

No response
### Additional Information
- GraphRAG Version: 0.3.0
- Operating System: Fedora 40 KDE
- Python Version: 3.11.9
If you look at the logs in the output folder, you will find that there is an error when the community report is generated at the end. I also encountered this problem.
Nope, no error logs on anything.
You can see this file: `graphrag-0.2.2/output/20240812-090029/reports/indexing-engine.log`
I'm using 0.3.0, but yes, no errors!
@l4b4r4b4b4 Because `result.context_data` from the local search only contains the `entities`, `relationships`, and `sources` fields, while the global search includes `result.context_data['reports']`.
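A quick way to confirm this in the notebooks (a minimal sketch; `local_search_engine` and `global_search_engine` stand for the engines the example notebooks already construct, and the query string is arbitrary):

```python
# Compare which context tables each search mode actually returns.
local_result = await local_search_engine.asearch("What are the main topics?")
global_result = await global_search_engine.asearch("What are the main topics?")

print(sorted(local_result.context_data.keys()))   # e.g. ['entities', 'relationships', 'sources']
print(sorted(global_result.context_data.keys()))  # includes 'reports'
```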
OK, so it's simply a mistake that reports are referenced in the local Jupyter notebook?
Yes, you just need to comment out that line, and local search will run normally.
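Concretely, rather than deleting the cell, you could guard the lookup so the same code works for both search modes (a sketch; `result` is whatever your search call returned):

```python
# Only read the "reports" table if this search mode actually produced one.
reports = result.context_data.get("reports")
if reports is not None:
    display(reports.head())  # display() is available by default in Jupyter
else:
    print("No 'reports' table; available context tables:",
          list(result.context_data.keys()))
```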
You can comment out the community reports to get unblocked, but we do expect them for local search results. Some things to check:
- Does the notebook run fine with the example parquets?
- Does your `create_final_community_reports.parquet` look reasonable compared to the example? (There is a quick check sketched below this list.)
- Does the CLI query work when pointed at your index, or is this issue clearly isolated to the example notebook?
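For the parquet check, something along these lines works (the path is an assumption based on the `storage.base_dir` pattern in the config above; substitute your run's timestamp):

```python
import pandas as pd

# Load the community reports artifact from the indexing run and eyeball it.
reports = pd.read_parquet(
    "output/<timestamp>/artifacts/create_final_community_reports.parquet"
)
print(f"{len(reports)} community reports")
print(reports.columns.tolist())
print(reports.head())
```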
I just ran the local search example notebook again with the example data and it worked as expected.