[Fatal Bug]: Incorrect deduplication of entities with same title but different type
### Do you need to file an issue?
- [x] I have searched the existing issues and this bug is not already filed.
- [x] My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
- [x] I believe this is a legitimate bug, not just a question. If this is a question, please use the Discussions area.
### Describe the bug
In GraphRAG's `finalize_entities` method, there is an issue with how duplicate entities are handled. The pipeline groups entities by `title` and `type` and merges their descriptions into a list, which is then passed to the model for summarization. However, the drop-duplicates step in `finalize_entities` does not account for nodes that share a `title` but have different `type`s. As a result, when duplicates are removed, only the first node with a given `title` is kept and the other nodes are discarded.
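The behavior can be illustrated with a minimal pandas sketch (this is not GraphRAG's actual code; the DataFrame and entity values are made up for demonstration): deduplicating on `title` alone silently drops a node that shares a title with another entity but has a different type.

```python
import pandas as pd

# Hypothetical entity table: two distinct entities share the title "PARENT AREA".
entities = pd.DataFrame(
    {
        "title": ["PARENT AREA", "PARENT AREA", "RIVERSIDE"],
        "type": ["geo", "organization", "geo"],
        "description": ["a region", "an agency", "a town"],
    }
)

# Dropping duplicates on title only keeps the first row per title...
by_title = entities.drop_duplicates(subset="title")
print(len(by_title))  # 2 -- the "organization" node is silently discarded

# ...whereas including type preserves both distinct entities.
by_title_and_type = entities.drop_duplicates(subset=["title", "type"])
print(len(by_title_and_type))  # 3
```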
### Steps to reproduce
- Extract multiple entities with the same `title` but different `type`s.
- Call the `finalize_entities` method to process them.
- After processing, notice that only the first node with a given `title` is kept, and the others are discarded.
### Expected Behavior
Nodes with the same `title` but different `type`s should be treated as distinct entities during deduplication, rather than keeping only the first node and discarding the rest.
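A possible shape of the fix, sketched in pandas (hypothetical, not a patch against GraphRAG itself): merge duplicates per `(title, type)` pair, collecting descriptions for later summarization, so that no distinct entity is lost.

```python
import pandas as pd

# Hypothetical entity table with duplicates of the same (title, type) pair.
entities = pd.DataFrame(
    {
        "title": ["PARENT AREA", "PARENT AREA", "PARENT AREA"],
        "type": ["geo", "geo", "organization"],
        "description": ["a district", "a zone", "an agency"],
    }
)

# Group on both title and type; collect descriptions into a list per group
# so they can be passed to the model for summarization.
merged = entities.groupby(["title", "type"], as_index=False).agg(
    {"description": list}
)
print(len(merged))  # 2 -- one node per (title, type), nothing dropped
```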
### GraphRAG Config Used
```yaml
### This config file contains required core defaults that must be set, along with a handful of common optional settings.
### For a full list of available settings, see https://microsoft.github.io/graphrag/config/yaml/

### LLM settings ###
## There are a number of settings to tune the threading and token limits for LLM calls - check the docs.

models:
  default_chat_model:
    api_key: ${OPENAI_API_KEY} # set this in the generated .env file
    type: openai_chat # or azure_openai_chat
    auth_type: api_key # or azure_managed_identity
    model: gpt-4o-mini-2024-07-18
    model_supports_json: true # recommended if this is available for your model.
    parallelization_num_threads: 50
    parallelization_stagger: 0.3
    async_mode: threaded # or asyncio
    # audience: "https://cognitiveservices.azure.com/.default"
    # api_base: https://<instance>.openai.azure.com
    # api_version: 2024-02-15-preview
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
  default_embedding_model:
    api_key: ${OPENAI_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    auth_type: api_key # or azure_managed_identity
    model: text-embedding-3-large
    parallelization_num_threads: 50
    parallelization_stagger: 0.3
    async_mode: threaded # or asyncio
    # api_base: https://<instance>.openai.azure.com
    # api_version: 2024-02-15-preview
    # audience: "https://cognitiveservices.azure.com/.default"
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>

vector_store:
  default_vector_store:
    type: lancedb
    db_uri: output\lancedb
    container_name: default
    overwrite: True

embed_text:
  model_id: default_embedding_model
  vector_store_id: default_vector_store

### Input settings ###

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$$"

chunks:
  size: 1200
  overlap: 100
  group_by_columns: [id]

### Output settings ###
## If blob storage is specified in the following four sections,
## connection_string and container_name must be provided

cache:
  type: file # [file, blob, cosmosdb]
  base_dir: "cache"

reporting:
  type: file # [file, blob, cosmosdb]
  base_dir: "logs"

output:
  type: file # [file, blob, cosmosdb]
  base_dir: "output"

## only turn this on if running `graphrag index` with custom settings
## we normally use `graphrag update` with the defaults
update_index_output:
  # type: file # [file, blob, cosmosdb]
  # base_dir: "update_output"

### Workflow settings ###

extract_graph:
  model_id: default_chat_model
  prompt: "prompts/extract_graph.txt"
  entity_types: [organization, person, geo, event]
  max_gleanings: 1

summarize_descriptions:
  model_id: default_chat_model
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

extract_graph_nlp:
  text_analyzer:
    extractor_type: regex_english # [regex_english, syntactic_parser, cfg]

extract_claims:
  enabled: false
  model_id: default_chat_model
  prompt: "prompts/extract_claims.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 1

community_reports:
  model_id: default_chat_model
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes (embed_graph must also be enabled)

snapshots:
  graphml: true
  embeddings: true

### Query settings ###
## The prompt locations are required here, but each search method has a number of optional knobs that can be tuned.
## See the config docs: https://microsoft.github.io/graphrag/config/yaml/#query

local_search:
  prompt: "prompts/local_search_system_prompt.txt"

global_search:
  map_prompt: "prompts/global_search_map_system_prompt.txt"
  reduce_prompt: "prompts/global_search_reduce_system_prompt.txt"
  knowledge_prompt: "prompts/global_search_knowledge_system_prompt.txt"

drift_search:
  prompt: "prompts/drift_search_system_prompt.txt"
  reduce_prompt: "prompts/drift_search_reduce_prompt.txt"

basic_search:
  prompt: "prompts/basic_search_system_prompt.txt"
```
### Logs and screenshots
I saved the entities before the merge. You can see that two descriptions about PARENT AREA are missing.
### Additional Information
- GraphRAG Version: 1.2.0
- Operating System: Windows 10
- Python Version: 3.12.9
- Related Issues:

Could you please provide an estimated timeline for fixing this issue? Thanks.