[Fatal Bug]: Incorrect deduplication of entities with same title but different type
### Do you need to file an issue?
- [x] I have searched the existing issues and this bug is not already filed.
- [x] My model is hosted on OpenAI or Azure. If not, please look at the "model providers" issue and don't file a new one here.
- [x] I believe this is a legitimate bug, not just a question. If this is a question, please use the Discussions area.
### Describe the bug
In GraphRAG's `finalize_entities` method, there is an issue with how duplicate entities are handled. The pipeline groups entities by `title` and `type` and merges their descriptions into a list, which is then passed to the model for summarization. However, the drop-duplicates step in `finalize_entities` does not account for nodes that share a `title` but have different `type`s. As a result, when duplicates are removed, only the first node with a given `title` is kept and the other nodes are discarded.
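The behavior can be illustrated with a minimal pandas sketch (this is not GraphRAG's actual code; the DataFrame and entity values are made up for demonstration): deduplicating on `title` alone silently drops a node that shares a title with another entity but has a different type.

```python
import pandas as pd

# Hypothetical entity table: two distinct entities share the title "PARENT AREA".
entities = pd.DataFrame(
    {
        "title": ["PARENT AREA", "PARENT AREA", "RIVERSIDE"],
        "type": ["geo", "organization", "geo"],
        "description": ["a region", "an agency", "a town"],
    }
)

# Dropping duplicates on title only keeps the first row per title...
by_title = entities.drop_duplicates(subset="title")
print(len(by_title))  # 2 -- the "organization" node is silently discarded

# ...whereas including type preserves both distinct entities.
by_title_and_type = entities.drop_duplicates(subset=["title", "type"])
print(len(by_title_and_type))  # 3
```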
### Steps to reproduce
- Extract multiple entities with the same `title` but different `type`s.
- Call the `finalize_entities` method to process them.
- After processing, notice that only the first node with a given `title` is kept, and the others are discarded.
### Expected Behavior
Nodes with the same `title` but different `type`s should be treated as distinct entities during deduplication, rather than keeping only the first node and discarding the rest.
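A possible shape of the fix, sketched in pandas (hypothetical, not a patch against GraphRAG itself): merge duplicates per `(title, type)` pair, collecting descriptions for later summarization, so that no distinct entity is lost.

```python
import pandas as pd

# Hypothetical entity table with duplicates of the same (title, type) pair.
entities = pd.DataFrame(
    {
        "title": ["PARENT AREA", "PARENT AREA", "PARENT AREA"],
        "type": ["geo", "geo", "organization"],
        "description": ["a district", "a zone", "an agency"],
    }
)

# Group on both title and type; collect descriptions into a list per group
# so they can be passed to the model for summarization.
merged = entities.groupby(["title", "type"], as_index=False).agg(
    {"description": list}
)
print(len(merged))  # 2 -- one node per (title, type), nothing dropped
```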
### GraphRAG Config Used
```yaml
### This config file contains required core defaults that must be set, along with a handful of common optional settings.
### For a full list of available settings, see https://microsoft.github.io/graphrag/config/yaml/

### LLM settings ###
## There are a number of settings to tune the threading and token limits for LLM calls - check the docs.

models:
  default_chat_model:
    api_key: ${OPENAI_API_KEY} # set this in the generated .env file
    type: openai_chat # or azure_openai_chat
    auth_type: api_key # or azure_managed_identity
    model: gpt-4o-mini-2024-07-18
    model_supports_json: true # recommended if this is available for your model.
    parallelization_num_threads: 50
    parallelization_stagger: 0.3
    async_mode: threaded # or asyncio
    # audience: "https://cognitiveservices.azure.com/.default"
    # api_base: https://<instance>.openai.azure.com
    # api_version: 2024-02-15-preview
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
  default_embedding_model:
    api_key: ${OPENAI_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    auth_type: api_key # or azure_managed_identity
    model: text-embedding-3-large
    parallelization_num_threads: 50
    parallelization_stagger: 0.3
    async_mode: threaded # or asyncio
    # api_base: https://<instance>.openai.azure.com
    # api_version: 2024-02-15-preview
    # audience: "https://cognitiveservices.azure.com/.default"
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>

vector_store:
  default_vector_store:
    type: lancedb
    db_uri: output\lancedb
    container_name: default
    overwrite: True

embed_text:
  model_id: default_embedding_model
  vector_store_id: default_vector_store

### Input settings ###

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$$"

chunks:
  size: 1200
  overlap: 100
  group_by_columns: [id]

### Output settings ###
## If blob storage is specified in the following four sections,
## connection_string and container_name must be provided

cache:
  type: file # [file, blob, cosmosdb]
  base_dir: "cache"

reporting:
  type: file # [file, blob, cosmosdb]
  base_dir: "logs"

output:
  type: file # [file, blob, cosmosdb]
  base_dir: "output"

## only turn this on if running `graphrag index` with custom settings
## we normally use `graphrag update` with the defaults
update_index_output:
  # type: file # [file, blob, cosmosdb]
  # base_dir: "update_output"

### Workflow settings ###

extract_graph:
  model_id: default_chat_model
  prompt: "prompts/extract_graph.txt"
  entity_types: [organization, person, geo, event]
  max_gleanings: 1

summarize_descriptions:
  model_id: default_chat_model
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

extract_graph_nlp:
  text_analyzer:
    extractor_type: regex_english # [regex_english, syntactic_parser, cfg]

extract_claims:
  enabled: false
  model_id: default_chat_model
  prompt: "prompts/extract_claims.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 1

community_reports:
  model_id: default_chat_model
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes (embed_graph must also be enabled)

snapshots:
  graphml: true
  embeddings: true

### Query settings ###
## The prompt locations are required here, but each search method has a number of optional knobs that can be tuned.
## See the config docs: https://microsoft.github.io/graphrag/config/yaml/#query

local_search:
  prompt: "prompts/local_search_system_prompt.txt"

global_search:
  map_prompt: "prompts/global_search_map_system_prompt.txt"
  reduce_prompt: "prompts/global_search_reduce_system_prompt.txt"
  knowledge_prompt: "prompts/global_search_knowledge_system_prompt.txt"

drift_search:
  prompt: "prompts/drift_search_system_prompt.txt"
  reduce_prompt: "prompts/drift_search_reduce_prompt.txt"

basic_search:
  prompt: "prompts/basic_search_system_prompt.txt"
```
### Logs and screenshots
I saved the entities before the merge. You can see that two descriptions about PARENT AREA are missing.
### Additional Information
- GraphRAG Version: 1.2.0
- Operating System: Windows 10
- Python Version: 3.12.9
- Related Issues:

Could you please provide an estimated timeline for fixing this issue? Thanks.