Merge entities
Description
This pull request introduces an optional workflow called merge_entities, which can be run after the extract_graph workflow. It aims to merge duplicate or near-duplicate entities (e.g., car and cars, or PCA and principal component analysis) in the entity and relationship tables.
Motivation
Currently, GraphRAG may extract entities that are semantically similar but not identical. These duplicates increase the number of sparse or fragmented nodes in the knowledge graph and may negatively affect community detection and other downstream tasks.
By merging these entities, the graph becomes more semantically compact and meaningful, with improved structure and potentially better community coherence.
As a test, I created a graph about the soldering process. Without merging entities, "Increased board complexity" was a separate fragment and no community report was created for it. After merging entities, it is connected to the main node "soldering" and a community is created.
Proposed Changes
- Add a new optional merge_entities workflow
- Add config for merge_entities workflow (i.e. enable: true/false, ....)
- Add workflow to default workflows
- Add merge_entities prompt
- Add a JSON log file of LLM output to the output folder
Checklist
- ✅ I have tested these changes locally.
- ✅ I have reviewed the code changes.
- ❌ I have updated the documentation (if necessary).
- ❌ I have added appropriate unit tests (if applicable).
I would really appreciate any feedback. If you think this is a good feature, I will work on the documentation and unit tests.
Here are some examples of merged entities:
- SOLDER, merged from: SOLDER, MOLTEN SOLDER, SOLDER JOINTS, SOLDER JOINT, SOLDERED JOINT
- CLEANING, merged from: CLEANING, CLEANING PROCESSES, CLEANING PROCESS
- WAVE SOLDERING, merged from: WAVE SOLDERING, CS (WAVE SOLDERING) PROCESS
- MACHINE SOLDERING, merged from: MACHINE SOLDERING, SOLDERING MACHINE
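
To illustrate what a merge like the ones above means for the entity and relationship tables, here is a minimal sketch in plain Python. It is not the code in this PR: the helper names (`build_alias_map`, `merge_entity_rows`, `merge_relationship_rows`), the row fields (`title`, `description`, `source`, `target`, `weight`), the sample rows, and the choices to sum weights and drop self-loops are all assumptions made for illustration. It only covers applying merge groups once they are known, not how the LLM decides on them.

```python
from collections import defaultdict

# Merge groups taken from the examples above: canonical title -> merged titles.
merge_groups = {
    "SOLDER": ["SOLDER", "MOLTEN SOLDER", "SOLDER JOINTS", "SOLDER JOINT", "SOLDERED JOINT"],
    "MACHINE SOLDERING": ["MACHINE SOLDERING", "SOLDERING MACHINE"],
}


def build_alias_map(groups):
    """Map every merged title to its canonical title."""
    alias = {}
    for canonical, members in groups.items():
        for member in members:
            alias[member] = canonical
    return alias


def merge_entity_rows(entities, alias):
    """Collapse entity rows sharing a canonical title, collecting descriptions."""
    merged = {}
    for row in entities:
        title = alias.get(row["title"], row["title"])
        target = merged.setdefault(title, {"title": title, "descriptions": []})
        target["descriptions"].append(row["description"])
    return list(merged.values())


def merge_relationship_rows(relationships, alias):
    """Rewrite endpoints to canonical titles, sum weights, drop self-loops."""
    weights = defaultdict(float)
    for row in relationships:
        source = alias.get(row["source"], row["source"])
        target = alias.get(row["target"], row["target"])
        if source == target:
            continue  # both endpoints merged into the same entity
        weights[(source, target)] += row["weight"]
    return [{"source": s, "target": t, "weight": w} for (s, t), w in weights.items()]


# Made-up sample rows (hypothetical data, for illustration only).
entities = [
    {"title": "SOLDER", "description": "desc A"},
    {"title": "MOLTEN SOLDER", "description": "desc B"},
    {"title": "SOLDERING MACHINE", "description": "desc C"},
    {"title": "FLUX", "description": "desc D"},
]
relationships = [
    {"source": "MOLTEN SOLDER", "target": "SOLDER JOINT", "weight": 1.0},
    {"source": "SOLDER", "target": "FLUX", "weight": 2.0},
    {"source": "SOLDER JOINTS", "target": "FLUX", "weight": 1.0},
]

alias = build_alias_map(merge_groups)
merged_entities = merge_entity_rows(entities, alias)
merged_relationships = merge_relationship_rows(relationships, alias)
```

In this sketch, relationships are treated as directed, so (A, B) and (B, A) stay separate rows. Fragments that were only attached to a duplicate title end up attached to the canonical node, which is the effect described in the motivation.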