[Feature Request]: Automatic merging of the same entity under different names
Background
LightRAG currently merges entities only on exact name matches (including capitalization). This results in multiple disconnected nodes for the same entity under different names, and can even create isolated subgraphs for identical entities, ultimately degrading query performance.
Automated Entity Merging for Variant Names
To address this, we propose an automated entity merging approach for differently named but identical entities:
- **Vector node database utilization:** modify the node vector DB implementation to store an embedding of each entity name.
- **Similarity threshold configuration:** set a minimum cosine similarity threshold (e.g., 0.8) for candidate selection.
- **Candidate retrieval:** during merging, retrieve the top 10 most similar nodes above the threshold.
- **LLM-based merge validation:** submit the current entity's name/description along with the candidate entities' names/descriptions to an LLM, and task it to:
  - determine whether merging is justified, and
  - if merging is approved, select the best candidate and return the consolidated entity name and description.
- **Iterative merging with depth limitation (optional):** repeat the merge validation process for the newly consolidated entity returned by the LLM.
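The steps above could be sketched roughly as follows. This is only an illustrative outline, not LightRAG code: `embed`, `vector_db.search`, and `llm_complete` are placeholder names, and the threshold/top-k values simply restate the proposal.

```python
SIM_THRESHOLD = 0.8   # proposed minimum cosine similarity for a candidate
TOP_K = 10            # proposed number of candidates retrieved per entity

def find_merge_candidates(entity, vector_db, embed):
    """Retrieve up to TOP_K similar entities above the threshold."""
    vec = embed(entity["name"])  # the vector DB stores name embeddings
    hits = vector_db.search(vec, top_k=TOP_K)
    return [h for h in hits
            if h["score"] >= SIM_THRESHOLD and h["name"] != entity["name"]]

def validate_merge(entity, candidates, llm_complete):
    """Ask the LLM whether a merge is justified and, if so, with which candidate."""
    lines = [f"Entity: {entity['name']} - {entity['description']}", "Candidates:"]
    lines += [f"- {c['name']}: {c['description']}" for c in candidates]
    lines.append("If one candidate denotes the same real-world entity, reply "
                 "with the merged name and description; otherwise reply NO_MERGE.")
    return llm_complete("\n".join(lines))
```

The optional depth-limited iteration would then re-run `find_merge_candidates` on whatever consolidated entity the LLM returns, up to a fixed recursion depth.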
@LarFii Does this algorithm seem reasonable to you? Do you have any suggestions for improvement?
I might have something to note about this implementation, since I have done something similar in my own LightRAG. In my fork, before a chunk gets sent to the AI, we first do a hybrid search with query_param.context_only = true, and I modified build_context to only send back entities and relationships (not the original chunks). This then gets added to the "entity_extraction" prompt as extra information (after the examples, because this is better for caching tokens).
This way I can send existing knowledge graph information to the AI, and it can add relationships between existing data. The thing I noticed is that if, say, I am uploading 4 chunks at the same time, the second chunk won't be able to retrieve any of the extracted relationships/entities of the first chunk, because they have yet to be uploaded.
That is why my implementation now processes chunks one by one and is a lot slower, but I have many more useful/necessary relations in my knowledge graph and also fewer duplicates.
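The sequential pipeline described above could look something like this. It is a minimal sketch, assuming injected callables; `get_context`, `extract`, and `upsert` stand in for the modified LightRAG internals mentioned above and are not real API names.

```python
def index_chunks_sequentially(chunks, get_context, extract, upsert):
    """Process chunks one by one so each extraction can see prior results."""
    for chunk in chunks:
        # Hybrid search returning only existing entities/relationships
        # (the context_only + modified build_context approach above).
        context = get_context(chunk)
        # Appended after the few-shot examples so the prompt prefix
        # stays stable and token caching still applies.
        prompt = f"{chunk}\n\nKnown graph context:\n{context}"
        # Commit before the next chunk, so chunk N+1 sees chunk N's entities.
        upsert(extract(prompt))
```

The trade-off is exactly the one noted above: no chunk-level parallelism, in exchange for cross-chunk relationships and fewer duplicate nodes.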
As for what you mentioned I have some questions:
- Let's say I have a knowledge graph with 10,000 nodes. Does this mean that if I want to merge the nodes, the application could make 10,000 LLM calls?
- What if one entity matches 10 or more other entities? Will it merge all of them?
- Let's say 20 documents mention the same thing about the same entity. That entity will then accumulate a lot of repetitive description information. Does this approach take that into account?
- I also noticed that the file name gets sent with each entity/relationship so the AI can reference the file it came from. Is there a way to turn this off?
Node merging is a complex task. This issue addresses merging only the most obviously identical entities during the document indexing stage:
- Perform merge checks after all entities are extracted from a document.
- Limit candidate selection to the top n most similar nodes with a strict threshold to minimize unnecessary LLM calls for false positives.
Given this workflow, node auto-merging should be enabled from the very beginning of document indexing, so there won't be a situation of merging hundreds of nodes at once.
Additionally, disabling citations is worth considering, and we should introduce a query_param option to control this behavior.
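The gating described above could be sketched as follows. This is only an illustration of the scaling argument, not the planned implementation: the threshold and top-n values are assumed, and `search_similar`/`llm_merge_check` are hypothetical names. The point is that LLM calls scale with the entities extracted from the current document, not with the 10,000 nodes already in the graph.

```python
STRICT_THRESHOLD = 0.9  # assumed value; the plan only says "strict"
TOP_N = 5               # assumed top-n candidate limit

def merge_pass(new_entities, search_similar, llm_merge_check):
    """Run LLM merge checks only for freshly extracted entities that
    have at least one candidate above the strict threshold."""
    llm_calls = 0
    for ent in new_entities:  # scales with the document, not the whole graph
        cands = [c for c in search_similar(ent, top_k=TOP_N)
                 if c["score"] >= STRICT_THRESHOLD]
        if not cands:
            continue          # no surviving candidate: the LLM is never called
        llm_calls += 1
        llm_merge_check(ent, cands)
    return llm_calls
```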
I have an idea: for specific vertical domains, during data cleaning, the LLM should first use prompts to standardize synonyms, and then feed any missing synonyms back into the standardized list during vector processing. This can significantly reduce the workload of later merges.
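A minimal sketch of this synonym-standardization idea: a curated, domain-specific mapping applied during cleaning, extended whenever a missing variant is surfaced. The seed entries are made-up examples.

```python
# Example seed list for a hypothetical medical domain.
SYNONYMS = {"melanoma": "Melanoma", "skin ca": "skin cancer"}

def standardize(name, synonyms=SYNONYMS):
    """Map a raw entity name to its canonical form, if one is known."""
    return synonyms.get(name.strip().lower(), name)

def record_missing(variant, canonical, synonyms=SYNONYMS):
    """Feed a newly discovered synonym back into the standardized list."""
    synonyms[variant.strip().lower()] = canonical
```

Because standardization happens before indexing, identical entities arrive at the graph under one name and never need a merge at all.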
Good idea! The next version will support multiple versions of prompts, allowing users to flexibly select prompt versions. This can address the issue of domain synonyms you mentioned.
Is this feature currently being implemented?
I am using a form of manual merging with a simple Python app, found here. Look for the file _1_merge_GUI_??.py, where the question marks stand in for the version number. The app uses the LightRAG API to collect and list all the entities in the knowledge graph. Then I can filter on a substring to look for things that seem similar.
In this case I need a way to quickly understand whether cytochromes and cytoplasm are similar enough subjects that they can be merged. Because I can see them side by side, it is immediately evident that they are not similar and should not be merged. Otherwise I wouldn't know, because I am not familiar with the subject. But if I pull up items that are determined to be more or less the same, the application will merge the entities with the strategy I select. I can also edit the entity_type at the same time.
First I use the LightRAG API to get a list of every entity. Then I ask an AI like Gemini or ChatGPT to look through the list for things that might be similar. It will point out items like Melanoma, melanoma, and skin cancer. It would seem that they should all be merged, but when looking at the three side by side in my app, it becomes apparent that melanoma is only one form of skin cancer. So "Melanoma" and "melanoma" are selected for a merge while "skin cancer" is not included. We know that AI is not currently able to reliably figure out what should and should not be merged, by virtue of the fact that skin cancer was suggested along with melanoma for the merge. So a human in the loop is required. Still, I would never have found those two items for merge consideration if the AI had not pointed them out, so AI is required for this operation as well.
Soon I will add functionality to my merge app that collects data about my merge decisions which will be used to fine-tune an LLM to understand my preferences when merging data so that it can be done autonomously.
The work goes very quickly with my app, except that I need to shut down the LightRAG server at the console (Ctrl-C) and then restart it in order to see the changes in my simple merge app. This slows down the work significantly, and I asked here whether there is an API command my merge app could send after a merge that would cause the LightRAG server to refresh its data from disk.
This merging operation is also a wonderful opportunity to select meaningful entity_types for each entity other than just person, geo, event, organization, and category. My merge app already uses the LightRAG API to collect and edit this data but I will be adding more functionality to it.
The manual merging app mentioned above now easily does the following:
- Edit entity name
- Edit entity description
- Edit entity type
- Edit entity relationships
- Add new entity relationships
- Delete entities
- Delete entity relationships
- Show all entities of a particular category
- Show all entities that have no relations (orphans)
- Show all information about selected entities and their relations side by side with other selected entities in order to understand what operations from above need to be performed in order to clean up the data.
A big help for me is to use the API to get a list of all the entities. Then I give this list to any AI, such as Grok or Gemini, and ask it to look over the list and recommend candidates for merging. This catches all the duplicates written in different cases, like "Melanoma" and "melanoma", and it also catches pairs like "melanoma" and "skin cancer". The substring filter is also a great help in identifying candidates for a merge, rename, delete, or a new relationship.
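In the spirit of the workflow above, the case-only duplicates can even be surfaced without an AI at all. A small hypothetical helper, operating on the entity-name list pulled from the LightRAG API:

```python
from collections import defaultdict

def case_insensitive_duplicates(names):
    """Group entity names that differ only by case/whitespace,
    e.g. "Melanoma" and "melanoma"."""
    groups = defaultdict(list)
    for n in names:
        groups[n.strip().lower()].append(n)
    return [group for group in groups.values() if len(group) > 1]
```

Semantic near-duplicates like "melanoma" vs. "skin cancer" would still need the AI pass (and a human decision), but this catches the mechanical ones cheaply.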
Next I will see what can be done with the prompt at indexing time in order to ingest cleaner data so that less of the above will be required. After that experiment I will see if I can use the above application to collect training data in order to fine-tune the LLM we use for indexing.
Hello, is node merging a feature now? I am not sure how to use the branch merge_dev for this purpose, or if documentation is available yet.
Thank you
The merge_dev branch is currently under development. @LarFii
Hello,
Thanks again for LightRAG, and it's true that being able to merge nodes would be great.
I wonder, rather than trying to automate this with AI, wouldn't it be “simpler” to be able to click on several nodes and have a merge button in the WebUI? Like for deleting?
Or even, when renaming a node to the same name as another, instead of getting an error message, show a dialog box that offers to merge or to cancel the renaming.
I don't realize what this would entail, I'm just suggesting it ^^
Have a good day and thanks again.
Greetings, @Konsilion,
I plan to make a new tab in the WebUI and port over the application seen above in this thread. The app is currently written in Python, so it will need to be ported to JavaScript. It took 58 versions to discover exactly what I needed it to do, but now I know exactly what I want, so the conversion will go quickly.
The app is found here. Look for the file 1_merge_GUI??.py where the question marks stand in for the version number.
If someone gets to this before I do, that will be nice too. Otherwise I will get to it in the next month or two.
Hi @danielaskdd,
Thank you for sharing your plan and the details about the GUI app. Before I consider helping with the port to JavaScript and integration into the WebUI, could you clarify whether this work is part of the official LightRAG roadmap and will be merged into the main repository (e.g., the "merge-dev" branch)? Or is this planned as a separate fork or unofficial extension?
Personally, I prefer to stay focused on the official branch and roadmap of LightRAG, so I’d appreciate your guidance on how this fits with the project’s direction.
Thanks in advance!
Greetings @Konsilion
There seems to be some confusion. I am the person who made the Python app for editing the LightRAG database, and who intends to bring that functionality to the WebUI and make it open source, as the Python app already is. I had to build this app because I work with video transcripts, which have less reliable data than documents because of pronunciation issues.
I am not part of the LightRAG development team but @danielaskdd has given me assistance in my own work with LightRAG and I am deeply appreciative.
The LightRAG development team is free to use what I produce including modifications to the WebUI, just like anyone else, but there has been no communication about it.
Any news?