[Feature Request]: Synchronize the update of entity and relation descriptions upon document deletion.
Feature Request Description
LightRAG relies significantly on entity and relation summaries to generate accurate responses to queries; therefore, only removing chunks from a deleted document can lead to hallucinations.
If we retain all original entity and relation descriptions along with the corresponding document IDs from which these descriptions originate, we will be able to re-aggregate the descriptions after a specific document ID has been deleted.
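To make the idea concrete, here is a minimal sketch of that re-aggregation. It assumes a hypothetical in-memory store that maps each entity (or relation) name to a list of `(doc_id, description)` pairs; this is not LightRAG's actual storage schema, and `join_descriptions` is only a stand-in for whatever LLM summarization step would really be used.

```python
from typing import Callable

# Hypothetical layout: entity name -> list of (doc_id, description) pairs.
DescriptionStore = dict[str, list[tuple[str, str]]]


def join_descriptions(descriptions: list[str]) -> str:
    """Stand-in for an LLM summarization call (assumption, not LightRAG's API)."""
    return " ".join(descriptions)


def delete_doc_and_reaggregate(
    store: DescriptionStore,
    doc_id: str,
    summarize: Callable[[list[str]], str] = join_descriptions,
) -> dict[str, str]:
    """Drop every description that came from doc_id and rebuild affected summaries.

    Returns the new summary for each entity that lost at least one description
    but still has others. Entities described only by doc_id are removed.
    """
    new_summaries: dict[str, str] = {}
    for name in list(store):
        entries = store[name]
        remaining = [(d, text) for d, text in entries if d != doc_id]
        if len(remaining) == len(entries):
            continue  # entity was never described by the deleted document
        if remaining:
            store[name] = remaining
            new_summaries[name] = summarize([text for _, text in remaining])
        else:
            del store[name]  # no other document mentions this entity
    return new_summaries
```

Only the entities touched by the deleted document are re-summarized, so the cost of a delete scales with that document's footprint rather than with the whole knowledge base.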
@danielaskdd Can you clearly explain where I should start and what the steps are? I need a well-defined roadmap.
@danielaskdd I really need this feature, but at the moment, I don’t know where to start. I’m not very experienced with code, so understanding the entire project to build on top of it is quite challenging for me. I would really appreciate any support, or at least some initial guidance to help me move forward. Thank you so much!
Currently, the community's resources are still insufficient, and there are many important matters that take priority, such as:
- Allowing users to customize and switch prompt templates
- Automatically merging synonymous entities
- Supporting multiple workspaces (corpuses)
I believe the delete operation is more critical. Adding documents only incurs an embedding cost, and the other features don't significantly affect current system workflows. However, deleting data can have a major impact: it may disrupt queries, break relationships, or compromise data integrity if not handled properly. That's why delete should be prioritized and implemented with extra care.
@LarFii What is your perspective on establishing work priorities?
Hi @danielaskdd,
I do not understand your statement that "only removing chunks from a deleted document can lead to hallucinations". I think using data without the obsolete information is better than keeping chunks that should have been removed because they are obsolete.
I think it's important to perform an update and delete whenever a document or code file changes. I have also built a code graph for my project using an AST and am trying to figure out how to delete from it.
A plausible approach might be:
- Removing all nodes that are linked to one code file (say, using the doc_id)
- Removing all edges that start from the nodes found above
- Removing all edges that end at the nodes found above, and reprocessing the code files containing the starting nodes of these edges
If part of the call dependency is unchanged, reprocessing the file should reconstruct its nodes and rebuild the edges to the callees they were previously connected to (point 2). Reprocessing the other files that previously linked to the to-be-deleted document should rebuild those relationships correctly too (point 3).
I understand that my example is a call dependency, which has a clear one-way direction. Also, because it is a code call dependency, it is easier to find the code files that need reprocessing. For free-text documents, such as requirements, I need to read this project's source to better understand how their nodes and edges are constructed.
I took a quick look at the method adelete_by_doc_id, and it looks like it already does everything I mentioned above. I would love to help if possible, and it would be great if you could explain your idea in more depth.
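The three steps above can be sketched roughly as follows. This is only an illustration for a code call graph: `CodeGraph`, `node_file`, and `delete_file` are hypothetical names, not part of LightRAG, and the sketch assumes every node appearing in an edge is registered in `node_file`.

```python
from dataclasses import dataclass, field


@dataclass
class CodeGraph:
    # node id -> source file it was extracted from (plays the role of doc_id)
    node_file: dict[str, str] = field(default_factory=dict)
    # directed call edges as (caller, callee)
    edges: set[tuple[str, str]] = field(default_factory=set)


def delete_file(graph: CodeGraph, doc_id: str) -> set[str]:
    """Remove everything extracted from doc_id.

    Returns the other files whose outgoing edges were cut and which
    therefore need to be reprocessed.
    """
    # Step 1: find all nodes linked to the file being deleted.
    removed = {node for node, file in graph.node_file.items() if file == doc_id}

    files_to_reprocess: set[str] = set()
    kept_edges: set[tuple[str, str]] = set()
    for caller, callee in graph.edges:
        if caller in removed:
            continue  # Step 2: drop edges starting from a removed node
        if callee in removed:
            # Step 3: drop edges ending at a removed node and remember
            # the caller's file so it can be reprocessed later.
            files_to_reprocess.add(graph.node_file[caller])
            continue
        kept_edges.add((caller, callee))

    graph.edges = kept_edges
    for node in removed:
        del graph.node_file[node]
    return files_to_reprocess
```

For example, deleting `b.py` from a graph where `a.main` calls `b.helper` would remove `b.helper` and its edges and return `{"a.py"}` as the file to reprocess.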
The issue is being resolved; a version that reconstructs entities and relationships based on an LLM cache has been developed and is currently undergoing testing: https://github.com/HKUDS/LightRAG/tree/delete_doc
Hi @danielaskdd,
I scanned through the code and found that many places use a "get all" approach to load all the data. I wonder whether this will cause significant performance issues as the KB grows over time. I think the code should at least fetch data by doc_id or another explicit ID wherever possible.
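As a rough illustration of what I mean by fetching by doc_id, here is a small sketch of a reverse index. `DocEntityIndex` is a hypothetical helper for this comment, not something that exists in the LightRAG code base.

```python
from collections import defaultdict


class DocEntityIndex:
    """Hypothetical reverse index: doc_id -> names of entities extracted from it."""

    def __init__(self) -> None:
        self._by_doc: dict[str, set[str]] = defaultdict(set)

    def add(self, doc_id: str, entity: str) -> None:
        self._by_doc[doc_id].add(entity)

    def entities_for(self, doc_id: str) -> set[str]:
        # Direct lookup by doc_id; no scan over every stored entity.
        return set(self._by_doc.get(doc_id, ()))

    def remove_doc(self, doc_id: str) -> set[str]:
        # Returns the entities that were tied to the document being deleted.
        return self._by_doc.pop(doc_id, set())
```

With an index like this, a delete only needs to touch the entities returned by `remove_doc`, instead of loading the full set and filtering it.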
@kenspirit PRs #1732 and #1729 solved the get_all problem.