[Question] Best Practice for Updating

Open StreamlinedStartup opened this issue 4 months ago • 9 comments

Forgive me if this is a dumb question, but what is best practice on updating the index as the codebase grows? Does it have intelligent detection in that when re-running indexing it would only add new stuff, or is the Advanced Github Integration the way to go for that?

StreamlinedStartup avatar Aug 13 '25 18:08 StreamlinedStartup

This is an interesting question. I'm not a contributor here, but my naive thinking is that a hook in Claude Code could be triggered once a command such as a git action (commit) completes. It would be good to incorporate this into Roo Code as well.

ww2283 avatar Aug 14 '25 00:08 ww2283

@ww2283 interesting, I totally agree with you, it is an interesting feature to add.

yichuan-w avatar Aug 14 '25 01:08 yichuan-w

@StreamlinedStartup thanks for your interest. We don't have a very solid implementation of automatic index updates yet, such as one combined with GitHub Actions, but it is indeed a promising direction to work on. Stay tuned for our new release, we can work on that!

yichuan-w avatar Aug 14 '25 01:08 yichuan-w

cc @andylizf, we should take a look at this, it is practical.

yichuan-w avatar Aug 14 '25 01:08 yichuan-w

Go bears!

StreamlinedStartup avatar Aug 16 '25 00:08 StreamlinedStartup

Go Bears!

yichuan-w avatar Aug 16 '25 00:08 yichuan-w

I would love to have that too. I feel that's one of the current weaknesses of the system for fast-growing codebases (in our company we have 10M lines of code and ~200 commits per day, which means that if I wanted to use the LEANN MCP and have up-to-date answers, I would have to pretty much re-index the whole codebase every time I pull main, which for now is neither realistic nor practical).

Maybe having new interfaces in the builder like add / remove / update document and maybe adding a timestamp / content hash to the meta_data to be able to see if a document has changed? And doing kind of the same for the graph nodes (insertion / removal / updating a node and its edges)

E.g., making the API more flexible with something like the following, for documents:

  • ability to add document: process a new file, chunk it, create the graph nodes
  • to remove document: remove all chunks/nodes belonging to that file
  • to update document: re-process file, remove old chunks, add new chunks

And for graph nodes:

  • node insertion: add chunk to the graph structure
  • node deletion: remove chunk and update connections
  • node update: modify a chunk and reconnect

But it would also need an intelligent system to efficiently diff those changes during build and use those new interfaces when relevant.
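
The content-hash idea above could be sketched roughly as follows. This is a minimal illustration, not LEANN code: `build_manifest` and `diff_manifests` are hypothetical helper names, and the `*.py` pattern is only an example. The sketch stores one SHA-256 hash per file and compares two such manifests to decide which files would need the proposed add / remove / update operations.

```python
import hashlib
from pathlib import Path


def hash_file(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def build_manifest(root: Path, pattern: str = "*.py") -> dict[str, str]:
    """Map each matching file (path relative to root) to its content hash."""
    return {
        p.relative_to(root).as_posix(): hash_file(p)
        for p in sorted(root.rglob(pattern))
        if p.is_file()
    }


def diff_manifests(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Classify paths as added, removed, or updated between two manifests."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "updated": sorted(p for p in new.keys() & old.keys() if new[p] != old[p]),
    }
```

With a manifest saved at the last build, only the paths returned by `diff_manifests` would need to be re-chunked and re-embedded; unchanged files would be skipped entirely. How each category maps onto index operations depends on builder interfaces that, per this thread, do not exist yet.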

gabriel-dehan avatar Aug 17 '25 18:08 gabriel-dehan

Yeah, these features are exactly what we want. I can provide some insights from a vector database background.

For faiss: only IVFFlat and IVFPQ support both update and delete, so our current HNSW-based method can only support add (note that we have not provided this API in LEANN yet, but it should be easy to add). Note: we should turn off recompute here. (For DiskANN, only a static disk index is supported, so it cannot work that well.)
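
To illustrate why IVF-style indexes lend themselves to update and delete, here is a toy sketch in pure Python. It is not faiss's API and not LEANN code; `ToyIVFIndex` and its methods are hypothetical names, and the centroids are passed in rather than trained. The point is only that each vector sits in the list of its nearest centroid, so removing or replacing it touches a single list, whereas a graph index would also have to repair the edges of neighboring nodes.

```python
import math


class ToyIVFIndex:
    """Toy IVF-style index: each vector lives in the list of its nearest centroid."""

    def __init__(self, centroids):
        self.centroids = centroids
        # One inverted list per centroid: {vector id: vector}.
        self.lists = {i: {} for i in range(len(centroids))}
        self.assignment = {}  # vector id -> centroid index

    def _ranked_centroids(self, vec):
        """Centroid indices ordered from nearest to farthest."""
        return sorted(
            range(len(self.centroids)),
            key=lambda i: math.dist(vec, self.centroids[i]),
        )

    def add(self, vec_id, vec):
        c = self._ranked_centroids(vec)[0]
        self.lists[c][vec_id] = vec
        self.assignment[vec_id] = c

    def remove(self, vec_id):
        # Deletion only touches the one list that holds the vector.
        c = self.assignment.pop(vec_id)
        del self.lists[c][vec_id]

    def update(self, vec_id, vec):
        # An update is a remove followed by an add (possibly into another list).
        self.remove(vec_id)
        self.add(vec_id, vec)

    def search(self, query, k=1, nprobe=1):
        """Scan the nprobe nearest lists and return the ids of the k closest vectors."""
        probe = self._ranked_centroids(query)[:nprobe]
        candidates = [
            (math.dist(query, v), vid)
            for c in probe
            for vid, v in self.lists[c].items()
        ]
        return [vid for _, vid in sorted(candidates)[:k]]
```

A real IVF index would also train its centroids and, for IVFPQ, store compressed codes instead of raw vectors; those parts are left out here.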

For a commercial database: as I shared in #61, I think pgvector, Qdrant, or similar can work, but we cannot manage them as transparently; they are a good fallback plan.

Also, Cursor uses turbopuffer as its vector store (which uses SPANN, an IVF-based on-disk vector store) and a Merkle tree (the same structure used in GitHub) to handle updates. Ideally, we can build everything ourselves as open source; it should be easy! (Also, the choice of vector store on a local device is still unclear; right now I think Merkle tree + IVFPQ is a wonderful choice. cc @andylizf @gabriel-dehan for visibility.)
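
As a rough sketch of how a Merkle tree can drive updates: every directory's hash is derived from its children's hashes, so any subtree whose hash is unchanged can be skipped without looking inside it. The code below is an illustrative assumption, not how Cursor or LEANN implement it; the repository is modeled as a nested dict (bytes for file contents, dicts for directories), and `merkle_hash` and `changed_paths` are hypothetical names.

```python
import hashlib


def sha(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def merkle_hash(tree) -> str:
    """Hash a nested dict: leaves are file contents (bytes), dicts are directories."""
    if isinstance(tree, bytes):
        return sha(b"file:" + tree)
    # A directory's hash depends on its sorted (name, child hash) pairs.
    parts = [f"{name}:{merkle_hash(child)}" for name, child in sorted(tree.items())]
    return sha(("dir:" + "|".join(parts)).encode())


def changed_paths(old, new, prefix=""):
    """Return paths that differ, skipping subtrees whose hashes match.

    Added or removed entries are reported by their own path, even when
    the entry is a whole directory. Hashes are recomputed at every level
    for brevity; a real implementation would cache them per node.
    """
    if merkle_hash(old) == merkle_hash(new):
        return []
    if isinstance(old, bytes) or isinstance(new, bytes):
        return [prefix.rstrip("/")]
    out = []
    for name in sorted(old.keys() | new.keys()):
        if name not in old or name not in new:
            out.append(prefix + name)  # added or removed
        else:
            out.extend(changed_paths(old[name], new[name], prefix + name + "/"))
    return out
```

Only the returned paths would need re-chunking and re-embedding; everything under an unchanged directory hash is left alone.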

Here are the 4 sub-issues I listed to fully build a powerful code retrieval system

https://github.com/yichuan-w/LEANN/issues/41

yichuan-w avatar Aug 17 '25 19:08 yichuan-w

Also, I think 10M LoC is at most about 2M chunks, so in that case 2M × 768 × 4 bytes ≈ 6 GB. The storage and RAM overhead is not that significant (so we can use IVFPQ easily), and the storage benefit of LEANN usually comes from large personal files rather than code.
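
The estimate above reads as 2M chunks times 768 times 4, which can be checked with a few lines of arithmetic. The interpretation of 768 as the embedding dimension and 4 as bytes per float32 value is an assumption, as are the product-quantization parameters (`m=96` sub-quantizers at 8 bits each), which are picked only for illustration. The PQ figure counts code storage only, not centroids, codebooks, or ids.

```python
def flat_index_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw storage for uncompressed vectors (float32 by default)."""
    return num_vectors * dim * bytes_per_value


def pq_code_bytes(num_vectors: int, m: int, nbits: int = 8) -> int:
    """Storage for product-quantization codes: m sub-quantizers of nbits each."""
    return num_vectors * m * nbits // 8


num_chunks = 2_000_000  # upper bound on chunks quoted in the thread
flat = flat_index_bytes(num_chunks, dim=768)
pq = pq_code_bytes(num_chunks, m=96)  # m=96 is an illustrative assumption

print(f"flat float32: {flat / 1e9:.3f} GB")  # 6.144 GB
print(f"PQ codes:     {pq / 1e9:.3f} GB")    # 0.192 GB
```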

yichuan-w avatar Aug 17 '25 20:08 yichuan-w