fix: optimize knowledge graph clustering for large corpus

Open aubford opened this issue 9 months ago • 0 comments

Current implementation of find_indirect_clusters runs at exponential time because the depth-first search always explores every path in the graph. A kg w/ ~3000 relationships takes over 3 hours on a 2024 M4 Macbook Pro. This brings it down to quad/cubic time relative to testset size (instead of kg relationships). Generating a 100 sample testset of abstract multihop queries for a KG of 1 million relationships takes about 20 seconds.

Added new find_n_indirect_clusters method and applied it to MultiHopAbstractQuerySynthesizer.
No longer searches the entire graph, just enough to get a well-diversified randomized sample of clusters for the desired testset size.
Added lots of tests. I can scale them back. They're mostly to demo the spec so you can easily compare the new behavior with the original find_indirect_clusters which I left in place.
New behavior adds randomization and diversity to cluster sampling.
- Will no longer return subsets along with their supersets, only the superset.
Also fixed the other performance bottleneck in MultiHopAbstractQuerySynthesizer (collecting child nodes).
Eliminates potentially expensive edge-case where LLM calls could be made for every possible cluster in the KG.
In depth details can be found in the find_n_indirect_clusters docstring.

Mar 17 '25 18:03 aubford