ragas icon indicating copy to clipboard operation
ragas copied to clipboard

fix: optimize knowledge graph clustering for large corpus

Open aubford opened this issue 9 months ago • 0 comments

Current implementation of find_indirect_clusters runs at exponential time because the depth-first search always explores every path in the graph. A kg w/ ~3000 relationships takes over 3 hours on a 2024 M4 Macbook Pro. This brings it down to quad/cubic time relative to testset size (instead of kg relationships). Generating a 100 sample testset of abstract multihop queries for a KG of 1 million relationships takes about 20 seconds.

  • Added new find_n_indirect_clusters method and applied it to MultiHopAbstractQuerySynthesizer.
  • No longer searches the entire graph, just enough to get a well-diversified randomized sample of clusters for the desired testset size.
  • Added lots of tests. I can scale them back. They're mostly to demo the spec so you can easily compare the new behavior with the original find_indirect_clusters which I left in place.
  • New behavior adds randomization and diversity to cluster sampling.
    • Will no longer return subsets along with their supersets, only the superset.
  • Also fixed the other performance bottleneck in MultiHopAbstractQuerySynthesizer (collecting child nodes).
  • Eliminates potentially expensive edge-case where LLM calls could be made for every possible cluster in the KG.
  • In depth details can be found in the find_n_indirect_clusters docstring.

aubford avatar Mar 17 '25 18:03 aubford