ragas
ragas copied to clipboard
fix: optimize knowledge graph clustering for large corpus
Current implementation of find_indirect_clusters runs at exponential time because the depth-first search always explores every path in the graph. A kg w/ ~3000 relationships takes over 3 hours on a 2024 M4 Macbook Pro. This brings it down to quad/cubic time relative to testset size (instead of kg relationships). Generating a 100 sample testset of abstract multihop queries for a KG of 1 million relationships takes about 20 seconds.
- Added new find_n_indirect_clusters method and applied it to MultiHopAbstractQuerySynthesizer.
- No longer searches the entire graph, just enough to get a well-diversified randomized sample of clusters for the desired testset size.
- Added lots of tests. I can scale them back. They're mostly to demo the spec so you can easily compare the new behavior with the original find_indirect_clusters which I left in place.
- New behavior adds randomization and diversity to cluster sampling.
- Will no longer return subsets along with their supersets, only the superset.
- Also fixed the other performance bottleneck in MultiHopAbstractQuerySynthesizer (collecting child nodes).
- Eliminates potentially expensive edge-case where LLM calls could be made for every possible cluster in the KG.
- In depth details can be found in the find_n_indirect_clusters docstring.