
Perf optimizations in map_query_to_entities()

Open mmaitre314 opened this issue 1 year ago • 1 comment

Description

Perf optimizations in map_query_to_entities().

Results of a perf benchmark run with pytest on a list of 1M entities. This corresponds roughly to one map_query_to_entities() call over 50K entities with the defaults k = 10 and oversample_scaler = 2, i.e. 10 × 2 = 20 lookups that each scan up to 50K entities:

  • Original get_entity_by_key(): 2.15s
  • Optimized get_entity_by_key(): 0.10s
  • Lookup get_entity_by_id(): 0.00s (below the precision of pytest's reported durations)
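For readers who want to reproduce the shape of this comparison, here is a minimal timing sketch. It is not the benchmark from this PR: the `Entity` class, the `scan()` helper, and the `timed()` wrapper are hypothetical stand-ins, and the list is smaller than 1M so it runs quickly. It only illustrates why a worst-case linear scan and a dictionary lookup diverge as the entity count grows.

```python
import time
from dataclasses import dataclass


@dataclass
class Entity:
    """Hypothetical stand-in for the real entity model."""
    id: str


def scan(entities, value):
    """O(N) lookup: walk the list until an entity with a matching id is found."""
    for entity in entities:
        if entity.id == value:
            return entity
    return None


def timed(fn, *args):
    """Call fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


N = 100_000  # assumption: smaller than the 1M used in the PR, to keep the run short
entities = [Entity(id=str(i)) for i in range(N)]
by_id = {entity.id: entity for entity in entities}
target = str(N - 1)  # worst case for the scan: the match is the last element

scan_hit, scan_seconds = timed(scan, entities, target)
dict_hit, dict_seconds = timed(by_id.get, target)
print(f"scan: {scan_seconds:.4f}s  dict: {dict_seconds:.6f}s")
```

Absolute numbers depend on the machine and Python version, so the sketch prints timings rather than asserting on them.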

Related Issues

#1275

Proposed Changes

  • In the default case where embedding_vectorstore_key == EntityVectorStoreKey.ID, use the fact that entities are already stored in a dictionary to perform an O(1) lookup instead of an O(N) scan. The lookup is implemented in a new method called get_entity_by_id().
  • In the general case, optimize get_entity_by_key() by moving the isinstance(), is_valid_uuid(), and replace() calls out of the loop and calling getattr() once per entity instead of twice.
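The two changes above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the `Entity` dataclass and the `is_valid_uuid()` helper are simplified stand-ins, and the signatures and the dash-stripping fallback are guesses at the general shape of such a lookup.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any
from uuid import UUID


@dataclass
class Entity:
    """Hypothetical stand-in for the real entity model."""
    id: str
    title: str


def is_valid_uuid(value: str) -> bool:
    """Return True if value parses as a UUID (simplified stand-in helper)."""
    try:
        UUID(str(value))
    except ValueError:
        return False
    return True


def get_entity_by_id(entities: dict[str, Entity], value: str) -> Entity | None:
    """O(1) lookup: try the id as given, then without dashes if it is a UUID."""
    entity = entities.get(value)
    if entity is None and is_valid_uuid(value):
        entity = entities.get(value.replace("-", ""))
    return entity


def get_entity_by_key(entities, key: str, value: Any) -> Entity | None:
    """O(N) scan with the type check and string work done once, before the loop."""
    if isinstance(value, str) and is_valid_uuid(value):
        value_no_dashes = value.replace("-", "")  # computed once, not per entity
        for entity in entities:
            entity_value = getattr(entity, key)  # single getattr() per entity
            if entity_value in (value, value_no_dashes):
                return entity
    else:
        for entity in entities:
            if getattr(entity, key) == value:
                return entity
    return None
```

The dictionary path only applies when the lookup key is the entity ID; for any other attribute the scan is still needed, which is why both functions are kept.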

Checklist

  • [x] I have tested these changes locally.
  • [x] I have reviewed the code changes.
  • [ ] I have updated the documentation (if necessary).
  • [x] I have added appropriate unit tests (if applicable).

mmaitre314 · Oct 13 '24 21:10

@microsoft-github-policy-service agree company="Microsoft"

mmaitre314 · Oct 13 '24 21:10