Perf optimizations in map_query_to_entities()

Open mmaitre314 opened this issue 1 year ago • 1 comment

Description

Performance optimizations for the entity lookups performed by map_query_to_entities().

Results of a performance benchmark run with PyTest on a list of 1M entities (roughly equivalent to calling map_query_to_entities() with 50K entities and the defaults k = 10 and oversample_scaler = 2, since 50K × 10 × 2 = 1M):

Original get_entity_by_key(): 2.15s
Optimized get_entity_by_key(): 0.10s
New get_entity_by_id() lookup: 0.00s (below the precision of PyTest's duration reporting)
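
The gap between a scan and a dictionary lookup can be observed with a small timing harness. The sketch below is not the benchmark used for the numbers above; `Entity`, `scan_by_key()`, and `time_lookups()` are hypothetical names introduced for illustration, and absolute timings will vary by machine.

```python
import time
from dataclasses import dataclass


@dataclass
class Entity:
    # Hypothetical stand-in for the project's entity model.
    id: str
    title: str


def scan_by_key(entities, key, value):
    """Linear scan over all entities: O(N) per lookup."""
    for entity in entities:
        if getattr(entity, key) == value:
            return entity
    return None


def time_lookups(n_entities=50_000, n_lookups=20):
    """Time n_lookups worst-case lookups, first by scan, then by dict."""
    by_id = {
        str(i): Entity(id=str(i), title=f"entity-{i}") for i in range(n_entities)
    }
    target = str(n_entities - 1)  # last entity inserted: worst case for the scan

    start = time.perf_counter()
    for _ in range(n_lookups):
        scanned = scan_by_key(by_id.values(), "id", target)
    scan_seconds = time.perf_counter() - start

    start = time.perf_counter()
    for _ in range(n_lookups):
        looked_up = by_id.get(target)
    dict_seconds = time.perf_counter() - start

    return scanned, looked_up, scan_seconds, dict_seconds


if __name__ == "__main__":
    scanned, looked_up, scan_s, dict_s = time_lookups()
    print(f"scan: {scan_s:.4f}s  dict: {dict_s:.6f}s  same entity: {scanned is looked_up}")
```

The defaults (50K entities, 20 lookups) mirror the 50K × 10 × 2 workload described above, under the assumption that each lookup scans the full list.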

Related Issues

#1275

Proposed Changes

In the default case, where embedding_vectorstore_key == EntityVectorStoreKey.ID, take advantage of the entities already being stored in a dictionary keyed by ID and perform an O(1) lookup instead of an O(N) scan. The lookup is implemented in a new function, get_entity_by_id().
In the general case, optimize get_entity_by_key() by hoisting isinstance(), is_valid_uuid(), and replace() out of the loop and by calling getattr() once per entity instead of twice.
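
The two changes above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `Entity` dataclass and the body of `is_valid_uuid()` are stand-ins, and the dash-stripping fallback in `get_entity_by_id()` is an assumption made to mirror the behavior of the scan.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Union


@dataclass
class Entity:
    # Hypothetical stand-in for the project's entity model.
    id: str
    title: str


def is_valid_uuid(value: str) -> bool:
    """Stand-in helper: True if value parses as a UUID (with or without dashes)."""
    try:
        uuid.UUID(value)
    except ValueError:
        return False
    return True


def get_entity_by_id(entities: Dict[str, Entity], value: str) -> Optional[Entity]:
    """O(1) dictionary lookup, used when the key is the entity ID."""
    entity = entities.get(value)
    # Assumption: IDs may be stored without dashes, so retry in that form.
    if entity is None and is_valid_uuid(value):
        entity = entities.get(value.replace("-", ""))
    return entity


def get_entity_by_key(
    entities: Iterable[Entity], key: str, value: Union[str, int]
) -> Optional[Entity]:
    """O(N) scan with the loop-invariant work done once, before the loop."""
    if isinstance(value, str) and is_valid_uuid(value):
        # Type check, UUID validation, and replace() run once, not per entity.
        candidates = (value, value.replace("-", ""))
        for entity in entities:
            # A single getattr() per entity covers both accepted forms.
            if getattr(entity, key) in candidates:
                return entity
    else:
        for entity in entities:
            if getattr(entity, key) == value:
                return entity
    return None
```

A caller would pick the dictionary path when the key is the ID and fall back to the scan otherwise, for example `get_entity_by_id(by_id, some_id)` versus `get_entity_by_key(by_id.values(), "title", some_title)`.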

Checklist

[x] I have tested these changes locally.
[x] I have reviewed the code changes.
[ ] I have updated the documentation (if necessary).
[x] I have added appropriate unit tests (if applicable).

Oct 13 '24 21:10 mmaitre314

@microsoft-github-policy-service agree company="Microsoft"

Oct 13 '24 21:10 mmaitre314