Fix caching calls to `_vector_for_key_cached` and `_out_of_vocab_vector_cached`

- `_query_is_cached` will always return `False` because `key` should be wrapped in a tuple.
- `lru_cache` unifies `args`, `kwargs`, and default arguments in a call with the `get_default_args` magic in order to generate a consistent cache key. What this means is that: (a) all default arguments become part of `kwargs`; (b) any positional argument that has a default value is also converted to a `kwargs` entry; (c) for a parameter that has no default value, providing it positionally in one call and as a keyword in another produces two different cache keys. Therefore `_out_of_vocab_vector_cached._cache.get(((key,), frozenset([('normalized', normalized)])))` will always return `False`, since the actual cache key is `((key,), frozenset([('normalized', normalized), ('force', force)]))`.
- It's wasteful to call `_cache.get` and throw away the result, so I changed `_query_is_cached` to `_query_cached`.
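The positional-vs-keyword pitfall isn't specific to pymagnitude's custom LRU cache. As a rough illustration (using the stdlib `functools.lru_cache` rather than pymagnitude's implementation; the `lookup` function below is hypothetical), the same logical call made once positionally and once with a keyword argument is cached under two distinct keys:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def lookup(key, normalized=True):
    # Stand-in for an expensive vector lookup.
    return (key, normalized)

# Logically identical calls, but cached under two different keys:
lookup('hello', True)             # positional  -> cache miss
lookup('hello', normalized=True)  # keyword     -> another cache miss

info = lookup.cache_info()
print(info.hits, info.misses)  # 0 2
```

This is why a hand-built cache key that omits a default argument (or mixes positional and keyword forms) will never match the key the cache actually stored.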
I'm surprised that this PR has received no attention, because it improves the performance of our service by a large margin. Here is a code snippet to help illustrate the effect of this change:
```python
from collections import defaultdict

import pandas as pd
from pymagnitude import *

words = ['hello', 'world', 'cache', 'helllloooo', 'wooooorlddddd', 'caaaaache']
reversed_words = list(reversed(words))

vector = Magnitude(path=MagnitudeUtils.download_model('glove/medium/glove.twitter.27B.25d', log=True),
                   language='en',
                   lazy_loading=2400000)

vector_attrs = ['query', '_vector_for_key_cached', '_out_of_vocab_vector_cached']

def log_cached(vector):
    data = defaultdict(list)
    cache_attrs = ['size', 'lookups', 'hits', 'misses', 'evictions']
    for attr in vector_attrs:
        for cache_attr in cache_attrs:
            data[cache_attr].append(getattr(getattr(vector, attr)._cache, cache_attr))
    df = pd.DataFrame(data, index=vector_attrs)
    print(df, '\n')

print('### Query ...')
vector.query(words)
log_cached(vector)

print('### Query reverse ...')
vector.query(reversed_words)
log_cached(vector)
```
Output before the change:

```
### Query ...
                                size  lookups  hits  misses  evictions
query                           1000        1     0       1          0
_vector_for_key_cached       2400000        6     0       6          0
_out_of_vocab_vector_cached  2400000        9     0       9          0

### Query reverse ...
                                size  lookups  hits  misses  evictions
query                           1000        2     0       2          0
_vector_for_key_cached       2400000       12     0      12          0
_out_of_vocab_vector_cached  2400000       18     3      15          0
```
Output after the change:

```
### Query ...
                                size  lookups  hits  misses  evictions
query                           1000        1     0       1          0
_vector_for_key_cached       2400000        6     0       6          0
_out_of_vocab_vector_cached  2400000        9     0       9          0

### Query reverse ...
                                size  lookups  hits  misses  evictions
query                           1000        2     0       2          0
_vector_for_key_cached       2400000       12     6       6          0
_out_of_vocab_vector_cached  2400000       12     3       9          0
```
I also created https://github.com/zfang/benchmark_pymagnitude just for testing my patch.
Hi @zfang,
Thanks for this PR; this likely broke at some point when I modified the underlying LRU cache. Sorry, I've been traveling for the last week or so. I'll get around to reviewing this tonight and merging it in the next few days.
I'll also add some tests to make sure the cache works and prevent regressions in the future.