
Fix caching calls to `_vector_for_key_cached` and `_out_of_vocab_vector_cached`

Open zfang opened this issue 6 years ago • 2 comments

  1. _query_is_cached always returns False because key must be wrapped in a tuple.

  2. lru_cache unifies args, kwargs, and default args in a call (via the get_default_args magic) in order to generate a consistent cache key. This means that: (a) all default args become part of kwargs; (b) any positional arg that has a default value is also converted to a kwarg; (c) for a parameter with no default value, passing it positionally in one call and as a kwarg in another produces two different cache keys. Therefore _out_of_vocab_vector_cached._cache.get(((key,), frozenset([('normalized', normalized)]))) always misses, since the actual cache key is ((key,), frozenset([('normalized', normalized), ('force', force)])).

  3. It's wasteful to call _cache.get and then throw away the result, so I changed _query_is_cached to _query_cached, which returns the cached value for reuse.
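The points above can be illustrated with a toy sketch. This is not pymagnitude's actual implementation; make_key and out_of_vocab_vector are hypothetical stand-ins that mimic the described key format ((args_tuple, frozenset(kwargs.items()))), showing why a probe that omits a defaulted kwarg such as force can never match, and why a single .get that returns the value beats a membership check followed by a second lookup:

```python
def make_key(args, kwargs):
    # Hypothetical key builder: positional args as a tuple plus keyword
    # args (including filled-in defaults) as a frozenset of items.
    return (tuple(args), frozenset(kwargs.items()))

cache = {}

def out_of_vocab_vector(key, normalized=True, force=False):
    # All defaults are folded into kwargs when the real key is built.
    cache_key = make_key((key,), {'normalized': normalized, 'force': force})
    cache[cache_key] = 'vector-for-' + key
    return cache[cache_key]

out_of_vocab_vector('helllloooo')

# Probe mirroring the buggy lookup: 'force' is missing from the kwargs set.
bad_probe = make_key(('helllloooo',), {'normalized': True})
# Probe with the full kwargs set, as the cache actually stores it.
good_probe = make_key(('helllloooo',), {'normalized': True, 'force': False})

print(cache.get(bad_probe))   # None: this key can never match
print(cache.get(good_probe))  # hit

# Point 3: fetch once and reuse the value, instead of an is-cached check
# followed by a second lookup of the same key.
hit = cache.get(good_probe)
if hit is not None:
    vec = hit  # no second cache operation needed
```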

zfang avatar Feb 07 '19 05:02 zfang

I'm surprised that this PR has received no attention, because it improves the performance of our service by a large margin. Here is a code snippet to help understand the effect of this change:

from collections import defaultdict

import pandas as pd
from pymagnitude import *

words = ['hello', 'world', 'cache', 'helllloooo', 'wooooorlddddd', 'caaaaache', ]
reversed_words = list(reversed(words))

vector = Magnitude(path=MagnitudeUtils.download_model('glove/medium/glove.twitter.27B.25d', log=True),
                   language='en',
                   lazy_loading=2400000)
vector_attrs = ['query', '_vector_for_key_cached', '_out_of_vocab_vector_cached']


def log_cached(vector):
    data = defaultdict(list)
    cache_attrs = ['size', 'lookups', 'hits', 'misses', 'evictions']
    for attr in vector_attrs:
        for cache_attr in cache_attrs:
            data[cache_attr].append(getattr(getattr(vector, attr)._cache, cache_attr))
    df = pd.DataFrame(data, index=vector_attrs)
    print(df, '\n')


print('### Query ...')
vector.query(words)
log_cached(vector)

print('### Query reverse ...')
vector.query(reversed_words)
log_cached(vector)

Output before the change:

### Query ...
                                size  lookups  hits  misses  evictions
query                           1000        1     0       1          0
_vector_for_key_cached       2400000        6     0       6          0
_out_of_vocab_vector_cached  2400000        9     0       9          0 

### Query reverse ...
                                size  lookups  hits  misses  evictions
query                           1000        2     0       2          0
_vector_for_key_cached       2400000       12     0      12          0
_out_of_vocab_vector_cached  2400000       18     3      15          0 

Output after the change:

### Query ...
                                size  lookups  hits  misses  evictions
query                           1000        1     0       1          0
_vector_for_key_cached       2400000        6     0       6          0
_out_of_vocab_vector_cached  2400000        9     0       9          0 

### Query reverse ...
                                size  lookups  hits  misses  evictions
query                           1000        2     0       2          0
_vector_for_key_cached       2400000       12     6       6          0
_out_of_vocab_vector_cached  2400000       12     3       9          0 

I also created https://github.com/zfang/benchmark_pymagnitude just for testing my patch.

zfang avatar Feb 09 '19 06:02 zfang

Hi @zfang,

Thanks for this PR; this likely broke at some point when I modified the underlying LRU cache. Sorry, I've been traveling for the last week or so. I'll get around to reviewing this tonight and merging it in the next few days.

I'll also add some tests to make sure the cache works and prevent regressions in the future.

AjayP13 avatar Feb 12 '19 01:02 AjayP13