string_categorical_encoders
Do not memorize all entries in MinHashEncoder
In its current version, MinHashEncoder uses a dictionary to cache all the inputs. With a very large dataset, the cache grows with the number of distinct inputs and can exhaust memory.
I suggest two modifications:
- Not storing anything in fit
- In transform, using an LRUDict (code below) to memoize with a bounded cache.
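import collections


class LRUDict:
    """Dict-like cache that keeps at most `capacity` entries."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = collections.OrderedDict()

    def __getitem__(self, key):
        try:
            # Pop and re-insert so the key becomes the most recently used.
            value = self.cache.pop(key)
            self.cache[key] = value
            return value
        except KeyError:
            # A miss returns the sentinel -1 instead of raising.
            return -1

    def __setitem__(self, key, value):
        try:
            self.cache.pop(key)
        except KeyError:
            # New key: evict the least recently used entry if the cache is full.
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)
        self.cache[key] = value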
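To illustrate the two suggestions together, here is a minimal sketch of an encoder that stores nothing in fit and memoizes in transform with a bounded cache. It is not the actual MinHashEncoder code: `BoundedCacheEncoder`, `toy_encode` and `n_computed` are hypothetical names, and the CRC32 call only stands in for the real min-hash computation. It inlines the same LRU idea with `OrderedDict.move_to_end` rather than reusing the class above, and assumes `capacity >= 1`.

```python
from collections import OrderedDict
import zlib


def toy_encode(value):
    # Hypothetical stand-in for the expensive per-string min-hash computation.
    return zlib.crc32(value.encode("utf-8"))


class BoundedCacheEncoder:
    """Hypothetical encoder: nothing stored in fit, LRU-bounded memo in transform."""

    def __init__(self, capacity=1000):
        self.capacity = capacity  # assumed to be >= 1
        self.cache = OrderedDict()
        self.n_computed = 0  # counts cache misses, for illustration only

    def fit(self, values):
        # Nothing is stored here, so fitting never grows memory.
        return self

    def _encode(self, value):
        if value in self.cache:
            self.cache.move_to_end(value)  # mark as most recently used
            return self.cache[value]
        result = toy_encode(value)
        self.n_computed += 1
        if len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used entry
        self.cache[value] = result
        return result

    def transform(self, values):
        return [self._encode(v) for v in values]


encoder = BoundedCacheEncoder(capacity=2).fit([])
codes = encoder.transform(["a", "b", "a", "c", "b"])
```

With a capacity of 2, the second "a" is served from the cache, while the final "b" has to be recomputed because it was evicted when "c" arrived. The cache never holds more than two entries, whatever the size of the input.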
Cc @twsthomas
I have added it in my PR "Add fast_hash for column_encoder.MinHashing()" #4
cf https://github.com/pcerda/string_categorical_encoders/pull/4/commits/b81f76598b71fe59f5c5ebd2ee62533976475491