
Do not memorize all entries in MinHashEncoder

Open GaelVaroquaux opened this issue 5 years ago • 2 comments

In its current version, MinHashEncoder uses a dictionary to cache all the inputs. With a very large dataset, this will lead to a memory explosion.

I suggest two modifications:

  • Not storing anything in fit
  • In transform, using an LRUDict (code below) to memoize results in a cache of bounded size.
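import collections


class LRUDict:
    """Dict-like cache holding at most `capacity` entries (least recently used evicted first)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = collections.OrderedDict()

    def __getitem__(self, key):
        try:
            # Pop and re-insert so the key becomes the most recently used.
            value = self.cache.pop(key)
            self.cache[key] = value
            return value
        except KeyError:
            # Cache miss: -1 is a sentinel, so -1 itself must not be stored as a value.
            return -1

    def __setitem__(self, key, value):
        try:
            # Existing key: drop it so the re-insert below refreshes its position.
            self.cache.pop(key)
        except KeyError:
            # New key: if the cache is full, evict the least recently used entry.
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)
        self.cache[key] = value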
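
To make the two suggestions concrete, here is a minimal sketch of how fit and transform could look with a bounded cache. Everything in it is an assumption for illustration: `ToyHashEncoder`, `_encode_one`, `n_components` and `cache_size` are hypothetical names, the CRC32-based hashing is only a stand-in for the real min-hash computation, and this variant of the cache adds a `__contains__` method and raises `KeyError` on a miss instead of returning -1. It is not the actual MinHashEncoder code.

```python
import collections
import zlib


class LRUDict:
    """Bounded cache; variant of the class above that supports `in` (an addition)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = collections.OrderedDict()

    def __contains__(self, key):
        return key in self.cache

    def __getitem__(self, key):
        self.cache.move_to_end(key)  # raises KeyError on a miss
        return self.cache[key]

    def __setitem__(self, key, value):
        if key in self.cache:
            self.cache.move_to_end(key)
        elif len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used entry
        self.cache[key] = value


class ToyHashEncoder:
    """Hypothetical stand-in for an encoder; not the library's MinHashEncoder."""

    def __init__(self, n_components=4, cache_size=2):
        self.n_components = n_components
        self.cache_size = cache_size

    def fit(self, X):
        # Nothing from X is stored: fit only creates an empty, bounded cache.
        self.cache_ = LRUDict(self.cache_size)
        return self

    def _encode_one(self, s):
        # Placeholder for the expensive step: minimum of a salted CRC32
        # over the character 3-grams of the string, one value per component.
        ngrams = [s[i:i + 3] for i in range(max(1, len(s) - 2))]
        return tuple(
            min(zlib.crc32(f"{seed}:{g}".encode()) for g in ngrams)
            for seed in range(self.n_components)
        )

    def transform(self, X):
        out = []
        for s in X:
            if s in self.cache_:
                out.append(self.cache_[s])  # cache hit, refreshes recency
            else:
                value = self._encode_one(s)
                self.cache_[s] = value  # may evict the oldest entry
                out.append(value)
        return out


enc = ToyHashEncoder(cache_size=2).fit([])
encoded = enc.transform(["paris", "london", "paris", "berlin"])
```

With `cache_size=2`, the repeated "paris" is served from the cache, and inserting "berlin" evicts "london", so memory stays bounded regardless of how many distinct strings are seen.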

GaelVaroquaux, Aug 07 '19 14:08

Cc @twsthomas

GaelVaroquaux, Aug 07 '19 14:08

I have added it in my PR "Add fast_hash for column_encoder.MinHashing()" #4

cf https://github.com/pcerda/string_categorical_encoders/pull/4/commits/b81f76598b71fe59f5c5ebd2ee62533976475491

TwsThomas, Aug 07 '19 16:08