distributed
Consider using object pools for internal state keeping
Dask creates many small objects, and not just for very large graphs; this is pretty much built in. It is one of, if not the, most prominent reasons why dask was originally built as a "tuple of tuple of tuple ..." machinery. Even the scheduler internally only adopted custom classes a couple of years ago, because instantiation can be very costly. However, modern Python versions have gotten much better at managing this overhead, so it is negligible for most classes.
The one type of object we're still affected by, both in terms of memory and in terms of runtime, is sets. Yes, sets! (I might share profiles, but this issue is mostly to preserve an idea.) One way to work around the instantiation cost of sets is an object pool design: effectively, we'd "disable" garbage collection and resurrect objects on finalization in a way that allows us to reuse them.
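To get a feel for the cost the pool would be avoiding, here is a rough micro-benchmark (the helper names are purely illustrative, and absolute numbers will vary by machine and set size) comparing allocating a fresh set on every iteration against clearing and refilling a single reused object:

```python
import timeit

def fresh_sets(n: int, size: int) -> None:
    # Allocate a brand-new set object on every iteration.
    for _ in range(n):
        s = set(range(size))

def reused_set(n: int, size: int) -> None:
    # Reuse one set object: clear it and refill it in place.
    s = set()
    for _ in range(n):
        s.clear()
        s.update(range(size))

n, size = 10_000, 16
t_fresh = timeit.timeit(lambda: fresh_sets(n, size), number=5)
t_reuse = timeit.timeit(lambda: reused_set(n, size), number=5)
print(f"fresh: {t_fresh:.3f}s  reuse: {t_reuse:.3f}s")
```

The reuse path skips the allocation/deallocation churn, which is essentially what the pool below tries to capture for real workloads.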
A minimal version of this could look like the following:
```python
import sys


class SetPool:
    def __init__(self):
        self._sets = []

    def add(self, obj):
        # Empty the set before storing it for later reuse.
        obj.clear()
        self._sets.append(obj)

    def get(self):
        try:
            return self._sets.pop()
        except IndexError:
            return None

    def stored_size(self):
        return sum(map(sys.getsizeof, self._sets))


globalpool = SetPool()


class PooledSet(set):
    def __new__(cls, *iterables):
        # Reuse a pooled set if one is available; since the returned
        # object is an instance of cls, set.__init__ then populates it
        # with the passed iterable.
        obj = globalpool.get()
        if obj is not None:
            return obj
        return super().__new__(cls, *iterables)

    def __del__(self):
        # Resurrect the set instead of letting it be deallocated.
        print(f"Resurrecting {id(self)}")
        globalpool.add(self)
```
This could be expanded as needed, e.g. by hinting at whether a requested set will be empty, very large, or somewhat normal.
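One way to sketch that hinting idea (bucket names and the `HintedSetPool`/`release` API are made up for illustration) is to keep a free list per size hint. This variant also uses explicit release rather than `__del__`, since per PEP 442 a Python finalizer runs at most once per object, so `__del__`-based resurrection only works for an object's first trip through the pool:

```python
class HintedSetPool:
    def __init__(self):
        # One free list per size hint.
        self._buckets = {"empty": [], "normal": [], "large": []}

    def get(self, hint="normal"):
        # Fall back to a fresh set when the bucket is empty.
        try:
            return self._buckets[hint].pop()
        except IndexError:
            return set()

    def release(self, obj, hint="normal"):
        # Explicit release instead of relying on __del__: finalizers
        # run at most once per object (PEP 442), so resurrection in
        # __del__ stops working after the first reuse cycle.
        obj.clear()
        self._buckets[hint].append(obj)


pool = HintedSetPool()
s = pool.get("large")    # bucket is empty, so this is a fresh set
s.update(range(1000))
pool.release(s, "large")
reused = pool.get("large")
assert reused is s       # the same object comes back, emptied
```

Whether clearing actually preserves a large set's allocated table in CPython would need checking; if it doesn't, the "large" bucket buys less than it seems.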
I haven't tried out what the actual impact would be, but it's a fun concept that could help if we actually want or need to optimize for this.