flake8-bugbear
flake8-bugbear copied to clipboard
Recommend Counter() in place of defaultdict(int)
I am suggesting a new check to recommend replacing defaultdict(int) with Counter(), both of which are from the standard library collections, due to the former having a major gotcha with memory usage.
Consider this test case:
from collections import Counter
from collections import defaultdict
# Profiled using;
# $ mprof run defaultdict_leak.py && mprof plot
if True:
# Apparent memory leak
# Over 700MB using Python 3.9.13 on macOS
# Over 900MB using Python 3.9.15 on Linux
counts = defaultdict(int)
else:
# Low and flat memory
# about 16MB using Python 3.9.13 on macOS
# about 19MB using Python 3.9.15 on Linux
counts = Counter()
# Setup some sparse data
for x in [12,34,56,78,90]:
counts[str(x)] = 123 + x
threshold = 100
excesses = 0
for x in range(int(1e7)):
if threshold < counts[str(x)]:
excesses += counts[str(x)]
print(f"Sum of entries exceeding the threshold {threshold} is {excesses}")
The apparent memory leak with defaultdict is due to the documented behaviour of recording the missing keys in the dictionary with the default value, https://docs.python.org/3/library/collections.html#collections.defaultdict says:
... provide a default value for the given key, this value is inserted in the dictionary for the key, and returned.
...
When each key is encountered for the first time, it is not already in the mapping; so an entry is automatically created using the default_factory ...
I assume I'm not the only person to have used defaultdict without appreciating this, and in the case of defaultdict(int) there is good alternative in Counter()
Howdy - Good suggestion. TIL about Counter and I use defaultdict(int) a lot. With it being a dict subclass what fixes the memory? https://docs.python.org/3/library/collections.html#collections.Counter - Couldn't see anything noted here ... I guess the bug is in defaultdict itself and not dict (otherwise there would be leaks all over python as dicts are used everywhere internally).
Has the memory problem been fixed in later pythons (i.e. in >= 3.10) with defaultdict? If not, has a bug been raised? I'm torn with this one as it feels we could get it fixed in cpython, but maybe recommending it could be worth it. Would love others opinions here ...
I've not confirmed on recent Python, but assume that like Python 3.9 they do the same - i.e. they probably behave the same way as per the current documentation. I don't think it is a bug, just a potentially surprising feature.
I can understand if the calling factory function to make default objects were expensive then the current behaviour may have advantages, but with defaultdict(int) then I see no benefit - only a memory leak (the unwanted keys and zero values).
I'm happy to help with implementing this one. As it's not clear bug and just a recommendation, maybe it should be disabled by default?
If this is accepted as in scope, I agree it falls under the B9## opinionated warnings rather than being on by default. Thanks!
Yup - I'll accept an opinionated for this - Lets just make sure we share the memory savings etc. - Thanks!
Please have a look at #489 if it's what you meant.
I checked the code given below on python 3.10 and 3.12 and in both cases using Counter instead of defaultdict(int) resulted in reducing memory usage from ~1000MB to flat ~20 MB
Btw. I'm sorry it took me so much time to finally have a look at it 😞
Implemented in #489