flake8-bugbear icon indicating copy to clipboard operation
flake8-bugbear copied to clipboard

Recommend Counter() in place of defaultdict(int)

Open peterjc opened this issue 2 years ago • 5 comments
trafficstars

I am suggesting a new check to recommend replacing defaultdict(int) with Counter(), both of which are from the standard library collections, due to the former having a major gotcha with memory usage.

Consider this test case:

from collections import Counter
from collections import defaultdict

# Profiled using;
# $ mprof run defaultdict_leak.py && mprof plot
if True:
    # Apparent memory leak
    # Over 700MB using Python 3.9.13 on macOS
    # Over 900MB using Python 3.9.15 on Linux
    counts = defaultdict(int)
else:
    # Low and flat memory
    # about 16MB using Python 3.9.13 on macOS
    # about 19MB using Python 3.9.15 on Linux
    counts = Counter()

# Setup some sparse data
for x in [12,34,56,78,90]:
    counts[str(x)] = 123 + x

threshold = 100
excesses = 0
for x in range(int(1e7)):
    if threshold < counts[str(x)]:
        excesses += counts[str(x)]
print(f"Sum of entries exceeding the threshold {threshold} is {excesses}")

The apparent memory leak with defaultdict is due to the documented behaviour of recording the missing keys in the dictionary with the default value, https://docs.python.org/3/library/collections.html#collections.defaultdict says:

... provide a default value for the given key, this value is inserted in the dictionary for the key, and returned.

...

When each key is encountered for the first time, it is not already in the mapping; so an entry is automatically created using the default_factory ...

I assume I'm not the only person to have used defaultdict without appreciating this, and in the case of defaultdict(int) there is good alternative in Counter()

peterjc avatar Dec 16 '22 12:12 peterjc

Howdy - Good suggestion. TIL about Counter and I use defaultdict(int) a lot. With it being a dict subclass what fixes the memory? https://docs.python.org/3/library/collections.html#collections.Counter - Couldn't see anything noted here ... I guess the bug is in defaultdict itself and not dict (otherwise there would be leaks all over python as dicts are used everywhere internally).

Has the memory problem been fixed in later pythons (i.e. in >= 3.10) with defaultdict? If not, has a bug been raised? I'm torn with this one as it feels we could get it fixed in cpython, but maybe recommending it could be worth it. Would love others opinions here ...

cooperlees avatar Dec 16 '22 16:12 cooperlees

I've not confirmed on recent Python, but assume that like Python 3.9 they do the same - i.e. they probably behave the same way as per the current documentation. I don't think it is a bug, just a potentially surprising feature.

I can understand if the calling factory function to make default objects were expensive then the current behaviour may have advantages, but with defaultdict(int) then I see no benefit - only a memory leak (the unwanted keys and zero values).

peterjc avatar Dec 16 '22 17:12 peterjc

I'm happy to help with implementing this one. As it's not clear bug and just a recommendation, maybe it should be disabled by default?

dejvidq avatar Mar 16 '23 08:03 dejvidq

If this is accepted as in scope, I agree it falls under the B9## opinionated warnings rather than being on by default. Thanks!

peterjc avatar Mar 16 '23 11:03 peterjc

Yup - I'll accept an opinionated for this - Lets just make sure we share the memory savings etc. - Thanks!

cooperlees avatar Mar 16 '23 16:03 cooperlees

Please have a look at #489 if it's what you meant.

I checked the code given below on python 3.10 and 3.12 and in both cases using Counter instead of defaultdict(int) resulted in reducing memory usage from ~1000MB to flat ~20 MB

Btw. I'm sorry it took me so much time to finally have a look at it 😞

dejvidq avatar Aug 15 '24 21:08 dejvidq

Implemented in #489

cooperlees avatar Aug 20 '24 01:08 cooperlees