Are `__slots__` benefits still "significant"?
Documentation
The `__slots__` documentation states:
> The space saved over using `__dict__` can be significant. Attribute lookup speed can be significantly improved as well.
In meticulous benchmarks, I've found the benefits (both memory and lookup speed) to be more modest in Python versions 3.11 and later. Perhaps the documentation should be updated to reflect this? Or maybe I'm missing something? Details below.
Memory Savings
Due to the optimizations to object layout introduced in Python 3.11, the memory savings from using __slots__ seem to be reduced. The exact savings depend on the number of attributes defined in the class. Below is a table showing the lower and upper bounds of memory savings of using __slots__ over __dict__ (in bytes per object instance).
| version | min. memory savings | max memory savings |
|---|---|---|
| 3.9 | 80 | 216 |
| 3.10 | 80 | 216 |
| 3.11 | 40 | 64 |
| 3.12 | 32 | 56 |
| 3.13 | 40 | 64 |
(see bottom of the post for the benchmarking code used to generate these numbers)
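For readers who want a quick local check, here is a rough sketch of one way to estimate per-instance memory with `tracemalloc`. This is not the benchmark from the linked repo; `WithDict`, `WithSlots`, and `per_instance_bytes` are hypothetical names used only for illustration:

```python
import tracemalloc

class WithDict:
    def __init__(self):
        self.a, self.b, self.c = 1, 2, 3

class WithSlots:
    __slots__ = ("a", "b", "c")
    def __init__(self):
        self.a, self.b, self.c = 1, 2, 3

def per_instance_bytes(cls, n=100_000):
    """Rough average bytes allocated per instance (includes list overhead)."""
    tracemalloc.start()
    objs = [cls() for _ in range(n)]
    size, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    del objs
    return size / n

print("dict :", per_instance_bytes(WithDict))
print("slots:", per_instance_bytes(WithSlots))
```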
Lookup speed
Two observations:
- The speed boost is modest on Python 3.11 and later.
- The speed boost is negligible when using an optimized Python build (`--enable-optimizations --with-lto`).
| version | speed boost | speed boost (optimized Python build) |
|---|---|---|
| 3.9 | 62% | 4% |
| 3.10 | 52% | 8% |
| 3.11 | 13% | 4% |
| 3.12 | 15% | 0% |
| 3.13 | 3% | 3% |
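For context, here is a minimal `timeit` sketch of the kind of attribute-lookup comparison being discussed. It is not the setup from the linked repo, and the class names `D` and `S` are made up:

```python
from timeit import timeit

class D:
    def __init__(self):
        self.x = 1

class S:
    __slots__ = ("x",)
    def __init__(self):
        self.x = 1

d, s = D(), S()
# Each call runs one million lookups; expect only small differences on 3.11+.
print("dict :", timeit("obj.x", globals={"obj": d}))
print("slots:", timeit("obj.x", globals={"obj": s}))
```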
The benchmarking code
The code can be found here: https://github.com/ariebovenberg/slots-bench
edit: formatting
If we agree to change the docs, I'd be happy to submit a PR
64 bytes can still be significant, for example with thousands of small objects. It's good to see that the overhead is decreasing, though.
It may be worth changing the attribute lookup point, if the results are truly within noise now.
Do we have good benchmarks for __slots__ in the pyperformance suite? This could be a good candidate for both speed and memory.
After a quick scan of pyperformance, I don't see any slots benchmarks (although __slots__ are mentioned in a page about descriptors). I'd be happy to adjust the benchmarks for inclusion in pyperformance.
We use __slots__ in the float benchmark in pyperformance https://github.com/python/pyperformance/blob/main/pyperformance/data-files/benchmarks/bm_float/run_benchmark.py
@Fidget-Spinner Ah yes, but perhaps a dedicated benchmark comparing __slots__ to __dict__ in the context of attribute lookup might be appropriate? What do you think?
Regarding the docs, how about placing a "note" box like this after the "significant" remark:
> [!NOTE]
> **Modern Python (3.11+) Considerations**
>
> Python 3.11 introduced substantial optimizations to the underlying object model, making `__dict__` more efficient. This reduces the previously dramatic gains from `__slots__`:
> - Memory: Expect memory savings of roughly 32-64 bytes per instance, rather than the prior 80-240 bytes range.
> - Performance: Attribute lookup speed improvements are typically marginal (0-5%), often falling within the margin of error for typical benchmarks when using an optimized Python build.
Using a dedicated "Note" section offers two key advantages over changing the main text:
- Explains the "Why": It clearly explains that the updated figures are due to optimizations in Python 3.11+, providing valuable context.
- Maintains Relevance for Older Pythons: It keeps the original text meaningful for developers using Python 3.10 and earlier, where `__slots__` offers more dramatic benefits. This prevents misleading users who might not look up older documentation versions.
It can be useful for namedtuples & data classes if they use it. 64 bytes is still worth saving (for data points, for instance). So I wouldn't say "expect X to be saved instead of the prior Y" but rather "this saves memory", and we could perhaps remove the "significant" and/or say that it depends. For instance, for small classes with few attributes and/or namedtuples, it's quite good (I haven't run the numbers, but I would say it's quite good; hopefully I'm not wrong). For huge classes, it's not worth it, I guess?
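As an aside, here is a tiny sketch of opting into slots for data classes; the `slots=True` flag exists since Python 3.10, and `Point` is just a made-up example class:

```python
from dataclasses import dataclass

@dataclass(slots=True)  # generates a slotted class (Python 3.10+)
class Point:
    x: float
    y: float

p = Point(1.0, 2.0)
print(hasattr(p, "__dict__"))  # False: no per-instance dict
```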
Using a dedicated "Note" section offers two key advantages over changing the main text:
The problem with notes is that people will read them first, before reading the surrounding text. A note is good, but unfortunately it draws attention too much and sometimes highlights something that wasn't meant to be highlighted that much.
Concerning attribute access, I see that your benchmark consists of classes with 10 attributes. Sometimes classes have many more attributes (maybe 30) or very few attributes (3), in which case I expect the results to be different (I don't know by how much though, maybe I'm overthinking). And also, sometimes classes inherit from others, so we should also benchmark the effect of inherited slots.
👍 Agree that there will always be certain situations they're useful. Especially the empty __slots__ = () is useful (or even necessary) when inheriting from builtin types.
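A small illustrative example of that empty-`__slots__` case (class names are hypothetical): a subclass of a builtin like `tuple` only stays dict-free if it declares `__slots__ = ()` itself:

```python
class Point(tuple):
    __slots__ = ()          # keeps instances free of a per-instance __dict__

class FatPoint(tuple):      # no __slots__: every instance grows a __dict__
    pass

print(hasattr(Point((1, 2)), "__dict__"))     # False
print(hasattr(FatPoint((1, 2)), "__dict__"))  # True
```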
The problem with notes is that people will read them first before reading the surrounding text.
Hah, this is exactly why I'm in favor of this of course 😁. Having spent so much time tinkering around with slots, the news that __slots__ benefits are reduced should IMHO jump out.
Sometimes classes have many more attributes (maybe 30) or very few attributes (3) in which case I expect the results to be different
I tried for 2 and 26 attributes, and here are the results. Note I didn't go as far as 30, since that's the cap up to which the new object-layout optimizations apply. So above 30 attributes the differences are significant.
26 attributes
| Benchmark | getattr-dict-opt | getattr-slots-opt |
|---|---|---|
| getattr (3.13.3) | 4.24 ns | 4.15 ns: 1.02x faster |
| getattr (3.12.11) | 4.28 ns | 4.42 ns: 1.03x slower |
| getattr (3.11.13) | 4.79 ns | 4.60 ns: 1.04x faster |
| getattr (3.10.18) | 14.8 ns | 13.1 ns: 1.13x faster |
| getattr (3.9.23) | 13.2 ns | 11.6 ns: 1.13x faster |
2 attributes
| Benchmark | getattr-dict-opt | getattr-slots-opt |
|---|---|---|
| getattr (3.13.3) | 4.35 ns | 4.19 ns: 1.04x faster |
| getattr (3.12.11) | 4.55 ns | 4.88 ns: 1.07x slower |
| getattr (3.11.13) | 4.57 ns | 4.36 ns: 1.05x faster |
| getattr (3.10.18) | 14.6 ns | 13.4 ns: 1.09x faster |
| getattr (3.9.23) | 13.5 ns | 12.6 ns: 1.07x faster |
we should also benchmark the effect of inherited slots.
I tried this as well, using this setup:
```python
class _Base1:
    __slots__ = ("a", "b", "c", "d")

class _Base2(_Base1):
    __slots__ = ("e", "f", "g", "h")

class _Base3(_Base2):
    __slots__ = ()

class A(_Base3):
    __slots__ = ("i", "j")
```
but didn't find any meaningful differences.
edit: typo
An extra remark about inheritance: if done improperly (i.e. mixing `__dict__` and `__slots__`, see #135385), the memory footprint and lookup speed are affected. But such inheritance is in most cases an oversight, caused by forgetting to set `__slots__` somewhere.
edit: typo
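A minimal sketch of that pitfall (hypothetical class names): if any base class lacks `__slots__`, instances of the slotted subclass still get a `__dict__`:

```python
class Base:                 # forgot __slots__ = ()
    pass

class Child(Base):
    __slots__ = ("a", "b")

c = Child()
print(hasattr(c, "__dict__"))  # True: most of the memory benefit is lost
```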
Note I didn't go as far as 30, since that's the cap at which the object layout changes kick in. So above 30 attributes the differences are significant.
In this case, we can keep the "significant" part. However, I'm not fond of indicating the threshold at which the number of attributes becomes relevant, because it's an implementation detail. But OTOH, for very small classes, the attribute access improvements are really not significant.
So, here's what I can suggest:
- We should mention that the memory improvements are noticeable, especially for classes with few attributes. We wouldn't specify what "few" means.
- We should mention that this may slightly improve attribute access for medium-sized classes. With 2 attributes (namely, with an already very small `__dict__` compared to `object`), we only gain 4%. I wouldn't consider it noise, but it's no longer significant. We should no longer say that the improvements are significant, except for classes with more than 30 attributes. Ideally, we should make that "30" a dynamic variable (using substitutions and rst prologs), but that may be overkill. If possible, find where this "30" is coming from and indicate in the C files that this number must be kept in sync with the docs.
As for whether to use a note or not, I think we can just rephrase the introductory sentence and add the information I just mentioned. It will be less intrusive while still being in a place where the reader's attention is still up.
Would a CPython implementation detail note be more appropriate? Inlined dicts are not a guarantee of the language; for other implementations, `__slots__` may continue to have large benefits for a small number of attributes.
A CPython implementation detail note would be good if other implementations may benefit from `__slots__`.
I'm not fond of indicating the threshold at which the number of attributes becomes relevant, because it's an implementation detail
I can see the advantage of not mentioning exact numbers in the docs. I'm fine either way. In any case I'll leave my benchmarking repo open if anybody is curious.
Would a CPython implementation detail note be more appropriate?
That's an interesting option. My only concern is that `__slots__` fundamentally relies on implementation details, so it feels disingenuous to me to call the benefits "significant" as a blanket statement. IIRC PyPy doesn't benefit from `__slots__` at all. I'm not sure what GraalPy does.
The interpreter continues to be in flux, so the relative speeds between dotted access with and without slots changes over time. Currently, Tools/scripts/var_access_benchmark shows about a 10% improvement on a stock macOS build.
Recommend this be closed. The word "significant" is hedged by the words "can be improved" instead of "is improved". We do know that some teams at FB/Instagram found it significant enough to use slots by default for performance reasons.
Conceptually, it should always be possible for slots to have a performance advantage over instance dicts. One reason is that space savings tend to improve cache utilization; on modern systems, smaller objects tend to be faster. Another reason is that the guaranteed semantics require that slots be checked before looking in the instance dict:
```python
def object_getattribute(obj, name):
    "Emulate PyObject_GenericGetAttr() in Objects/object.c"
    null = object()
    objtype = type(obj)
    cls_var = find_name_in_mro(objtype, name, null)
    descr_get = getattr(type(cls_var), '__get__', null)   # <--- CHECKED FIRST
    if descr_get is not null:
        if (hasattr(type(cls_var), '__set__')
            or hasattr(type(cls_var), '__delete__')):
            return descr_get(cls_var, obj, objtype)        # data descriptor
    if hasattr(obj, '__dict__') and name in vars(obj):     # <--- CHECKED LATER
        return vars(obj)[name]                             # instance variable
    if descr_get is not null:
        return descr_get(cls_var, obj, objtype)            # non-data descriptor
    if cls_var is not null:
        return cls_var                                     # class variable
    raise AttributeError(name)
```
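To connect this to `__slots__`: slot attributes are exposed as data descriptors on the class, which is why they take the earlier branch above. A quick illustrative check (hypothetical class name):

```python
class S:
    __slots__ = ("x",)

print(type(S.x))                      # <class 'member_descriptor'>
print(hasattr(type(S.x), "__set__"))  # True: a data descriptor, found before any __dict__
```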
Side note: There are some additional benefits not mentioned.
- Slots can confer early error detection for misspelled attributes: `inst.pool_siz = inst.pool.size + 1`.
- Slots also help readability by making clear exactly which attributes are stored.
- Slots can specifically allow or disallow weak references.
- Slots is often used with `_private` variables to make objects more immutable: `__slots__ = ['_mu', '_sigma']`.
- Slots can be a dictionary that allows docstrings for the attributes, visible to `help()` (see the sketch after this list).
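A short sketch of that last point (hypothetical class name): when `__slots__` is a dict, the values become the docstrings of the generated descriptors and show up in `help()`:

```python
class Sample:
    __slots__ = {
        "mu": "mean of the distribution",
        "sigma": "standard deviation of the distribution",
    }

print(Sample.mu.__doc__)   # 'mean of the distribution'
help(Sample)               # the docstrings appear under the data descriptors
```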
Addenda: Lots of performance considerations in PyPy are completely different (it is a world where function call overhead is vastly reduced while generator/iterator performance is weak). It is okay to let that project document the differences.
The interpreter continues to be in flux, so the relative speeds between dotted access with and without slots changes over time.
Agree that the performance benefits of `__slots__` over `__dict__` remain. Would you be open to adding a note that these benefits are substantially smaller than they used to be (pre-3.11)? Why I'm pushing this: there’s an abundance of outdated articles about `__slots__`, and very little awareness of the optimizations in 3.11+. Readers of the documentation would be forgiven for mistakenly thinking Python’s `__dict__` is still just as “inefficient” as when these docs were written.
Side note: There are some additional benefits not mentioned.
Yes! Don't get me wrong I love slots 😉
edit: apologies, I see the ticket was closed as I was typing
Fwiw, the benchmarks are probably not representative enough to tell whether the attribute lookup speedup for `__slots__` is just 3% or bigger. (Not to mention that 3% may be considered a "significant speedup" by some players.)
This is because the benchmark only measures accesses to multiple attributes of a single object. We should take into consideration that a slotted object gets the pointer to the attribute "right away", while the non-slotted object needs to do one more memory dereference. In other words, in C this would look roughly like this:
- For `slotted.attr`: `PyObject* attr = (PyObject*)(&slotted + attr_offset)`
- For `notslotted.attr`: `PyObject* attr = (PyObject*)((*(values_dict*)(&notslotted + dict_values_offset)) + attr_offset)`
Now, taking this into account: if you had many, many objects spread across memory and accessed them (their attributes) randomly, the additional dereference would thrash (or utilize) the CPU cache much more than in the linked benchmark code. As a result, you will very likely see a bigger performance improvement for lookup speed than with the current benchmark.
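A rough, illustrative sketch of that scenario (class and function names are made up, and timings are only indicative): allocate many instances and access them in random order so the extra dereference is more likely to miss in cache:

```python
import random
import time

class D:
    def __init__(self):
        self.x = 1

class S:
    __slots__ = ("x",)
    def __init__(self):
        self.x = 1

def bench(cls, n=1_000_000):
    objs = [cls() for _ in range(n)]
    order = list(range(n))
    random.shuffle(order)           # defeat linear prefetching
    start = time.perf_counter()
    total = 0
    for i in order:
        total += objs[i].x
    return time.perf_counter() - start

print("dict :", bench(D))
print("slots:", bench(S))
```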