slow performance of Concrete/Annotable base classes due to object.__setattr__
The `Concrete` and `Annotable` classes in `grounds.py`, which I expect were written fairly early in ibis's development, appear to provide relatively slow performance, and I'd like to ask what other approaches could be considered.
Benchmarking & further details are below, but in summary:
- Using `object.__setattr__` is ~14-15x slower than direct attribute access and ~4x slower than using `setattr`.
- `Concrete`/`Annotable` are bases for almost every ibis operation, since they are bases for e.g. `Value` and `Node`.
- Due to the way ibis works (constructing and then rewriting expression trees), many objects need to be created for each expression (for some complex expressions we have built, this can be 10-100K objects).
- (My actual question 😄) What alternative approaches would the team be open to using to reduce this bottleneck? Some options that come to mind (I'm sure there are others I've not thought of):
  - (a) Use a "make instance immutable after `__init__`" type approach in `Immutable`, so that `setattr` could be used in `__init__` instead of `object.__setattr__`, which would give a 4x improvement (more memory though: one extra attribute per object). This probably needs the least rework of existing code (it would affect `Immutable`, `Annotable`, `Concrete`, possibly `Slotted`).
    - (implementation detail: change the `__setattr__` method on instances after init to stop "standard" `setattr` access)
  - (b) Use Python `dataclasses` to provide possibly-frozen annotable objects with defaults. This would need benchmarking to verify whether it is actually faster than the current approach, and the type checking done by `Annotable` would still need to be incorporated. It's not clear to what extent dataclasses could be a drop-in replacement for `Concrete`/`Annotable`/`Immutable`, so it might or might not need a wider refactor.
  - (c) Use the `koerce` package developed by @kszucs - see #10078.
  - (d) Use another similar package, e.g. pydantic. This would need benchmarking, evaluation of feature parity, etc.
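Option (a) could look roughly like the sketch below (hypothetical `Freezable`/`_freeze` names, not ibis code). One caveat I'd flag up front: swapping `__setattr__` on the *instance* after init does not actually work, because Python looks up dunder methods on the type, so the guard has to live on the class and check a per-instance flag. Whether this beats `object.__setattr__` in practice would need benchmarking, since every set now goes through a Python-level `__setattr__`:

```python
# Hypothetical sketch of option (a). Freezable/_freeze are invented names.
class Freezable:
    _frozen = False  # class-level default, shadowed per instance by _freeze()

    def __setattr__(self, name, value):
        # dunders are looked up on the type, so the guard must live here
        if self._frozen:
            raise AttributeError(f"can't set attribute {name!r}")
        object.__setattr__(self, name, value)

    def _freeze(self):
        # bypass our own guard to set the flag
        object.__setattr__(self, "_frozen", True)


class Point(Freezable):
    def __init__(self, x, y):
        self.x = x  # plain attribute syntax; guard is still off
        self.y = y
        self._freeze()  # the topmost __init__ must do this last
```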
I would be happy to work on putting together a PR for options a/b above if the team considers them viable. I would like to consult the opinions of people currently contributing/maintaining, e.g. @cpcloud, @NickCrews, @kszucs, @deepyaman, if you have time? For clarity, I worked on a couple of other recent PRs under the account name "hottwaj", which I've now renamed to this one, which resembles my actual name :) https://github.com/ibis-project/ibis/issues?q=is%3Apr+author%3AJonAnCla
[details] I came across this while digging into performance when building relatively large queries (hundreds of columns, thousands of operations), which can take 5-10s to construct on my laptop. We run some relatively complex ETL operations on "wide but short" tables, i.e. "small to medium" data (typically 100-1000 MB), and currently the ibis expression construction time is 10-50% of overall execution time. Not a deal breaker, but I would prefer it to be <5% if possible :)
A key bottleneck is this line of code, which uses `object.__setattr__` to set each attribute of subclasses of `Concrete`:
https://github.com/ibis-project/ibis/blob/main/ibis/common/grounds.py#L212
This setup is needed because `Concrete` is a subclass of `Immutable`, which blocks "normal" `setattr` usage in order to make instances "immutable" and hashable (they aren't actually immutable, just "more difficult to mutate").
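For reference, the pattern in question boils down to something like this (a simplified sketch, not the actual `grounds.py` code):

```python
# Simplified sketch of the current Immutable/Concrete pattern (not ibis's code)
class Immutable:
    __slots__ = ()

    def __setattr__(self, name, value):
        raise AttributeError("can't set attribute")


class Concrete(Immutable):
    __slots__ = ("a", "b")

    def __init__(self, a, b):
        # plain `self.a = a` would hit the blocking __setattr__ above,
        # so every attribute has to be set via the slower object.__setattr__
        object.__setattr__(self, "a", a)
        object.__setattr__(self, "b", b)
```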
Unfortunately, using `object.__setattr__` to set attribute values is ~15x slower than direct attribute access (and ~4x slower than `setattr`). See the outputs and the code snippet they're derived from below.
Timings (my laptop, Python 3.12, Ubuntu 24.04):

```text
Direct setattr:     10.2 ns per call
Using setattr:      37.2 ns per call (3.65x)
object.__setattr__: 150.0 ns per call (14.72x vs direct, 4.04x vs setattr)
```
Timing code snippet:

```python
import timeit

class Foo:
    pass

n = 1000000
reps = 5  # timeit.repeat's default number of repetitions

direct_setattr = sum(timeit.repeat('foo.a = 1',
    setup='foo = Foo()', globals={'Foo': Foo}, number=n)) / reps
using_setattr = sum(timeit.repeat('setattr(foo, "a", 1)',
    setup='foo = Foo()', globals={'Foo': Foo}, number=n)) / reps
object_setattr = sum(timeit.repeat('object.__setattr__(foo, "a", 1)',
    setup='foo = Foo()', globals={'object': object, 'Foo': Foo}, number=n)) / reps

print(f"Direct setattr: {direct_setattr/n*1e9:.1f} ns per call")
print(f"Using setattr: {using_setattr/n*1e9:.1f} ns per call ({using_setattr/direct_setattr:.2f}x)")
print(f"Object.__setattr__: {object_setattr/n*1e9:.1f} ns per call "
      f"({object_setattr/direct_setattr:.2f}x vs direct, {object_setattr/using_setattr:.2f}x vs setattr)")
```
A bit frustrating, but some further investigation shows that it's hard to make significant improvements, as there is also overhead in customising the init process, which is significant for objects with few attributes.
I tried the following variants:
```python
import dataclasses
from typing import Any


def _prevent_settattr(self, name: str, value: Any) -> None:
    raise AttributeError("can't set attribute")


class DirectAttrAccess:
    def __init__(self, a: int, b: int) -> None:
        self.a = a
        self.b = b


class Setattr:
    def __init__(self, a: int, b: int) -> None:
        setattr(self, 'a', a)
        setattr(self, 'b', b)


class PostInitImmutable:
    def __init__(self, a: int, b: int) -> None:
        self.a = a
        self.b = b
        # NB: dunder methods are looked up on the type, not the instance,
        # so this assignment does not actually block later attribute setting
        self.__setattr__ = _prevent_settattr


class PostInitImmutableSetattr:
    def __init__(self, a: int, b: int) -> None:
        setattr(self, 'a', a)
        setattr(self, 'b', b)
        self.__setattr__ = _prevent_settattr


class ImmutableDisabledSetattr:
    def __init__(self, a: int, b: int) -> None:
        object.__setattr__(self, 'a', a)
        object.__setattr__(self, 'b', b)

    __setattr__ = _prevent_settattr


@dataclasses.dataclass(frozen=True)
class FrozenDataclass:
    a: int
    b: int


@dataclasses.dataclass
class MutableDataclass:
    a: int
    b: int
```
Timing results:

```text
DirectAttrAccess:         268.9 ns per call
Setattr:                  418.2 ns per call (1.6x vs direct)
PostInitImmutable:        339.0 ns per call (1.3x vs direct, 0.8x vs setattr)
PostInitImmutableSetattr: 543.2 ns per call (2.0x vs direct, 1.3x vs setattr)
ImmutableDisabledSetattr: 860.0 ns per call (3.2x vs direct, 2.1x vs setattr)
FrozenDataclass:          906.4 ns per call (3.4x vs direct, 2.2x vs setattr)
MutableDataclass:         354.6 ns per call (1.3x vs direct, 0.8x vs setattr)
```
Timing code:

```python
import timeit

n = 1000000
reps = 5

def bench(cls):
    return sum(timeit.repeat('cls(1, 2)', globals={'cls': cls}, number=n)) / reps

direct_attr_access = bench(DirectAttrAccess)
setattr_class = bench(Setattr)

print(f"DirectAttrAccess: {direct_attr_access/n*1e9:.1f} ns per call")
print(f"Setattr: {setattr_class/n*1e9:.1f} ns per call ({setattr_class/direct_attr_access:.2f}x)")
for cls in (PostInitImmutable, PostInitImmutableSetattr,
            ImmutableDisabledSetattr, FrozenDataclass, MutableDataclass):
    timing = bench(cls)
    print(f"{cls.__name__}: {timing/n*1e9:.1f} ns per call "
          f"({timing/direct_attr_access:.2f}x vs direct, {timing/setattr_class:.2f}x vs setattr)")
```
In summary, improvements could be made by:

- switching to `PostInitImmutableSetattr` (~0.3x faster for a 2-attribute class)
  - but controlling when "freezing" occurs might become a bit fragile: using `metaclass.__call__` or `cls.__new__` to freeze after `__init__` led to much worse performance (I didn't include those variants in the timings above), so freezing must be done at the end of `__init__` to see any improvement. This makes it difficult to set up freezing where `super().__init__` is used, as the "topmost init function" has to call the freezing function when finished.
- ditching immutability completely:
  - using `MutableDataclass`: 0.9x faster
  - using `Setattr`: 0.5x faster
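A minimal illustration of that `super().__init__` fragility, using a hypothetical freeze-at-end-of-`__init__` mixin (all names invented): if a base class freezes at the end of its own `__init__`, a subclass can no longer set its extra attributes afterwards.

```python
# Hypothetical freeze-after-init mixin; names are invented for illustration.
class Freezable:
    _frozen = False

    def __setattr__(self, name, value):
        if self._frozen:
            raise AttributeError("can't set attribute")
        object.__setattr__(self, name, value)

    def _freeze(self):
        object.__setattr__(self, "_frozen", True)


class Base(Freezable):
    def __init__(self, a):
        self.a = a
        self._freeze()  # wrong place: only the topmost __init__ should freeze


class Child(Base):
    def __init__(self, a, b):
        super().__init__(a)  # the instance is frozen here already...
        self.b = b           # ...so this raises AttributeError
```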
I also tried the above on Python 3.14 (since the speed of things like building objects and setting attributes changes across versions; the timings above were for 3.12) and found:

- `PostInitImmutableSetattr` is 0.5x faster
- `MutableDataclass` is 1.5x faster
- `Setattr` is 0.8x faster
I expect that ditching immutability is not particularly desirable... I think that users who mess around with internal objects should probably not be surprised if things break or (more likely) do unexpected stuff, but I guess the current setup also helps provide internal checks that stuff within the library itself is correct...
Maybe another idea could be to make immutability optional so that users could disable it at their own risk? :)
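One hypothetical shape for such an opt-out (the `IBIS_IMMUTABLE` flag and `_set` helper are invented names, just a sketch): pick the behaviour once at import time, so that when immutability is disabled the per-attribute fast path is a plain `setattr` with no extra branching.

```python
# Hypothetical sketch: IBIS_IMMUTABLE and _set are invented names.
import os

# decided once at import time, e.g. from an environment variable
IBIS_IMMUTABLE = os.environ.get("IBIS_IMMUTABLE", "1") == "1"


class Immutable:
    if IBIS_IMMUTABLE:
        def __setattr__(self, name, value):
            raise AttributeError("can't set attribute")
    # else: no __setattr__ override at all, so attribute setting is normal


# constructors bind the matching setter once, up front
_set = object.__setattr__ if IBIS_IMMUTABLE else setattr


class Op(Immutable):
    def __init__(self, a):
        _set(self, "a", a)
```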
Can you investigate the "shape" of the object tree that is created, to get a sense of whether there are a few guilty parties causing ~90% of the problem? E.g. you say you are using wide tables: is that causing the branching factor of the tree to be high, so that even if the tree isn't very deep, it still has a lot of nodes? If so, then perhaps we could do some spot optimizations, such as lazily creating the column objects, or perhaps we could skip that work altogether? Or perhaps there is one particular ops class that we could specialize/optimize?
In general I can help support and review, but I don't have the motivation to really dive into the guts of this. I think I would say that any solution must keep immutability by default. If it isn't too ugly, maybe make it configurable, but I wouldn't be hopeful. I would also say that a 0.9x improvement isn't worth completely overhauling our implementation.
We can't get rid of immutability as a contract, as most or all of the internals rely on immutability for hashing purposes.
However, we might be able to change the operating principle of using object.__setattr__ to a more performant option.
This would be analogous to creating a class like

```python
class MyTuple(tuple):
    def __init__(self, x):
        self.x = x

    def __hash__(self):
        return hash((*self, self.x))
```
where users (and maybe internals) would have to be very careful not to mutate any attributes.
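To illustrate why the care is needed: mutating an attribute after construction silently changes the hash, so the object gets lost in any set or dict that already contains it.

```python
# Demonstrating the hazard of the MyTuple-style approach above.
class MyTuple(tuple):
    def __init__(self, x):
        self.x = x

    def __hash__(self):
        return hash((*self, self.x))


t = MyTuple((1, 2))
seen = {t}
assert t in seen      # fine so far
t.x = 99              # nothing stops this mutation...
assert t not in seen  # ...and now the set can no longer find t
```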
Thanks both, I'll definitely do a bit more investigation
@NickCrews you mentioned whether there are some specific places that could be targeted. A couple of key places I have a suspicion about are:

- `ops.Field`: for wide tables, N of these objects (N = number of columns) need to be created for each table-level operation (each mutate/select/join). Creating these very simple objects is comparatively expensive because they inherit from `Immutable`/`Concrete` as mentioned above. A possible candidate for specialisation.
- `rewrite_project_input`: this is called at the end of every `.select` or `.mutate` operation and involves selectively replacing some objects in the expression tree. For "simple" columns, though, it never does anything, so skipping it when it can be cheaply detected as not needed would be helpful.
I'll look into `rewrite_project_input` separately, and in the meantime come back on this if I can see a not-too-messy path forward.
Sorry, "a suspicion about" is the wrong way to put it. Both of the above are places where I see a bottleneck when profiling expression building.
Thanks @JonAnCla for looking into it, it is an interesting topic!
Just a quick note that `object.__setattr__` actually does an attribute lookup for the method each time, so preferably we should bind it to a name and use that directly, which gives a little speedup:
```python
__object_setattr__ = object.__setattr__


class ImmutableDisabledSetattr:
    def __init__(self, a: int, b: int) -> None:
        object.__setattr__(self, 'a', a)
        object.__setattr__(self, 'b', b)

    __setattr__ = _prevent_settattr


class ImmutableDisabledObjectSetattr:
    def __init__(self, a: int, b: int) -> None:
        __object_setattr__(self, 'a', a)
        __object_setattr__(self, 'b', b)

    __setattr__ = _prevent_settattr
```

```text
In [5]: %timeit ImmutableDisabledSetattr(1, 2)
177 ns ± 0.82 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [6]: %timeit ImmutableDisabledObjectSetattr(1, 2)
154 ns ± 0.876 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
```
I actually use this in koerce https://github.com/kszucs/koerce/blob/fa3f8dcfc56b798acf676a7dba310521e052439a/koerce/annots.py#L570-L573 where the Cython-transpiled code makes the difference more pronounced.
I worked on koerce to address several of the challenges you mention above, while keeping the existing + extended + fixed behaviour of ibis's core. I think it would be more reasonable to add object-modelling optimizations to koerce itself, since it provides additional benefits. Regarding pydantic: koerce is actually twice as fast https://github.com/kszucs/koerce?tab=readme-ov-file#performance (at least at the time of writing the readme).
We also do a lot of redundant traversal and object replacement during IR manipulation, which we should probably rework to be more similar to other IR rewrite systems, e.g. MLIR. This is complementary to the features available in koerce.
Thanks @kszucs. I distilled the following snippet out of koerce, covering just the "immutable object" type code, for comparison with the other examples above so I could benchmark it:
```python
%%cython
from typing import Any

import cython
from cython.cimports.cpython.object import PyObject_GenericSetAttr as __setattr__


def new_fast(cls: type, **kwargs: dict[str, Any]):
    this = cls.__new__(cls)
    for name, value in kwargs.items():
        __setattr__(this, name, value)
    return this


class Immutable:
    def __setattr__(self, name: str, value: Any) -> None:
        raise AttributeError("can't set attribute")
```
and for timing:

```python
%%timeit
new_fast(Immutable, a=1, b=2)
```
I got 290 ns per object, which is a touch faster than what I got for `MutableDataclass` and comparable with `DirectAttrAccess` (a standard class with `self.x = x` in `__init__`), i.e. the fastest in the table above. Unlike those two, this koerce/cython approach also has immutability.
Maybe, if an install-time switch to use this could be made without a lot of upheaval of the internals, something like this could be considered for inclusion in ibis (i.e. the option to have an "accelerated" build or a pure-Python build)? Though that might be quite a big maybe :) The rest of koerce seems like it would be a good performance improvement, but such a big change to these internals seems like it would be a big ask for maintainers at the moment.
Having looked at all this, what msgspec can do is pretty impressive. Snippet follows:
```python
import msgspec


class MsgspecDataclass(msgspec.Struct, frozen=True):
    a: int
    b: int
```

and the timing:

```python
%%timeit
MsgspecDataclass(1, 2)
```
This timed at 90 ns per object on the same hardware, about 3x faster than the fastest other approach to building objects so far, and it's immutable. So using msgspec for an immutable object replacement would be ~5x faster than the current approach taken in ibis.
I've not used msgspec before though, so I don't know its pitfalls - maybe someone else has thoughts. I could investigate in time if others consider it a viable option.
Thanks!
@jcrist is the author of msgspec and a fellow maintainer of ibis, so we can directly ask him :)