marshmallow
marshmallow copied to clipboard
[RFC] set_class = OrderedSet by default?
I assume the performance impact would be minimal. From a quick look, it would only affect schema instantiation.
The benefit is that when using Python 3.7+, all schemas would be ordered. They still wouldn't return OrderedDict
instances unless ordered
Meta is passed, but they would keep fields declaration order in their output and in their schema list, so that order would be respected in API responses and apispec users would get their fields documented in declaration order.
Currently, users can achieve this in a base schema with
class OrderedBaseSchema(Schema):
set_class = OrderedSet
My gut feeling is that this change would have no functional downside and a negligible perf impact at field init (less that we care about when adding a new feature), while using ordered=True
does have an impact on serialization performance (I didn't measure it) and functionality (equality test depends on order), hence my proposal. There would still be an interest in allowing the use of OrderedDict
.
If we go this route, we could use OrderedSet
by default in marshmallow 3.x, and in marshmallow 4 perhaps rework the ordered
feature because schemas would already be ordered so it would be specifically for users really needing an OrderedDict
instance. Perhaps drop the ordered
Meta option and let the user specify a dict class of their choice another way. But that's another story.
See https://github.com/marshmallow-code/flask-smorest/issues/228#issuecomment-777836226.
I'm not opposed to this, so long as there is no perf impact (I think your intuition on this is right).
Or since this is easily achieved in user code thanks to a base schema as shown above, we postpone this to marshmallow 4 as described in my last paragraph.
Adding v4 milestone for now.
using
ordered=True
does have an impact on serialization performance
I believe that is caused by OrderedDict
. None of the OrderedSet
instance attributes are ever accessed during dump()
, only the OrderedDict
dump_fields
. Even though dict
is ordered now, it is still faster than OrderedDict
. Waiting until 3.7+ would avoid using OrderedDict
also.
OrderedDict Benchmark
#!/usr/bin/env python3.9
from datetime import datetime
from collections import OrderedDict
def run(D):
A = [('a', None), ('b', None), ('r', None), ('c', None), ('d', None)]
B = [('1', None), ('2', None), ('3', None)]
t0 = datetime.now()
for i in range(5000000):
s = D(A)
s['e'] = None
del s['e']
s.update(B)
t = datetime.now() - t0
return t
a = run(dict)
b = run(OrderedDict)
m = min(a, b)
print(f'dict\t\t{a}\t{100*(a/m-1):+.0f}%')
print(f'OrderedDict\t{b}\t{100*(b/m-1):+.0f}%')
dict 0:00:02.280595 +0%
OrderedDict 0:00:04.608753 +102%
OrderedSet
is significantly slower than set
.
OrderedSet Benchmark
#!/usr/bin/env python3.9
from datetime import datetime
from marshmallow.orderedset import OrderedSet
def run(D, i=200000):
A = ['a', 'b', 'c', 'd']
B = D(['1', '2', '3'])
t0 = datetime.now()
for _ in range(i):
s = D(A)
s.add('e')
s.remove('e')
s &= B
return datetime.now() - t0
a = run(set)
b = run(OrderedSet)
m = min(a, b)
print(f'set\t\t{a}\t{100*(a/m-1):+.0f}%')
print(f'OrderedSet\t{b}\t{100*(b/m-1):+.0f}%')
set 0:00:00.065251 +0%
OrderedSet 0:00:01.575174 +2314%
If an application has schema instantiation in its hot loop I think the vast majority of the execution time is going to be consumed by memory allocation. +1 for dropping (or deprecating + ignoring) ordered
in 4.0 and using OrderedSet
(and dict
). I have read that there are more performant implementations of OrderedSet
out there that are wrappers around the 3.7+ ordered dict
.
I believe that is caused by
OrderedDict
. None of theOrderedSet
instance attributes are ever accessed duringdump()
, only theOrderedDict
dump_fields
.
Yes, this is exactly what I meant. Hence my proposal. In 4.0, I'd drop ordered
and just allow setting set and dict classes as class attributes. Meanwhile, we can already use OrderedSet
by default.
@lafrech I was intrigued by https://github.com/marshmallow-code/marshmallow/pull/1891#issuecomment-952716370, so I took a stab at replacing set_class
with list
. The only set operations we use are unique, intersection, and difference. I implemented them as generators with a set
membership check in 4-5 line utilities. Some operations may be more performant than the existing unordered set
logic, because they only build 1 set instead of (up to) 3.
https://github.com/marshmallow-code/marshmallow/compare/dev...deckar01:1744-list-attributes
@deckar01 Nice.
On the short run (marshmallow 3.x) we'll probably be keeping set_class
if only for backwards compatibility. There seems to be consensus with this so we may go on with #1896. This can wait until we drop Python 3.6.
The question of whether we keep vendoring the OrderedSet
recipe or we import it from a 3rd-party package is independent (#1891).
Later on (marshmallow 4), we can either keep set_class
as a class property, either hide it as an implementation detail. Then we may consider switching to an implementation such as yours. I have no strong feeling. Some would call that reinventing the wheel, some would like the simplicity. The code is small enough to be error-proofed, but arguably it is a little less straight-forward when used from Schema
.
+1 for this. It will become a key issue when the user generates the local OpenAPI spec file with apispec while the spec output is not deterministic (https://github.com/apiflask/apiflask/issues/373).