marshmallow icon indicating copy to clipboard operation
marshmallow copied to clipboard

[RFC] set_class = OrderedSet by default?

Open lafrech opened this issue 3 years ago • 6 comments

I assume the performance impact would be minimal. From a quick look, it would only affect schema instantiation.

The benefit is that when using Python 3.7+, all schemas would be ordered. They still wouldn't return OrderedDict instances unless ordered Meta is passed, but they would keep fields declaration order in their output and in their schema list, so that order would be respected in API responses and apispec users would get their fields documented in declaration order.

Currently, users can achieve this in a base schema with

class OrderedBaseSchema(Schema):
    set_class = OrderedSet

My gut feeling is that this change would have no functional downside and a negligible perf impact at field init (less that we care about when adding a new feature), while using ordered=True does have an impact on serialization performance (I didn't measure it) and functionality (equality test depends on order), hence my proposal. There would still be an interest in allowing the use of OrderedDict.

If we go this route, we could use OrderedSet by default in marshmallow 3.x, and in marshmallow 4 perhaps rework the ordered feature because schemas would already be ordered so it would be specifically for users really needing an OrderedDict instance. Perhaps drop the ordered Meta option and let the user specify a dict class of their choice another way. But that's another story.

See https://github.com/marshmallow-code/flask-smorest/issues/228#issuecomment-777836226.

lafrech avatar Feb 12 '21 00:02 lafrech

I'm not opposed to this, so long as there is no perf impact (I think your intuition on this is right).

sloria avatar Mar 18 '21 03:03 sloria

Or since this is easily achieved in user code thanks to a base schema as shown above, we postpone this to marshmallow 4 as described in my last paragraph.

Adding v4 milestone for now.

lafrech avatar Apr 08 '21 15:04 lafrech

using ordered=True does have an impact on serialization performance

I believe that is caused by OrderedDict. None of the OrderedSet instance attributes are ever accessed during dump(), only the OrderedDict dump_fields. Even though dict is ordered now, it is still faster than OrderedDict. Waiting until 3.7+ would avoid using OrderedDict also.

OrderedDict Benchmark
#!/usr/bin/env python3.9

from datetime import datetime
from collections import OrderedDict


def run(D):
    A = [('a', None), ('b', None), ('r', None), ('c', None), ('d', None)]
    B = [('1', None), ('2', None), ('3', None)]
    t0 = datetime.now()
    for i in range(5000000):
        s = D(A)
        s['e'] = None
        del s['e']
        s.update(B)
    t = datetime.now() - t0
    return t


a = run(dict)
b = run(OrderedDict)
m = min(a, b)

print(f'dict\t\t{a}\t{100*(a/m-1):+.0f}%')
print(f'OrderedDict\t{b}\t{100*(b/m-1):+.0f}%')
dict            0:00:02.280595  +0%
OrderedDict     0:00:04.608753  +102%

OrderedSet is significantly slower than set.

OrderedSet Benchmark
#!/usr/bin/env python3.9

from datetime import datetime
from marshmallow.orderedset import OrderedSet


def run(D, i=200000):
    A = ['a', 'b', 'c', 'd']
    B = D(['1', '2', '3'])
    t0 = datetime.now()
    for _ in range(i):
        s = D(A)
        s.add('e')
        s.remove('e')
        s &= B
    return datetime.now() - t0


a = run(set)
b = run(OrderedSet)
m = min(a, b)

print(f'set\t\t{a}\t{100*(a/m-1):+.0f}%')
print(f'OrderedSet\t{b}\t{100*(b/m-1):+.0f}%')
set             0:00:00.065251  +0%
OrderedSet      0:00:01.575174  +2314%

If an application has schema instantiation in its hot loop I think the vast majority of the execution time is going to be consumed by memory allocation. +1 for dropping (or deprecating + ignoring) ordered in 4.0 and using OrderedSet (and dict). I have read that there are more performant implementations of OrderedSet out there that are wrappers around the 3.7+ ordered dict.

deckar01 avatar Oct 26 '21 18:10 deckar01

I believe that is caused by OrderedDict. None of the OrderedSet instance attributes are ever accessed during dump(), only the OrderedDict dump_fields.

Yes, this is exactly what I meant. Hence my proposal. In 4.0, I'd drop ordered and just allow setting set and dict classes as class attributes. Meanwhile, we can already use OrderedSet by default.

lafrech avatar Oct 27 '21 07:10 lafrech

@lafrech I was intrigued by https://github.com/marshmallow-code/marshmallow/pull/1891#issuecomment-952716370, so I took a stab at replacing set_class with list. The only set operations we use are unique, intersection, and difference. I implemented them as generators with a set membership check in 4-5 line utilities. Some operations may be more performant than the existing unordered set logic, because they only build 1 set instead of (up to) 3.

https://github.com/marshmallow-code/marshmallow/compare/dev...deckar01:1744-list-attributes

deckar01 avatar Oct 27 '21 15:10 deckar01

@deckar01 Nice.

On the short run (marshmallow 3.x) we'll probably be keeping set_class if only for backwards compatibility. There seems to be consensus with this so we may go on with #1896. This can wait until we drop Python 3.6.

The question of whether we keep vendoring the OrderedSet recipe or we import it from a 3rd-party package is independent (#1891).

Later on (marshmallow 4), we can either keep set_class as a class property, either hide it as an implementation detail. Then we may consider switching to an implementation such as yours. I have no strong feeling. Some would call that reinventing the wheel, some would like the simplicity. The code is small enough to be error-proofed, but arguably it is a little less straight-forward when used from Schema.

lafrech avatar Nov 08 '21 22:11 lafrech

+1 for this. It will become a key issue when the user generates the local OpenAPI spec file with apispec while the spec output is not deterministic (https://github.com/apiflask/apiflask/issues/373).

greyli avatar Dec 24 '22 04:12 greyli