marshmallow
Investigate Serialization Performance
Profile schema serialization. Compare the JIT hooks proposed in #629 with other optimization techniques.
/cc @taion
Any profiling results that are shared here should be accompanied by the code used to generate them so that they can be verified and tested on other systems.
I did some casual profiling a while ago... it looks like you need to strip away a ridiculous amount of logic to get to toasted-marshmallow-level performance. Stuff like inlining the serialization loop and replacing the generic "get value" function with plain `getattr` only made up a relatively small portion of the gap. Even doing definitely-wrong things like dropping error handling didn't close much of it. I'll see if I still have the code around when I get back to my work computer on Monday.
I haven't had a chance to read through the toasted-marshmallow code, but I'm even more surprised now over how fast it is.
That said, it's pretty complicated, and it's hard to tell if it's in active use in many places, so I'm a bit scared to pull it in as a dependency for my code. But it speeds things up by a shocking amount.
cc @rowillia
For my testing I modified the benchmark to dump single objects rather than many, since this more closely reflects the use case of a server dumping on request.
Source: https://github.com/marshmallow-code/marshmallow/compare/dev...f35f908
Marshmallow (dev)

Toasted Marshmallow (latest)

toasted isn't just JITing marshmallow. The readme states "Even PyPy benefits from toastedmarshmallow!". If it were just a JIT, PyPy wouldn't show any significant improvement.
After inspecting the jit source, I identified a non-JIT optimization that provides a significant speedup and should be reasonable to maintain. `get_value`/`get_attribute` appears to account for the entire time difference between these two profiles. toasted avoids this overhead by probing the type of the data container once, then choosing a specialized module that plucks those attributes. This would likely only require adding 2-3 small methods and would not require any interface changes.
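To make the idea concrete, here is a minimal, hypothetical sketch of the optimization described above (none of these names are marshmallow or toasted-marshmallow API): probe the container type once, build a specialized accessor, and reuse it for every object instead of re-checking inside a generic lookup function.

```python
from operator import attrgetter, itemgetter


def choose_accessor(sample, field_name):
    """Probe the container type once and return a specialized getter.

    Hypothetical helper for illustration only.
    """
    if isinstance(sample, dict):
        return itemgetter(field_name)
    return attrgetter(field_name)


row = {"id": 7, "name": "widget"}
get_name = choose_accessor(row, "name")  # type check happens once here...
print(get_name(row))                     # ...then the getter is called per object
```

The per-object hot path is then a single `itemgetter`/`attrgetter` call with no branching.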
As a proof of concept, replacing `self.get_attribute` with the builtin `getattr` in `Schema._serialize` got rid of most of the time difference between the two profiles.
PoC

@deckar01 Hmm, I think in general, most of our time spent in Marshmallow is actually for doing a `many` dump on list endpoints. I guess I'm just a little surprised that the time difference you see is so minimal. The T-M benchmarks show a significantly larger speedup for using T-M.
They are using an old fork of v2. We have landed several significant performance improvements in v3. I ran the original benchmark with the `many` dump and the results aren't that much different:
- marshmallow (dev): 1,616.64 usec/dump
- toasted (latest): 700.67 usec/dump
- marshmallow (PoC): 1,037.69 usec/dump
The remaining difference should be analyzed, but bypassing `get_value` still accounted for ~63% of the time difference between the two profiles.
I thought they were working off a v3 beta.
I'm just really surprised at the differences here. If anything I feel like they're worth understanding for their own sake. This would imply that Marshmallow has sped up on serialization by ~5x since T-M was made a thing?
T-M is using a vendorized fork of 2.15.1.
v3's removal of all validation on serialization probably made up a significant portion of the difference.
Well, `ma.utils.get_value` does a lot more than just `getattr` anyway: dotted getter notation and all that.
If serialization is actually 5x faster than it used to be, then maybe there's not much of a problem left. The logic in `get_value` handles a bunch of edge cases, e.g. dictionaries with properties or other weird combinations of ways to subscript things, so it's not as easy a drop-in as it might look at first glance.
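For readers unfamiliar with the utility being discussed, here is a deliberately simplified sketch of the *kind* of logic `utils.get_value` contains (this is not marshmallow's actual implementation): dotted-path traversal, plus falling back between subscription and attribute access.

```python
# Simplified illustration only -- marshmallow's real get_value handles
# more edge cases (indices, properties on dicts, missing sentinel, etc.).
_MISSING = object()


def get_value(obj, key, default=_MISSING):
    if "." in key:
        head, rest = key.split(".", 1)
        inner = get_value(obj, head, default)
        if inner is default:
            return default
        return get_value(inner, rest, default)
    try:
        return obj[key]  # try dict-style subscription first
    except (KeyError, IndexError, TypeError):
        return getattr(obj, key, default)  # fall back to attribute access


class Widget:
    name = "gear"


assert get_value({"widget": Widget()}, "widget.name") == "gear"
```

Every call pays for the `"." in key` check and the try/except, which is exactly the overhead the proposed accessor caching would avoid.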
toasted doesn't support dotted attribute syntax. There is a comment about it in the source code, but it doesn't seem to be documented in the readme.
https://github.com/lyft/toasted-marshmallow/blob/00a8e761/toastedmarshmallow/jit.py#L669-L672
The name of a field never changes once the schema is constructed, which is why the optimization works. We should be able to determine the accessor method during schema construction and cache it, so it doesn't have to be recomputed on every dump.
The complexity is that the accessor is a function of both the object and the field name.
The same schema with a `name` field can currently handle both an instance of a `Widget` class (reading `widget.name`) and a dictionary with `{"name": "whatever"}`.
So we'd at least need something like the T-M-style logic (as I recall it) of caching or memoizing based on what's passed in.
Ya, the init-time cache could only detect if the field name is dotted... We would need a dump-time check for the data type of the input that caches the actual accessor function. Currently both conditions are recomputed on each individual field dump.
Can't there be mixed object types, like a Widget
instance with items to get, or a dict
child class with attributes? Therefore the need for the per-field check?
Yup, exactly. You'd need to cache/memoize on the type of the object being serialized to be reasonably safe.
Now that we assume that data to dump is valid, deserialized data, the types should be known upfront, right?
> Therefore the need for the per-field check?
toasted checks the data up front and only falls back to the `get_value` behavior if it can be accessed both ways. The majority of use cases will be dicts or simple class instances, so being able to call `getattr` or `getprop` directly should be a major speedup for the most common use cases.
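The upfront probe described above could look something like this sketch (hypothetical helper, not toasted-marshmallow's actual code): take the fast specialized path only when access is unambiguous, and signal a fallback otherwise.

```python
def pick_accessor(sample, name):
    """Probe one sample object; return a fast getter, or None to mean
    'ambiguous -- fall back to the generic get_value-style lookup'.

    Illustration only; names and behavior are assumptions.
    """
    as_item = isinstance(sample, dict) and name in sample
    as_attr = hasattr(sample, name)
    if as_item and not as_attr:
        return lambda obj: obj[name]
    if as_attr and not as_item:
        return lambda obj: getattr(obj, name)
    return None  # accessible both ways (or neither): use the slow path
```

Note that a dict field named e.g. `keys` would be accessible both ways and correctly routed to the fallback.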
@deckar01 Sure, so I think a starting point would be to just memoize the last accessor based on the `type` of the object being dumped.
I agree with the direction here. Do we agree that the target end state is to cache accessors for all fields ahead of time? Since we assume `dump` input is valid data with known shape/types, the generic `utils.get_value` or memoization-with-fallbacks seems unnecessary.
I imagine the 'accessor factory' could be defined per-class. Something like:

```python
class BaseSchema(SchemaABC):
    class Meta:
        attribute_getter = operator.attrgetter  # default


class DictSchema(Schema):
    class Meta:
        attribute_getter = operator.itemgetter
```
> Do we agree that the target end state is to cache accessors for all fields ahead of time?
Ya. The discussion so far has pointed to doing it once at schema construction time.
Requiring the user to declare the schema's data type is probably the fastest way to handle it (and how I modeled it when building marsha to explore marshmallow optimizations). I am all for splitting slow code into separate paths that have to be explicitly opted into (ideally with overloaded interfaces rather than flags). This seems like it would be backwards incompatible unless the default was the slow path. It would be nice if it was fast by default.
Moving the type checking out of the field accessor and into the schema dump should preserve the existing functionality and be backwards compatible.
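A sketch of what "moving the type check into the schema dump" could mean (illustrative names, not marshmallow internals): the `isinstance` check runs once per `dump()` call, and every declared field reuses the result.

```python
from operator import attrgetter, itemgetter


class TinySchema:
    """Toy schema for illustration; not marshmallow's Schema."""

    field_names = ("id", "name")

    def dump(self, obj):
        # One type check per object, shared by all fields.
        factory = itemgetter if isinstance(obj, dict) else attrgetter
        return {name: factory(name)(obj) for name in self.field_names}
```

The same schema instance then handles dicts and objects identically, preserving the existing dual behavior.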
Related to my above comment: we could even support setting getters in both directions (dump and load). This would allow you to validate objects (not just `dict`s), which is clunky right now.

```python
class Meta:
    dump_getter = operator.attrgetter
    load_getter = operator.attrgetter
```
That said, `load`-ing objects is semantically tricky. For example, what would `unknown` validation look like? Maybe it's not something we should support.
I think the clunkiness of validating objects does not come from the attr/item getters but from the fact that the keys may have different names (`data_key`), let alone the pre/post-load/dump decorators.
@taion A memoization table has a significant amount of overhead. I compared memoizing `get_accessor(obj)` to passing the value into `schema.get_attribute()`:

| get_attribute | time (s) | relative |
|---|---|---|
| baseline | 2.75 | -- |
| lru_cache | 1.45 | -47% |
| arg | 0.71 | -74% |
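For reference, a rough sketch of the kind of micro-benchmark behind a table like the one above (my own reconstruction, not the code used for those numbers): resolving the accessor through an `lru_cache` on every access vs. receiving the already-resolved accessor directly. Absolute numbers vary by machine; only the ordering is interesting.

```python
import timeit
from functools import lru_cache
from operator import itemgetter


@lru_cache(maxsize=None)
def cached_accessor(tp, name):
    # Cache lookup + tuple hashing happen on every call.
    return itemgetter(name)


row = {"name": "widget"}
resolved = itemgetter("name")  # resolved once, passed in as an argument

t_cache = timeit.timeit(lambda: cached_accessor(type(row), "name")(row), number=50_000)
t_arg = timeit.timeit(lambda: resolved(row), number=50_000)
print(f"lru_cache lookup: {t_cache:.4f}s, pre-resolved arg: {t_arg:.4f}s")
```

The per-call hashing of the cache key is the overhead being measured.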
Can we hard-memoize only the most recent value, and do it per schema?
A little tricky where to store the per-field cached accessors anyway, no? I guess it could be some sort of map on the schema class or something?
Hi folks, I'm wondering if there are any updates on serialization performance? I'm running into a bottleneck when dumping db objects into a nested schema. Alternatively, is there anything you can recommend to optimize this? TIA