marshmallow
Investigate Serialization Performance
Profile schema serialization. Compare the JIT hooks proposed in #629 with other optimization techniques.
/cc @taion
Any profiling results that are shared here should be accompanied by the code used to generate them so that they can be verified and tested on other systems.
I did some casual profiling a while ago... it looks like you need to strip away a ridiculous amount of logic to get to toasted-marshmallow-level performance. Stuff like inlining the serialization loop and replacing the generic "get value" function with plain `getattr` only made up a relatively small portion of the gap. Even doing definitely-wrong things like dropping error handling didn't close much of it. I'll see if I still have the code around when I get back to my work computer on Monday.
I haven't had a chance to read through the toasted-marshmallow code, but I'm even more surprised now over how fast it is.
That said, it's pretty complicated, and it's hard to tell if it's in active use in many places, so I'm a bit scared to pull it in as a dependency for my code. But it speeds things up by a shocking amount.
cc @rowillia
For my testing I modified the benchmark to dump single objects rather than many, since this more closely reflects the use case of a server dumping on request.
Source: https://github.com/marshmallow-code/marshmallow/compare/dev...f35f908
Marshmallow (dev)

Toasted Marshmallow (latest)

toasted isn't just JITing marshmallow. The readme states "Even PyPy benefits from toastedmarshmallow!". If it were just a JIT, PyPy wouldn't show any significant improvement.
After inspecting the jit source, I identified a non-JIT optimization that provides a significant speedup and should be reasonable to maintain. `get_value`/`get_attribute` appears to account for the entire time difference between these two profiles. toasted avoids this overhead by probing the type of the data container once, then choosing a specialized module that plucks those attributes. This would likely only require adding 2-3 small methods and would not require any interface changes.
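To make the idea concrete, here is a minimal, hypothetical sketch of the optimization described above (none of these names are marshmallow or toasted-marshmallow API): probe the container type once, build a specialized accessor, and reuse it for every object instead of re-checking inside a generic lookup function.

```python
from operator import attrgetter, itemgetter


def choose_accessor(sample, field_name):
    """Probe the container type once and return a specialized getter.

    Hypothetical helper for illustration only.
    """
    if isinstance(sample, dict):
        return itemgetter(field_name)
    return attrgetter(field_name)


row = {"id": 7, "name": "widget"}
get_name = choose_accessor(row, "name")  # type check happens once here...
print(get_name(row))                     # ...then the getter is called per object
```

The per-object hot path is then a single `itemgetter`/`attrgetter` call with no branching.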
As a proof of concept, replacing `self.get_attribute` with the builtin `getattr` in `Schema._serialize` got rid of most of the time difference between the two profiles.
PoC

@deckar01 Hmm, I think in general, most of our time spent in Marshmallow is actually for doing a `many` dump on list endpoints. I guess I'm just a little surprised that the time difference you see is so minimal. The T-M benchmarks show a significantly larger speedup for using T-M.
They are using an old fork of v2. We have landed several significant performance improvements in v3. I ran the original benchmark with the `many` dump and the results aren't that much different:
- marshmallow (dev): 1,616.64 usec/dump
- toasted (latest): 700.67 usec/dump
- marshmallow (PoC): 1,037.69 usec/dump
The remaining difference should be analyzed, but bypassing `get_value` still accounted for ~63% of the time difference between the two profiles.
I thought they were working off a v3 beta.
I'm just really surprised at the differences here. If anything I feel like they're worth understanding for their own sake. This would imply that Marshmallow has sped up on serialization by ~5x since T-M was made a thing?
T-M is using a vendorized fork of 2.15.1.
v3's removal of all validation on serialization probably made up a significant portion of the difference.
Well, `ma.utils.get_value` does a lot more than just `getattr` anyway: dotted getter notation and all that.
If serialization is actually 5x faster than it used to be, then maybe there's not much of a problem left. The logic in `get_value` handles a bunch of edge cases, e.g. dictionaries with properties or other weird combinations of ways to subscript things, so it's not as easy a drop-in as it might look at first glance.
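For readers unfamiliar with the utility being discussed, here is a deliberately simplified sketch of the *kind* of logic `utils.get_value` contains (this is not marshmallow's actual implementation): dotted-path traversal, plus falling back between subscription and attribute access.

```python
# Simplified illustration only -- marshmallow's real get_value handles
# more edge cases (indices, properties on dicts, missing sentinel, etc.).
_MISSING = object()


def get_value(obj, key, default=_MISSING):
    if "." in key:
        head, rest = key.split(".", 1)
        inner = get_value(obj, head, default)
        if inner is default:
            return default
        return get_value(inner, rest, default)
    try:
        return obj[key]  # try dict-style subscription first
    except (KeyError, IndexError, TypeError):
        return getattr(obj, key, default)  # fall back to attribute access


class Widget:
    name = "gear"


assert get_value({"widget": Widget()}, "widget.name") == "gear"
```

Every call pays for the `"." in key` check and the try/except, which is exactly the overhead the proposed accessor caching would avoid.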
toasted doesn't support dotted attribute syntax. There is a comment about it in the source code, but it doesn't seem to be documented in the readme.
https://github.com/lyft/toasted-marshmallow/blob/00a8e761/toastedmarshmallow/jit.py#L669-L672
The name of a field never changes once the schema is constructed, which is why the optimization works. We should be able to determine the accessor method during schema construction and cache it, so it doesn't have to be recomputed on every dump.
The complexity is that the accessor is a function of both the object and the field name.
The same schema with a `name` field can currently handle both an instance of a `Widget` class (reading `widget.name`) and a dictionary with `{"name": "whatever"}`.
So we'd at least need something like the T-M-style logic (as I recall it) of caching or memoizing based on what's passed in.
Ya, the init-time cache could only detect if the field name is dotted... We would need a dump-time check for the data type of the input that caches the actual accessor function. Currently both conditions are recomputed on each individual field dump.
Can't there be mixed object types, like a Widget
instance with items to get, or a dict
child class with attributes? Therefore the need for the per-field check?
Yup, exactly. You'd need to cache/memoize on the type of the object being serialized to be reasonably safe.
Now that we assume that data to dump is valid, deserialized data, the types should be known upfront, right?
> Therefore the need for the per-field check?
toasted checks the data up front and only falls back to the `get_value` behavior if it can be accessed both ways. The majority of use cases will be dicts or simple class instances, so being able to call `getattr` or `getprop` directly should be a major speedup for the most common use cases.
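The upfront probe described above could look something like this sketch (hypothetical helper, not toasted-marshmallow's actual code): take the fast specialized path only when access is unambiguous, and signal a fallback otherwise.

```python
def pick_accessor(sample, name):
    """Probe one sample object; return a fast getter, or None to mean
    'ambiguous -- fall back to the generic get_value-style lookup'.

    Illustration only; names and behavior are assumptions.
    """
    as_item = isinstance(sample, dict) and name in sample
    as_attr = hasattr(sample, name)
    if as_item and not as_attr:
        return lambda obj: obj[name]
    if as_attr and not as_item:
        return lambda obj: getattr(obj, name)
    return None  # accessible both ways (or neither): use the slow path
```

Note that a dict field named e.g. `keys` would be accessible both ways and correctly routed to the fallback.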
@deckar01 Sure, so I think a starting point would be to just memoize the last accessor based on the `type` of the object being dumped.
I agree with the direction here. Do we agree that the target end state is to cache accessors for all fields ahead of time? Since we assume `dump` input is valid data with known shape/types, the generic `utils.get_value` or memoization-with-fallbacks seems unnecessary.
I imagine the 'accessor factory' could be defined per-class. Something like:

```python
class BaseSchema(SchemaABC):
    class Meta:
        attribute_getter = operator.attrgetter  # default


class DictSchema(Schema):
    class Meta:
        attribute_getter = operator.itemgetter
```
> Do we agree that the target end state is to cache accessors for all fields ahead of time?
Ya. The discussion so far has pointed to doing it once at schema construction time.
Requiring the user to declare the schema's data type is probably the fastest way to handle it (and how I modeled it when building marsha to explore marshmallow optimizations). I am all for splitting slow code into separate paths that have to be explicitly opted into (ideally with overloaded interfaces rather than flags). This seems like it would be backwards incompatible unless the default was the slow path. It would be nice if it was fast by default.
Moving the type checking out of the field accessor and into the schema dump should preserve the existing functionality and be backwards compatible.
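A sketch of what "moving the type check into the schema dump" could mean (illustrative names, not marshmallow internals): the `isinstance` check runs once per `dump()` call, and every declared field reuses the result.

```python
from operator import attrgetter, itemgetter


class TinySchema:
    """Toy schema for illustration; not marshmallow's Schema."""

    field_names = ("id", "name")

    def dump(self, obj):
        # One type check per object, shared by all fields.
        factory = itemgetter if isinstance(obj, dict) else attrgetter
        return {name: factory(name)(obj) for name in self.field_names}
```

The same schema instance then handles dicts and objects identically, preserving the existing dual behavior.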
Related to my above comment: we could even support setting getters in both directions (dump and load). This would allow you to validate objects (not just `dict`s), which is clunky right now.

```python
class Meta:
    dump_getter = operator.attrgetter
    load_getter = operator.attrgetter
```
That said, `load`-ing objects is semantically tricky. For example, what would `unknown` validation look like? Maybe it's not something we should support.
I think the clunkiness of validating objects does not come from the attr/item getters but from the fact that the keys may have different names (`data_key`), let alone the pre/post-load/dump decorators.
@taion A memoization table has a significant amount of overhead. I compared memoizing `get_accessor(obj)` to passing the value into `schema.get_attribute()`:

| get_attribute | time (s) | relative |
|---|---|---|
| baseline | 2.75 | -- |
| lru_cache | 1.45 | -47% |
| arg | 0.71 | -74% |
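For reference, a rough sketch of the kind of micro-benchmark behind a table like the one above (my own reconstruction, not the code used for those numbers): resolving the accessor through an `lru_cache` on every access vs. receiving the already-resolved accessor directly. Absolute numbers vary by machine; only the ordering is interesting.

```python
import timeit
from functools import lru_cache
from operator import itemgetter


@lru_cache(maxsize=None)
def cached_accessor(tp, name):
    # Cache lookup + tuple hashing happen on every call.
    return itemgetter(name)


row = {"name": "widget"}
resolved = itemgetter("name")  # resolved once, passed in as an argument

t_cache = timeit.timeit(lambda: cached_accessor(type(row), "name")(row), number=50_000)
t_arg = timeit.timeit(lambda: resolved(row), number=50_000)
print(f"lru_cache lookup: {t_cache:.4f}s, pre-resolved arg: {t_arg:.4f}s")
```

The per-call hashing of the cache key is the overhead being measured.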
Can we hard-memoize only the most recent value, and do it per schema?
A little tricky where to store the per-field cached accessors anyway, no? I guess it could be some sort of map on the schema class or something?
Hi folks, I'm wondering if there are any updates on serialization performance? I'm running into a bottleneck when dumping db objects into a nested schema. Alternatively, is there anything you can recommend to optimize this? TIA