marshmallow Support streaming dump

If you feed a generator into a many=True schema, Marshmallow consumes the entire generator into memory before serializing it. This makes serializing a large collection slower and more memory-hungry than necessary, and can even exceed the system's available memory or the environment's time limits. These concerns are especially common in web services, where Marshmallow is often used to serialize JSON response bodies, web workers often run in memory-constrained environments, and clients or gateways time out if the service takes too long to start streaming a response.

Users can currently hack around this with something like the following:

from typing import Iterable
from marshmallow import Schema


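def dumps_many(obj: Iterable, schema: Schema):
    # Pass many=False per call instead of toggling schema.many, so the schema
    # is left untouched even if the consumer abandons this generator early.
    yield "["
    # A plain for loop (rather than a None sentinel) keeps going when the
    # iterable itself contains None items.
    for n, item in enumerate(obj):
        if n:
            yield ","
        yield schema.dumps(item, many=False)
    yield "]"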


if __name__ == "__main__":
    import sys
    from marshmallow.fields import Int

    class MySchema(Schema):
        i = Int(required=True)

    obj = (dict(i=i) for i in range(int(sys.argv[1])))
    print(repr("".join(dumps_many(obj, MySchema(many=True)))))


# $ python3 foo.py 0
# '[]'
# $ python3 foo.py 1
# '[{"i": 0}]'
# $ python3 foo.py 2
# '[{"i": 0},{"i": 1}]'
# $ python3 foo.py 9999999999999  # you get the idea
# ...

But it would be great if Marshmallow offered first-class support for this.

Looks like this was previously discussed briefly in https://github.com/marshmallow-code/marshmallow/pull/1164#issuecomment-473316007, where @deckar01 said:

"We might want to explore streaming with generators in 3.x."

Is now a good time to add this to Marshmallow v3? Could be another really strong reason for v2 users to upgrade.

Thanks for your consideration and for the great work on Marshmallow!

Nov 24 '20 20:11 jab

Yes, this is certainly worth revisiting. If I'm not mistaken, https://github.com/marshmallow-code/marshmallow/pull/810 should have obviated the need to consume generators into memory.

So this might be as simple as removing https://github.com/marshmallow-code/marshmallow/blob/fa6c7379468f59d4568e29cbbeb06b797d656215/src/marshmallow/schema.py#L549-L550 , though I've not given much thought to the consequences. @jab Would you be up for doing a more thorough investigation of this?

Dec 01 '20 21:12 sloria

Hi @sloria, thanks for looking at this, and great that this should be possible now and is worth revisiting!

I investigated a bit and committed the results so far in https://github.com/jab/marshmallow/commit/21b8c767b37e73793cd7b19eae8d668fdf26263f.

With the changes you suggested above, all tests still passed, but unfortunately that wasn't quite enough to achieve streaming dumps. It looks like this is because in the many=True case, Schema._serialize() (which gets called by Schema.dump/dumps) also builds up the entire list into memory before returning it, see: https://github.com/marshmallow-code/marshmallow/blob/324766619c885965c9f850c4034efe1855d28b3c/src/marshmallow/schema.py#L516-L520

I tried changing that from a list comprehension to a generator expression, but then the tests no longer passed. So it seems like the fix for this is at least a little more involved.
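I haven't pinned down exactly which tests fail, so treat the following as one plausible contributing factor rather than a diagnosis: once the serialized result is a generator instead of a list, anything downstream that hands it to the standard json encoder breaks. A minimal stdlib-only sketch (the variable names are just for illustration):

```python
import json

# A lazily produced collection, standing in for what a generator-returning
# serializer would hand back.
items = ({"i": i} for i in range(3))

try:
    json.dumps(items)
    error = None
except TypeError as exc:
    # The stdlib encoder has no built-in handling for generators.
    error = str(exc)

# The same items as a list serialize without complaint.
as_list = json.dumps([{"i": i} for i in range(3)])

print(error)
print(as_list)
```

So returning a generator changes the contract for every caller that expects a list, not just the internals.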

Still, I hope this was helpful, and that equipped with these results, it will be easy for you or another contributor with more familiarity with Marshmallow internals to fix this. Thanks again for taking a look.

Feb 02 '21 15:02 jab

Actually, I realized that streaming dump should be supported for all schemas, not just many=True schemas (and renamed the issue to make this clearer).

For example, it should still be possible to dump a many=False schema like the following in a streaming fashion:

from marshmallow import Schema, fields


class Foo(Schema):
    ints = fields.List(fields.Integer())

# dump_streaming is the method proposed here, not an existing Marshmallow API.
for chunk in Foo().dump_streaming({"ints": range(9999999)}):
    print(chunk)

# or supposing "app" is a Flask app:
@app.route("/foo")
def foo():
    # Since this returns a generator, Flask streams the response body just fine:
    return Foo().dump_streaming({"ints": range(9999999)})

It looks like Python's built-in json library offers streaming serialization in a separate JSONEncoder.iterencode(...) method. I'm not sure, but rather than changing marshmallow.Schema's existing dump/dumps methods, perhaps it'd be easier to add this in a separate method like iterencode?

Here is a very basic demo of doing exactly that: https://github.com/jab/marshmallow/commit/97cbc7a21e694dfbbe0d7509ad08595f7a7d2455

(Note, I used simplejson instead of json there since its iterable_as_array option makes for a particularly concise demo with a generator. The same is possible with the built-in json library; it just requires a little more work, namely a custom JSONEncoder subclass with similar generator support.)
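For anyone who wants to stay on the built-in json library without writing a JSONEncoder subclass, here is a rough alternative sketch: a small recursive generator that streams containers itself and only delegates scalars to json.dumps. The name iter_json is made up for this example, and it assumes dict keys are strings and that values are already plain Python data (for instance, the output of a schema dump).

```python
import json
from collections.abc import Iterable, Iterator, Mapping


def iter_json(value) -> Iterator[str]:
    """Yield JSON text for `value` chunk by chunk, without materializing iterables."""
    if isinstance(value, Mapping):
        yield "{"
        for n, (key, item) in enumerate(value.items()):
            if n:
                yield ", "
            # Assumption: keys are strings (str() is applied defensively).
            yield json.dumps(str(key)) + ": "
            yield from iter_json(item)
        yield "}"
    elif isinstance(value, Iterable) and not isinstance(value, (str, bytes)):
        # Lists, tuples, ranges and generators all stream as JSON arrays.
        yield "["
        for n, item in enumerate(value):
            if n:
                yield ", "
            yield from iter_json(item)
        yield "]"
    else:
        # Scalars (str, int, float, bool, None) go through the stdlib encoder.
        yield json.dumps(value)


# The generator is consumed lazily, one element at a time.
streamed = "".join(iter_json({"ints": (i for i in range(3))}))
print(streamed)
```

Joining the chunks is only for the demo; in a web handler you would return the generator itself so the response body streams.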

Feb 02 '21 17:02 jab

Hi @sloria, do you have any thoughts on this? Thanks!

Apr 03 '21 18:04 jab

@jab Apologies for the delay; took a break from marshmallow work for the past few months due to professional/personal priorities.

Thanks for doing that investigation. So it appears that the initial list cast is unnecessary; sent a PR to remove that in https://github.com/marshmallow-code/marshmallow/pull/1785. But as you pointed out, the serialization result of a many Schema will still be a list. Same with List fields.

Perhaps we could add a fields.Generator to support streaming fields. As for supporting streaming with many Schemas, I'm still not 100% sure this is something that belongs in marshmallow core. Simply put: it's a niche use case that may not be worth the complexity/added API surface at this time.

For now, I'd suggest implementing a base Schema subclass and Generator field that serialize to generators.

Apr 03 '21 20:04 sloria