marshmallow
marshmallow copied to clipboard
RFC: Remove Nested(many=True) to keep only the List(Nested()) form
See discussion in https://github.com/marshmallow-code/marshmallow/issues/493#issuecomment-380468027.
We have two ways to do the same thing. They sometimes behave differently, which generally means one of them is buggy.
IMHO, and from what I have seen from users around me, it is more natural for a new user to write List(Nested) than Nested(many). Besides, it is consistent with Dict(values=Nested).
I realize this is quite a breaking change, both from user perspective, because Nested(many=True) is probably the most commonly used form, and from developer perspective, because it involves a lot of code changes (but also a lot of code removal, which is nice).
There could be things that can be done with Nested(many=True) but not with List(Nested), because in the latter case, the nested Schema has access to the whole data, while in the former, it is called on each element. Perhaps some pre_load edge cases might suffer from the change. Any real-life example, anyone?
Anyway, since the question was raised in a comment in another issue, I figured I'd give it more visibility.
Is there any documentation outlining the differences between these two? I find the distinction confusing as well.
I don't think there is supposed to be a difference.
I think deprecating Nested(many=True) is a good idea. Would someone be up for doing an analysis of the current differences between the two APIs?
Do you mean deprecate in 2.X and remove in 3 or deprecate in 3?
Here are a few differences:
- Partial data if an item fails validation [#766] (Fixed in #1052)
- Validation error issue with
Nested(many=True)[#493] (List(Nested)does the right thing.) Listis not propagating the context between a nested field and the parent schema [#922] (Fixed)onlyargument inconsistent between Nested(S, many=True) and List(Nested(S)) [#946] (Fixed in #1229)- Different error message when deserializing an invalid input (not a list) (
List(Nested)does the right thing.) partialnot propagated toList(Nested)(Fixed in #1230)
@lafrech I was thinking of deprecating in 3. As you pointed out, Nested(many=True) is the probably the more common usage, and removing it would break many users' code.
How would a self-referencing schema work with this? In other words, how would we express
friends = Nested('self', many=True, exclude=('friends',))
I would expect friends = List(Nested('self', exclude=('friends',))) to work the same way, but it currently errors. Using a reference to the schema with python 3 works, so I suspect this is just an edge case that hasn't been tested.
Perhaps some pre_load edge cases might suffer from the change. Any real-life example, anyone?
I've got production code that relies on the fact that a nested field with many=True and @pre_dump(pass_many=True) on its schema will execute this pre_dump only once.
The benefit is making sure expensive code that is needed to be executed before nested fields can be dumped is not repeated for each element in the fields.List.
I don't really see this as an edge case though, pretty sure doing something to the whole set of related objects before dumping/loading them is a common pattern.
Currently a list of nested fields does not propagate the schema context correctly, but nested many does. See #424 and #922.
Yep, I just updated the list above.
I saw the ref on the issue, but I couldn't figure out where it came from. Thanks.
@anka-sirota yes it seems like a common use case, but it very well could be done on parent schema or by subclassing List and doing custom pre_dump there. All that is possible without handling a separate many case, which I would I agree with @lafrech, just source of inconsistent in the library.
Looking at bigger picture, that this issue already have been here for couple months now, and there still few bugs before many=True can be complete eliminated from the library, I would suggest just calling it deprecated in marshmallow 3 release.
Such move will:
- prevent further delay of marshmallow 3
- make it so that there is more interest in
List(Nested)and that it works consistently withNested(many=True)
@lafrech another difference to be added to the list https://github.com/marshmallow-code/marshmallow/issues/946
@rooterkyberian, I was just searching for this thread to reference your issue.
All differences listed above are either fixed or on their way to be fixed. Once this is done, we should be able to deprecate Nested(many=True).
There is also a difference between Nested(Schema, many=True) and List(Nested(Schema)) when loading partial data. Below tests are showing the difference in behaviour when using both methods:
from marshmallow import Schema
from marshmallow.fields import List, Nested, Str
class Employee(Schema):
name = Str(required=True)
title = Str(required=True)
def test_nested_partial_load_with_list():
class Company(Schema):
employees = List(Nested(Employee))
payload = {
'employees': [{
'name': 'Arthur',
}],
}
# Partial loading with list type shouldn't generate any errors.
result = Company().load(payload, partial=True)
assert result['employees'][0]['name'] == 'Arthur'
def test_nested_partial_load_with_nested_many():
class Company(Schema):
employees = Nested(Employee, many=True)
payload = {
'employees': [{
'name': 'Arthur',
}],
}
# Partial loading with Nested(many=True) type shouldn't generate any errors.
result = Company().load(payload, partial=True)
assert result['employees'][0]['name'] == 'Arthur'
Partial loading in nested schemas was introduced in version 3.0.0rc1 in this PR: #849. Problem is that when a field is of type List, it is not getting the same kwargs as Nested so it can't ignore missing data. I think it was even mentioned in referenced discussion.
Thans @0xpsilocybe, I think this is the object of @sloria's comment in #1066. I shall update the PR to propagate all needed kwargs to Nested.
What about the Pluck(many=True) use case ?
Removing many would allow a few lines of explicit many handling to be removed from Pluck. The rest of the changes would automatically be inherited from Nested.
@lafrech Is there anything else to be addressed before we can deprecate Nested(many=True)? I think we'll want to investigate https://github.com/marshmallow-code/marshmallow/issues/779#issuecomment-396354987 .
In any case, I don't think we should block on this for releasing 3.0. We can always deprecate within the 3.x line.
@sloria :+1: on depreciating stuff rather than stacking the deck with more BC breaks. I'd much rather have 8 releases with 8 breaking changes than 8 breaking changes in one release...
I don't know about anyone else, but we've got hundreds - maybe even a thousand schemas to migrate. I'd rather take that in strides!
@fuhrysteve, about this, please see the conversation in https://github.com/marshmallow-code/marshmallow/issues/928.
Changed the milestone on this since there are still further considerations to discuss (deprecating Nested(many=True)), but I don't think there's any action items to complete before 3.0 final.
Indeed, deprecation can be done in 3.x so it doesn't have to hold 3.0.
I sent https://github.com/marshmallow-code/marshmallow/pull/1290 with a fix for that last List(Nested(...)) glitch we identified above (https://github.com/marshmallow-code/marshmallow/issues/779#issuecomment-396354987).
Before we officially recommend List(Nested(Schema)), it would be good to do some benchmarking and profiling. A quick-n-dirty benchmark shows that List(Nested(Schema)) is ~35% slower than Nested(Schema, many=True): https://gist.github.com/sloria/2e0b141b3d05fe0dafbdeafd294ebae8
I might try to dig into a perf dump and see where the time is going. That seems like a lot of time to just be caused by an extra layer of indirection. I wonder if Nested isn't doing something it should be or if List is doing something it doesn't need.
I ran the benchmarks above using python 3.7.2 against the dev branch. I had to make a copy of Schema to get sane cProfile output, because recursion merges calls into a single total and hides the problem.
Nested(many)
List(Nested)
It looks like the slowdown is caused by Nested(many) calling schema.dump(many) once and getting to loop on schema._serialize(), but List(Nested) having to loop on schema.dump().
About 20% of that slowdown is caused by calls to _has_processors() which is cacheable and ErrorStore() which could possibly be removed or refactored as a shared instance. Optimizing these would also make schema.dump() calls slightly faster in general.
The rest of the slow down seems to be memory allocation and conditional evaluation in dump(). The easiest way to fix that is probably to probe the list's inner field type and allow it to hit the optimized schema path once via Nested._serialize(many).
#1304 makes dump() ~18% faster for both paths. #1306 makes dump() ~21% faster for List(Nested), which makes the two paths run in effectively the same amount of time (< 1%).
Great.
Is there the same kind of gap between Nested(many) and List(Nested) when loading?
Ya, load()has the same ~35% slowdown when comparing List(Nested) to Nested(many). The same strategy should work there.
Edit: I added the fix for load() in #1306 and confirmed that it makes them run in the same time without slowing down Nested(many).