marshmallow_dataclass icon indicating copy to clipboard operation
marshmallow_dataclass copied to clipboard

support generic dataclasses

Open onursatici opened this issue 3 years ago • 27 comments

Previously, generic dataclasses were not supported even if they had type arguments, now it is possible to do the following:

T = typing.TypeVar("T")

@dataclasses.dataclass
class SimpleGeneric(typing.Generic[T]):
    data: T

schema_int = marshmallow_dataclass.class_schema(SimpleGeneric[int])  # this works
schema = marshmallow_dataclass.class_schema(SimpleGeneric)  # this doesn't, no type arguments

It is also possible to use such generic aliases nested in another dataclass:

@dataclasses.dataclass
class Nested:
    data: SimpleGeneric[int]
    
schema = marshmallow_dataclass.class_schema(Nested)  # generated schema will validate data has an integer value

onursatici avatar Jan 11 '22 18:01 onursatici

@lovasoa does this seem like the right approach to you? I wanted to keep the previous behaviour of not generating a schema for a generic dataclass if it does not specify any type arguments, but if they do, it feels like it would be a good feature for marshmallow-dataclass to be able to create a schema, respecting the given type arguments

onursatici avatar Jan 21 '22 22:01 onursatici

I don't know how easy it is to do, but it would be nice to generalize some more.

As an example, currently, with this PR, the following does not work, though there are no unfixed type parameters.

from dataclasses import dataclass
from typing import Generic
from typing import TypeVar

from marshmallow_dataclass import class_schema

T = TypeVar("T")

@dataclass
class Simple(Generic[T]):
    x: T

@dataclass
class Nested(Generic[T]):
    nested: Simple[T]

schema_class = class_schema(Nested[int])

(⇒ TypeError: T is not a dataclass and cannot be turned into one.)

dairiki avatar Jan 26 '22 22:01 dairiki

Hello ! Sorry for the delay ! Looks like there is a type issue on python 3.7

lovasoa avatar Apr 21 '22 16:04 lovasoa

Sorry for the delay here, I will take a look to the mypy complaint and will see if it is low lift to support nested generic dataclasses. It might take a while until I find the time so feel free to close till then

onursatici avatar Jun 02 '22 12:06 onursatici

Will this get merged? Is there a workaround without mergin this PR?

jemshit avatar Nov 21 '22 11:11 jemshit

I will be happy to merge it when conflicts are fixed and the tests pass. @jemshit, if you want to pick up the work where it was left off, that would be welcome !

lovasoa avatar Nov 21 '22 13:11 lovasoa

@lovasoa should be good for a review, I haven't ran tests for older python versions though. Appreciate if you could kick off CI again

onursatici avatar Nov 28 '22 09:11 onursatici

I have also added a test to cover @dairiki 's comment. It should support nested generics now and it should only work when the same concrete type is used for the same TypeVar

onursatici avatar Nov 28 '22 09:11 onursatici

@onursatici Tests were failing due to pre-commit hook bitrot/breakage. I've fixed that and merged those fixes here.

Now there is a test failure for python < 3.10. I haven't looked at it carefully yet, to see what the fix (or even the cause) might be.

dairiki avatar Nov 30 '22 01:11 dairiki

Thanks a lot @dairiki , it seems like __name__ was not an attribute of generic alias prior to 3.10, I can have a look

onursatici avatar Dec 01 '22 16:12 onursatici

Also this implementation has a bug for repeated use of the same generic dataclasses.

Example:

# class B is a generic dataclass
class A:
  a: B[str]
  a: B[int]
  c: B[int]

Because marshmallow_dataclass supports forward references, the field generated for c would be a Nested field, with the forward reference "B" as its argument. Now when using this schema, marshmallow will happily resolve the forward schema to be the first schema it has registered for this class B, which is of the B[str] variant, because it does appear as the first field with a B type. Because of this behaviour the schema of A will fail to deserialise field c, because it would treat it as B[str].

one ugly fix would be reordering fields:

class A:
  a: B[int]
  c: B[int]
  a: B[str]

now field c would be correctly identified as B[int] because it is the first B marshmallow sees, from field a.

Fixing this in general without breaking forward reference support is a bit tricky, because marshmallow's class registry is oblivious to generic arguments. We can accept that this is not something we support or log warning, but that would require keeping track of extra state (which variant of this generic class have we seen first, and is it different than the variant we are trying to use a forward reference for now, if so, log warning)

onursatici avatar Dec 01 '22 16:12 onursatici

@dairiki fixed conflicts and the failing tests for python < 3.10. Also added a test for my comment above

onursatici avatar Dec 09 '22 11:12 onursatici

@onursatici Thank you!

The tests under python 3.6 had started failing due to the upgrade of the ubuntu-latest runners from 20.04 to 22.04. I've fixed that and merged that fix here.

There's still a test failure under python 3.6. (I haven't looked at it in detail.)

dairiki avatar Dec 10 '22 18:12 dairiki

@dairiki py3.6 should be fine now based on my local testing. Would be nice to kick CI one more time 🤞

onursatici avatar Dec 15 '22 04:12 onursatici

@onursatici I still haven't found time to dig into this deeply, so these may be stupid questions, but...

Also this implementation has a bug for repeated use of the same generic dataclasses.

Example:

# class B is a generic dataclass
class A:
  a: B[str]
  a: B[int]
  c: B[int]

Because marshmallow_dataclass supports forward references, the field generated for c would be a Nested field, with the forward reference "B" as its argument. Now when using this schema, marshmallow will happily resolve the forward schema to be the first schema it has registered for this class B, which is of the B[str] variant, because it does appear as the first field with a B type. Because of this behaviour the schema of A will fail to deserialise field c, because it would treat it as B[str].

Does this example depend on having the repeated a field? (Or could the second a field be named b?)

one ugly fix would be reordering fields:

class A:
  a: B[int]
  c: B[int]
  a: B[str]

now field c would be correctly identified as B[int] because it is the first B marshmallow sees, from field a.

Why doesn't this break the (second) a field? Wouldn't it mistakenly be treated as B[int]?

What happens if there are multiple dataclasses with B fields?

class A1:
    x: B[int]
SchemaA1 = class_schema(A1)

class A2:
    y: B[str]
SchemaA2 = class_schema(A2)

Does the same issue pertain?

Fixing this in general without breaking forward reference support is a bit tricky, because marshmallow's class registry is oblivious to generic arguments.

What would fixing that involve?

We can accept that this is not something we support or log warning, but that would require keeping track of extra state (which variant of this generic class have we seen first, and is it different than the variant we are trying to use a forward reference for now, if so, log warning)

This does seem like creepy behavior, especially so if we can't issue a warning when someone bumps into it.

dairiki avatar Dec 15 '22 18:12 dairiki

@dairiki here is my understanding:

Does this example depend on having the repeated a field? (Or could the second a field be named b?)

my bad, the issue happens when the fields have distinct names, so the example should have been:

# class B is a generic dataclass

# this is not fine
class A:
  a: B[str]
  b: B[int]
  c: B[int]

# this is fine
class A:
  b: B[int]
  c: B[int]
  a: B[str]

Why doesn't this break the (second) a field? Wouldn't it mistakenly be treated as B[int]?

As far as I understand _RECURSION_GUARD.seen_classes in marshmallow_dataclass correctly uses generic arguments as keys, so B[int] and B[str] would not end up in a nested schema field (meaning it will miss the seen_classes cache). The problem is with marshmallow, which also has a class registry, and it doesn't know about generic arguments. So for this bug to occur, we would need marshmallow_dataclass.class_schema to create a Nested type for a field, and it only does it if it sees two identical types, including generic arguments. However on deserialisation, marshmallow treats this Nested field type to have a type of the first seen generic variant of the class, because the key it uses for its internal cache (i.e class_registry) does not have the generic arguments of the type it is deserialising.

it goes something like this:

# comments explain the field schemas `class_schema` derives from the dataclass
class A:
  a: B[str]  # field schema correctly created as B[str]
  b: B[int]  # field schema correctly created as B[int]
  c: B[int]  # B[int] is in `seen_classes`, field schema is Nested(B)

# when we do this:
marshmallow_dataclass.class_schema(A)().load(a_object)
# marshmallow does:
# load a first, add its type `B` to class_registry, so the registry has the mapping `B -> schema of B[str]`
# load b, use given field schema, which is for B[int]
# load c, this has a Nested(B) schema, and the deserialiser stored in class_registry for B is for B[str], so this fails


# when we reorder fields like this:
class A:
  b: B[int] # field schema correctly created as B[int]
  c: B[int] # B[int] is in `seen_classes`, field schema is Nested(B)
  a: B[str] # field schema correctly created as B[str]

# now when we do this:
marshmallow_dataclass.class_schema(A)().load(a_object)
# marshmallow does:
# load b first, add its type `B` to class_registry, so the registry has the mapping `B -> schema of B[int]`
# load c, this has a Nested(B) schema, and the deserialiser stored in class_registry for B is for B[int], so this succeeds
# load a, use given field schema, which is for B[str]

Does the same issue pertain?

I don't think it will because none of these classes would have a field schema of Nested. That is when marshmallow's class registry is read as far as I know

What would fixing that involve?

Some thoughts:

  • do not use nested field schema's if the value has generic arguments, but that would break recursive generic types
  • fix marshmallow class_registry upstream so it does respect generic args
  • support recursive types but still do not use nested schemas, but limit the level of type recursion

onursatici avatar Dec 19 '22 13:12 onursatici

@onursatici

Also this implementation has a bug for repeated use of the same generic dataclasses.

Aha!

If I understand correctly, here's what's going on.

Currently, we store the class name in _RECURSION_GUARD.seen_classes, keyed by type. I.e. for B[int] we set seen_classes[B[int]] = "B", and for B[str] we set `seen_classes[B[str]] = "B" as well. (See here.)

The marshmallow.fields.Nesteds get instantiated here. For each given type (e.g. B[int]):

  • The first time through, nested gets set to the result of calling _internal_class_schema. This is a Marshmallow schema class object [ed: class object not instance] — and is the correct schema for the given type. Then field_for_schema returns Nested(nested), which is correct.
  • The second time through for the same type, nested gets set to _RECURSION_GUARD.seen_classes[B[int]], which is the string "B". Then field_for_schema returns Nested("B"). Here's where the incorrect cached value is being returned — Nested looks up the schema from Marshmallow's class registry using the key "B".

Maybe the fix is to somehow cache the schema instance (rather than or in addition to schema class name) in the _RECURSION_GUARD thread-local. (Or maybe cache the Marshmallow field instance?)


Also of note, is these problem cases seem to work in older python versions. (Or at least behave differently.) I think this has to do with this use of repr(clazz) rather than clazz.__name__ to generate the seen_classes key for python < 3.10. Note B[int].__name__ is "B", while repr(B[int]) is something like "__main__.B[int]".

dairiki avatar Dec 20 '22 00:12 dairiki

I think this has to do with this use of repr(clazz) rather than clazz.__name__ to generate the seen_classes key for python < 3.10.

So the name under which a schema gets entered in Marshmallow's class_registry comes from the name specified when the schema class is constructed here. That is whatever "name" we picked up here, which is the same name that gets remembered in _RECURSION_GUARD.seen_classes. That "name" is, e.g., "B" in python >= 3.10 and "<module>.B[int]" in earlier pythons — the later (usually) works, since it includes the generic parameter types, the former doesn't.

So, just including the generic type parameters within the "name" string is enough to mostly fix the issue. I still think it's probably a better fix to actually cache the schema class for the type rather than just a "name" of that schema class.

@onursatici I can look at this further to see what a clean fix might be, but I likely won't have time to get to that until after New Year's sometime. (In the meantime, if you figure out something, have at it!)

dairiki avatar Dec 20 '22 02:12 dairiki

@dairiki Thanks for having a look! Yea caching the schema class makes sense to me. I can have a go at it, hopefully soon if I can make the time, will let you know if I can't so we don't do double work

onursatici avatar Dec 22 '22 01:12 onursatici

Currently, this PR works, I think, if the dataclass fields have types that are one of:

  • one of the generic type parameters
  • a generic dataclass (with generic args passed on)
  • a non-generic type

It does not appear to work when fields have other generic types, e.g.

T = TypeVar("T")

@dataclasses.dataclass
class Z(Generic[T]):
    z: List[T]

class_schema(Z[int])  # -> TypeError("T is not a dataclass and cannot be turned into one.")

Since we are already passing generic_params_to_args to field_for_schema, I suspect this is fixable (though I haven't tried to do so.)

eta: also, what about:

@dataclasses.dataclass
class ZZ(Generic[T]):
    z: List[Tuple[T, T]]
    t: Z[Tuple[T, T]]

dairiki avatar Jan 12 '23 01:01 dairiki

If generic parameters are not specified for a generic dataclass field types, should we assume Any (the same way we do for List and other built-in generics (in _generic_type_add_any))?

E.g., with the current state of this PR:

T = TypeVar("T")

@dataclasses.dataclass
class GenericDataclass(Generic[T]):
    x: T

@dataclasses.dataclass
class Bar:
    x: GenericDataclass[int]
    y: List  # (interpreted as List[Any])

class_schema(Bar) # this works

@dataclasses.dataclass
class Foo:
    fu: GenericDataclass

class_schema(Foo)  # throws => TypeError("T is not a dataclass and cannot be turned into one.")

dairiki avatar Jan 12 '23 17:01 dairiki

Maybe the fix is to somehow cache the schema instance (rather than or in addition to schema class name) in the _RECURSION_GUARD thread-local. (Or maybe cache the Marshmallow field instance?)

As the name (RECURSION_GUARD) suggests, we can't do exactly this, since the whole point here is to be able to deal with recursive schemas. At the time we create the Nested field, we have not necessarily finished compiled the schema for the nested type.

One way around this is that marshmallow.fields.Nested can accept a callable for it's nested parameter, thus allowing for passing deferred values.

I've forked your onursatici:os/generic-dataclasses branch and, I think fixed this issue here in dairiki:generic-dataclasses. In particular, see 024375ffcdf4517471eacb68a52ce24b7ce85cc1.

dairiki avatar Jan 12 '23 20:01 dairiki

Implements #230

dairiki avatar Jan 13 '23 15:01 dairiki

Thanks for having a look, it looks neat. I think I can have a look at this next week, and add the tests for generic types other than dataclasses. Or If you wish you can also take this PR over, both work for me

onursatici avatar Jan 21 '23 09:01 onursatici

@onursatici So I've been drawn down the rabbit-hole and have started a larger refactor of the marshmallow_code.

I've just created a draft PR for that (with a few notes) at #232.

The work there pulls in the work here, as well as loads of other stuff.

It also fixes things to work for

@dataclasses.dataclass
class ZZ(Generic[T]):
    z: List[Tuple[T, T]]
    t: Z[Tuple[T, T]]

as noted above by moving the resolution of TypeVars to field_for_schema.


If you're in a hurry to have support for generic dataclasses, I say we should fix this PR and merge it (I think it's almost there) rather than wait for #232. But if you're not in a big rush, let's wait.

dairiki avatar Jan 21 '23 23:01 dairiki

ah I see, yea that makes sense. Sure I can wait, thanks for doing this

onursatici avatar Jan 22 '23 09:01 onursatici

We would love for this to go in soon! Our use case is to have a generic paginated response shape that can be marshaled out to JSON

Something like:

import typing
from dataclasses import dataclass


Result = typing.TypeVar("Result")


@dataclass
class PaginatedResult(typing.Generic[Result]):
    results: list[Result]
    next_page_token: typing.Optional[str]

shooit avatar Jun 27 '23 16:06 shooit