marshmallow_dataclass icon indicating copy to clipboard operation
marshmallow_dataclass copied to clipboard

When type hint is not recognized (e.g. `pd.Timestamp`), we recursively try to convert into a dataclass which mutates the constructor throughout the codebase

Open evanfwelch opened this issue 5 years ago • 11 comments

Reproducible error:

import pandas as pd
import dataclasses

import marshmallow_dataclass

@dataclasses.dataclass
class Something:
    date_i_care_about: pd.Timestamp

print(f"here's a timestamp {pd.Timestamp('2017-01-01')}")

# this mutates the constructor for pd.Timestamp by recursively trying to convert to dataclass
marshmallow_dataclass.class_schema(Something)

print(f"here's a timestamp {pd.Timestamp('2017-01-01')}")

versions: pandas 0.23.4 marshmallow 2.x marshmallow_dataclass: 0.6.6

@lovasoa what is the intution behind trying to convert clazz to dataclass here after we fail to get the fields?

https://github.com/lovasoa/marshmallow_dataclass/blob/master/marshmallow_dataclass/init.py#L281

I feel that an error (or some kind of strict mode) would be more helpful

evanfwelch avatar Nov 06 '19 20:11 evanfwelch

marshmallow_dataclass allows you to have simple classes, like :

class A:
    a: int

class B:
    b: A

marshmallow_dataclass.class_schema(B)

In order to create a schema for B, it also has to create a schema for A. And in order to apply dataclasses.fields to A, it has to make it a dataclass: https://github.com/lovasoa/marshmallow_dataclass/blob/master/marshmallow_dataclass/init.py#L281

I don't think it's possible to fix this issue without breaking backward compatibility... What behavior would you expect for your example above ?

lovasoa avatar Nov 07 '19 08:11 lovasoa

If you want a custom type in your class, the best way is to use a NewType declaration

lovasoa avatar Nov 07 '19 11:11 lovasoa

Ok I see your point. In my (and my colleagues) use case, we would already have made both A and B a dataclass (because we want all the benefits of dataclasses in our application code), and then when it comes to serializing an instance of B, we'd generate a marshmallow schema by doing:

marshmallow_dataclass.class_schema(B).dump(...)

However I guess you've primarily designed for the case where folks are using the marshmallow_dataclass.dataclass decorator to do both at once....

What if we modified the logic for checking whether B should be converted into dataclass to check if it has already has an __init__ defined. I can't think of many cases where someone has a class with a constructor and wants to make it a dataclass

evanfwelch avatar Nov 07 '19 16:11 evanfwelch

I think that would be acceptable...

lovasoa avatar Nov 07 '19 16:11 lovasoa

In order to avoid breaking compatibility, I am simply adding a new warning that will be displayed before marshmallow_dataclass tries to alter a class :

UserWarning: marshmallow_dataclass was called on the class Path, which is not a dataclass. It is going to try and convert the class into a dataclass, which may have undesirable side effects. To avoid this message, make sure all your classes and all the classes of their fields are either explicitly supported by marshmallow_datcalass, or are already dataclasses. For more information, see https://github.com/lovasoa/marshmallow_dataclass/issues/51

To anyone who would end up here after seeing this warning: to fix this , you should make sure that all the fields of all your classes are either simple base types, or dataclasses themselves.

For instance, the following is good :

import marshmallow_dataclass
import dataclasses

@dataclasses.dataclass
class A:
    a: int

class B:
    b: A

marshmallow_dataclass.class_schema(B)

and the following is bad :

import marshmallow_dataclass

class A:
    a: int

class B:
    b: A

marshmallow_dataclass.class_schema(B)

and this is the worse, because marshmallow_dataclass will change the Path class, and break it globally in your whole application :

from pathlib import Path

class B:
    b: Path

marshmallow_dataclass.class_schema(B)

lovasoa avatar Oct 10 '20 14:10 lovasoa

Hi all, I think this is quite a serious side effect, for example, it breaks JSON encoding of classes that are handled by a custom JSON encoder, because they are now considered to be dataclasses so they are converted natively.

Can you add an even bigger warning? I didn't actually notice it because it's printed with other messages at flask startup. Also, the warning could contain a suggestion to fix it by adding a marshmallow_field to the field metadata, because that's ultimately the only way to support fields with custom non-dataclass classes, right?

prihoda avatar Mar 30 '21 09:03 prihoda

@prihoda : Could you make a PR with the changes you want to see on the message ? We'll discuss the changes there.

lovasoa avatar Mar 30 '21 10:03 lovasoa

I got burned by this one trying to use pathlib.Path as a type. Seeing as pathlib is seeing increased usage is there any way to have this type treated correctly? As a workaround I can make a base class like this:

class _HasPaths:
    """Workaround to ensure we get ``pathlib.Path`` types where needed."""

    def __post_init__(self) -> None:
        for fld in fields(self):

            if fld.type == Path:
                attr = getattr(self, fld.name)
                setattr(self, fld.name, Path(attr))

and have any dataclass that uses Path as a type inherit from it but this is less than ideal when pathlib.Path should probably be a type that's handled by default.

The warnings introduced in #130 were helpful in figuring this out but for several hours I was confounded by a bunch of pytest INTERNALERROR messages that were pretty impenetrable.

mivade avatar May 26 '21 15:05 mivade

I think this is a request for marshmallow itself.

lovasoa avatar May 26 '21 17:05 lovasoa

I think this auto-dataclassification is too much magic, and likely to create hard-to-suss bugs.

Note that the dataclass decorator does not construct and return a new class, rather, it munges the decorated class in-place. So current behavior results in mangling other peoples code in a hidden, fairly magic, manner.

Here's a possible fix.

====

I think we can handle "simple annotated classes"(*) as in the above example without converting them to dataclasses.

For the sake of concreteness, let's say "simple annotated classes" are defined as objects, cls, for which:

  1. isinstance(cls, type)
  2. cls.__init__ is object.__init__
  3. cls.__new__ is object.__new__
  4. typing.get_type_hints(cls) returns non-empty

(This definition could perhaps be tweaked/widened a bit.)

Our class_schema currently uses the fact that its argument is a dataclass for the following purposes.

  1. We currently use dataclasses.fields to get at:

    a. field names — for "simple classes" we can get these from typing.get_type_hints instead b. field default values — for "simple classes", the value of the class variable (if any) will do c. field metadata — we can safely assume there is no metadata for simple classes

  2. We use the dataclasses constructor to construct the return value in the load method of our generated schemas. Instead, for "simple classes", we can construct them manually: create an instance by calling their __new__, then manually initialize its attributes.

So the proposed fix is to change class_schema so that:

  1. If the argument is a dataclass, do like we do now.
  2. Otherwise, if the argument is a "simple annotated class", construct a schema using the alternate methods described above (notably, without mangling the argument into a dataclass).
  3. Otherwise, raise a TypeError("do not know how to construct schema for X")

dairiki avatar Jan 15 '23 17:01 dairiki

  1. Otherwise, raise a TypeError("do not know how to construct schema for X")

I agree that the library should not be optimistically trying to create dataclasses out of types it does not understand. I didn't even realise the library could do that until today. Raise an error and let the user wrap with @dataclass themselves - it's so trivial.

Today I lost several hours to Path getting clobbered and not getting the warning on the console because of reasons.

I couldn't even debug it properly because the debugger was crashing trying to use the clobbered type to show values.

Maybe to keep backwards compatibility there could be some kind of strict mode flag / settable module option?

phyrwork avatar Oct 16 '23 19:10 phyrwork