dataclass-wizard Add UnionChecker to Optimize `Union or |` Type Deserialization for Common Python Types (e.g., `int`, `Enum`, `Literal`)

Add UnionChecker to Optimize `Union or |` Type Deserialization for Common Python Types (e.g., `int`, `Enum`, `Literal`)

Open rnag opened this issue 11 months ago • 0 comments

Adding my thoughts here in the form of pseudo-code. I think one approach can be to create a class UnionChecker or similar in loaders.py or a new module. It would be modeled similar to LoadMixIn and have a method for each common type that would be in a Union or | in a type annotation, such as int, float, enum, datetime, ... and so on.

Again, pseudo-code of how that could look:

class UnionChecker:

  def parse_int_if_valid(o: int | str | float) -> int | None:
    if type(o) is int: return o
    if type(o) is float: return int(round(o))
    if type(o) is str: return int(o) if o.removeprefix('-').isdigit() else None
    # catch-all return statement
    return None

  def parse_enum_if_valid(o: Any) -> Enum | None:
    if type(o) is Enum Class: return o
    return enum_class(o) if o is a valid value for Enum Class else None
    # catch-all return statement
    return None

Then, in parsers.py where we call Parser.__contains__ in UnionParser.__call__:

class UnionParser:
    ...
    def __call__(self, o: Any) -> Optional[T]:
        ...
        for parser in self.parsers:
   >        if o in parser:
                return parser(o)

We can replace that check with something like:

for parser in parsers:
  possible_value = parse_to_type_if_valid(o, base_type)
  if possible_value is not None:
    return possible_value

This should give a good compromise b/w efficiency and also fix the parsing support for Union, which is currently not fully supported in this library, and which I would like to address.

Some Benchmarks

Also, just in case someone tells me I'm worried about micro-optimizations which is possible, I made a small benchmark to test validating vs blind parsing for common types like Enum and int:

from enum import Enum
from timeit import timeit


class MyEnum(Enum):
    VAL_1 = 'ONE PLEASE'
    VAL_2 = 'TWO IS GREAT'
    VAL_3 = "THREE'S AWESOME"
    VAL_4 = 97


def is_enum1(o):
    try:
        _ = MyEnum(o)
        return True
    except ValueError:
        return False


def is_enum2(o):
    return o in MyEnum._value2member_map_


def is_int1(o):
    try:
        _ = int(o)
        return True
    except ValueError:
        return False


def is_int2(o: str):
    return o.removeprefix('-').isdigit()


def is_float1(o):
    try:
        _ = float(o)
        return True
    except ValueError:
        return False


def is_float2(o: str):
    return o.removeprefix('-').replace('.', '', 1).isdigit()


n = 1_000


print(timeit('is_enum1("TWO IS GREAT")', globals=globals(), number=n))
print(timeit('is_enum2("TWO IS GREAT")', globals=globals(), number=n))
print(timeit('is_float1("54.321")', globals=globals(), number=n))
print(timeit('is_float2("54.321")', globals=globals(), number=n))
print(timeit("is_int1('123')", globals=globals(), number=n))
print(timeit("is_int2('123')", globals=globals(), number=n))
# 0.00019675004296004772
# 4.5042019337415695e-05
# 6.279093213379383e-05
# 8.941697888076305e-05
# 8.983299994724803e-05
# 4.8417001380585134e-05
print()
print(timeit('is_enum1("OOPS!!")', globals=globals(), number=n))
print(timeit('is_enum2("OOPS!!")', globals=globals(), number=n))
print(timeit('is_float1("OOPS!!")', globals=globals(), number=n))
print(timeit('is_float2("OOPS!!")', globals=globals(), number=n))
print(timeit("is_int1('1.abc')", globals=globals(), number=n))
print(timeit("is_int2('1.abc')", globals=globals(), number=n))
# 0.0014168331399559975
# 3.700004890561104e-05
# 0.00046787504106760025
# 6.270897574722767e-05
# 0.000634250001894543
# 4.758400245918892e-05

assert is_enum1("OOPS!!") is is_enum2("OOPS!!") is False
assert is_enum1("TWO IS GREAT") is is_enum2("TWO IS GREAT") is True
assert is_float1("OOPS!!") is is_float2("OOPS!!") is False
assert is_float1("54.321") is is_float2("54.321") is True
assert is_int1('123') is is_int2('123') is True
assert is_int1('1.abc') is is_int2('1.abc') is False

Results show that validating values for types like Enum and int is an overall better approach. There are some types like float, for which it might be better to try parsing using float constructor, but overall validating the value/type first seems a solid approach to have.

Hope this helps - I currently don't have time to implement this myself, which is why I'm jotting my thoughts down, in case anyone wants to take a stab at it. I'm also going to update the issue with "help wanted" label. For now, I'm going to focus on addressing other issues with the Dataclass Wizard library, but I'll keep a tab on this issue and will hopefully be back to look at it with fresh eyes sometime soon.

Originally posted by @rnag in https://github.com/rnag/dataclass-wizard/issues/67#issuecomment-2455610700

Dec 08 '24 15:12 rnag

dataclass-wizard dataclass-wizard copied to clipboard

Add UnionChecker to Optimize `Union or |` Type Deserialization for Common Python Types (e.g., `int`, `Enum`, `Literal`)

Some Benchmarks

dataclass-wizard
dataclass-wizard copied to clipboard