dataclass-wizard
dataclass-wizard copied to clipboard
Add UnionChecker to Optimize `Union or |` Type Deserialization for Common Python Types (e.g., `int`, `Enum`, `Literal`)
Adding my thoughts here in the form of pseudo-code. I think one approach can be to create a class UnionChecker or similar in loaders.py or a new module. It would be modeled similar to LoadMixIn and have a method for each common type that would be in a Union or | in a type annotation, such as int, float, enum, datetime, ... and so on.
Again, pseudo-code of how that could look:
class UnionChecker:
def parse_int_if_valid(o: int | str | float) -> int | None:
if type(o) is int: return o
if type(o) is float: return int(round(o))
if type(o) is str: return int(o) if o.removeprefix('-').isdigit() else None
# catch-all return statement
return None
def parse_enum_if_valid(o: Any) -> Enum | None:
if type(o) is Enum Class: return o
return enum_class(o) if o is a valid value for Enum Class else None
# catch-all return statement
return None
Then, in parsers.py where we call Parser.__contains__ in UnionParser.__call__:
class UnionParser:
...
def __call__(self, o: Any) -> Optional[T]:
...
for parser in self.parsers:
> if o in parser:
return parser(o)
We can replace that check with something like:
for parser in parsers:
possible_value = parse_to_type_if_valid(o, base_type)
if possible_value is not None:
return possible_value
This should give a good compromise b/w efficiency and also fix the parsing support for Union, which is currently not fully supported in this library, and which I would like to address.
Some Benchmarks
Also, just in case someone tells me I'm worried about micro-optimizations which is possible, I made a small benchmark to test validating vs blind parsing for common types like Enum and int:
from enum import Enum
from timeit import timeit
class MyEnum(Enum):
VAL_1 = 'ONE PLEASE'
VAL_2 = 'TWO IS GREAT'
VAL_3 = "THREE'S AWESOME"
VAL_4 = 97
def is_enum1(o):
try:
_ = MyEnum(o)
return True
except ValueError:
return False
def is_enum2(o):
return o in MyEnum._value2member_map_
def is_int1(o):
try:
_ = int(o)
return True
except ValueError:
return False
def is_int2(o: str):
return o.removeprefix('-').isdigit()
def is_float1(o):
try:
_ = float(o)
return True
except ValueError:
return False
def is_float2(o: str):
return o.removeprefix('-').replace('.', '', 1).isdigit()
n = 1_000
print(timeit('is_enum1("TWO IS GREAT")', globals=globals(), number=n))
print(timeit('is_enum2("TWO IS GREAT")', globals=globals(), number=n))
print(timeit('is_float1("54.321")', globals=globals(), number=n))
print(timeit('is_float2("54.321")', globals=globals(), number=n))
print(timeit("is_int1('123')", globals=globals(), number=n))
print(timeit("is_int2('123')", globals=globals(), number=n))
# 0.00019675004296004772
# 4.5042019337415695e-05
# 6.279093213379383e-05
# 8.941697888076305e-05
# 8.983299994724803e-05
# 4.8417001380585134e-05
print()
print(timeit('is_enum1("OOPS!!")', globals=globals(), number=n))
print(timeit('is_enum2("OOPS!!")', globals=globals(), number=n))
print(timeit('is_float1("OOPS!!")', globals=globals(), number=n))
print(timeit('is_float2("OOPS!!")', globals=globals(), number=n))
print(timeit("is_int1('1.abc')", globals=globals(), number=n))
print(timeit("is_int2('1.abc')", globals=globals(), number=n))
# 0.0014168331399559975
# 3.700004890561104e-05
# 0.00046787504106760025
# 6.270897574722767e-05
# 0.000634250001894543
# 4.758400245918892e-05
assert is_enum1("OOPS!!") is is_enum2("OOPS!!") is False
assert is_enum1("TWO IS GREAT") is is_enum2("TWO IS GREAT") is True
assert is_float1("OOPS!!") is is_float2("OOPS!!") is False
assert is_float1("54.321") is is_float2("54.321") is True
assert is_int1('123') is is_int2('123') is True
assert is_int1('1.abc') is is_int2('1.abc') is False
Results show that validating values for types like Enum and int is an overall better approach. There are some types like float, for which it might be better to try parsing using float constructor, but overall validating the value/type first seems a solid approach to have.
Hope this helps - I currently don't have time to implement this myself, which is why I'm jotting my thoughts down, in case anyone wants to take a stab at it. I'm also going to update the issue with "help wanted" label. For now, I'm going to focus on addressing other issues with the Dataclass Wizard library, but I'll keep a tab on this issue and will hopefully be back to look at it with fresh eyes sometime soon.
Originally posted by @rnag in https://github.com/rnag/dataclass-wizard/issues/67#issuecomment-2455610700