duffel-api-python

Suggestion: use Pydantic for data parsing (and type validation)

Open cglacet opened this issue 2 years ago • 3 comments

Is your feature request related to a problem? Please describe. There is no problem involved with this issue; it's only a codebase suggestion.

Additional context: Maintaining data parsing and validation code by hand is painful and not much fun. A tool like Pydantic might save you some energy: for example, you wouldn't have to write a complete from_json for every class, only a way to parse the specific parts that have unusual rules.

First step: A first step could simply be to swap the standard library's @dataclass for Pydantic's (pydantic.dataclasses.dataclass), without adding any custom validation yet.

Pros/cons

Pros:

  • simpler parsing code -> no code in most cases (easier to maintain);
  • more complete (dynamic) type checking;
  • more standard way of doing type validation (which again makes it easier to maintain);
  • can easily generate a JSON Schema (which is what OpenAPI builds on), which can help propagate types to a frontend;
  • integrates nicely with your linter;
  • benefits from the great work of others: Pydantic is quite fast (V2 was just released and promises great performance improvements).

Cons:

  • you become dependent on Pydantic (but it's quite widely used, mainly thanks to FastAPI);
  • transition is not that easy (requires strong tests to make sure nothing breaks, but regression testing based on output types should suffice).
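To illustrate the schema-generation point from the pros, here is a minimal sketch assuming Pydantic V2. The CitySketch and AirportSketch classes are cut-down, hypothetical stand-ins, not the library's real models.

```python
from typing import Optional

from pydantic import BaseModel


class CitySketch(BaseModel):
    # Hypothetical, reduced version of a city model
    id: str
    name: str


class AirportSketch(BaseModel):
    # Hypothetical, reduced version of an airport model
    id: str
    latitude: float
    city: Optional[CitySketch] = None


# model_json_schema() returns a plain dict describing the model as JSON Schema
schema = AirportSketch.model_json_schema()
print(sorted(schema["properties"]))  # ['city', 'id', 'latitude']
print(schema["properties"]["latitude"]["type"])  # number
```

The resulting dict can be dumped with json.dumps and handed to whatever tooling generates frontend types.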

Airport example

As an example, take the Airport data class, which contains a nested City data class, and compare a Pydantic implementation with the current one.

Current implementation

Here is the code you have now (it is fully working as is, which is why I kept the get_and_transform function in here):

from typing import Optional
from dataclasses import dataclass

@dataclass
class City:
    id: str
    name: str
    iata_code: str
    iata_country_code: str

    @classmethod
    def from_json(cls, json: dict):
        return cls(
            id=json["id"],
            name=json["name"],
            iata_code=json["iata_code"],
            iata_country_code=json["iata_country_code"],
        )


@dataclass
class Airport:
    id: str
    name: str
    iata_code: Optional[str]
    icao_code: Optional[str]
    iata_country_code: str
    latitude: float
    longitude: float
    time_zone: str
    city: Optional[City]

    @classmethod
    def from_json(cls, json: dict):
        return cls(
            id=json["id"],
            name=json["name"],
            iata_code=json.get("iata_code"),
            icao_code=json.get("icao_code"),
            iata_country_code=json["iata_country_code"],
            latitude=json["latitude"],
            longitude=json["longitude"],
            time_zone=json["time_zone"],
            city=get_and_transform(json, "city", City.from_json),
        )
    

def get_and_transform(dict: dict, key: str, fn, default=None):
    try:
        value = dict[key]
        if value is None:
            return value
        else:
            return fn(value)
    except KeyError:
        return default

And here is how it is called:

>>> Airport.from_json(airport_json)
Airport(id='arp_swf_us', name='New York Stewart International Airport', iata_code='SWF', icao_code='KSWF', iata_country_code='US', latitude=41.501292, longitude=-74.102724, time_zone='America/New_York', city=City(id='cit_nyc_us', name='New York', iata_code='NYC', iata_country_code='US'))
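The airport_json input is not shown in the issue. A payload consistent with the repr above could look like the following; the values are copied from that output, but the exact shape of the real API response is an assumption.

```python
# Hypothetical payload reconstructed from the repr printed above;
# the real API response may contain additional keys.
airport_json = {
    "id": "arp_swf_us",
    "name": "New York Stewart International Airport",
    "iata_code": "SWF",
    "icao_code": "KSWF",
    "iata_country_code": "US",
    "latitude": 41.501292,
    "longitude": -74.102724,
    "time_zone": "America/New_York",
    "city": {
        "id": "cit_nyc_us",
        "name": "New York",
        "iata_code": "NYC",
        "iata_country_code": "US",
    },
}
```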

Pydantic version

from typing import Optional
from pydantic import BaseModel

class PydanticCity(BaseModel):
    id: str
    name: str
    iata_code: str
    iata_country_code: str


class PydanticAirport(BaseModel):
    id: str
    name: str
    iata_code: Optional[str]
    icao_code: Optional[str]
    iata_country_code: str
    latitude: float
    longitude: float
    time_zone: str
    city: Optional[PydanticCity]

And here is how it would be called (using the BaseModel.model_validate method):

>>> PydanticAirport.model_validate(airport_json)
PydanticAirport(id='arp_swf_us', name='New York Stewart International Airport', iata_code='SWF', icao_code='KSWF', iata_country_code='US', latitude=41.501292, longitude=-74.102724, time_zone='America/New_York', city=PydanticCity(id='cit_nyc_us', name='New York', iata_code='NYC', iata_country_code='US'))
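For the "unusual rules" case mentioned in the additional context, Pydantic V2 lets you attach a validator to a single field instead of writing a whole from_json. The sketch below is only an illustration: the AirportWithRule class and the rule itself (treating an empty string as a missing code) are hypothetical, not something the API is known to require.

```python
from typing import Optional

from pydantic import BaseModel, field_validator


class AirportWithRule(BaseModel):
    id: str
    # A default of None makes the key optional, like json.get() does today
    iata_code: Optional[str] = None

    @field_validator("iata_code", mode="before")
    @classmethod
    def blank_to_none(cls, value):
        # Hypothetical rule: treat an empty string as "no code"
        if value == "":
            return None
        return value


blank = AirportWithRule.model_validate({"id": "arp_swf_us", "iata_code": ""})
missing = AirportWithRule.model_validate({"id": "arp_swf_us"})
print(blank.iata_code, missing.iata_code)  # None None
```

Note that in Pydantic V2 a field annotated Optional[str] with no default is still required (the value may be None, but the key must be present), so the sketch adds = None to mirror the json.get behaviour of the current from_json.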

Stats for the geeks

The performance of the two approaches is as follows:

>>> %timeit Airport.from_json(airport_json)
861 ns ± 11.2 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

>>> %timeit PydanticAirport.model_validate(airport_json)
1.79 µs ± 16.5 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

Note that Pydantic performs a complete validation (each field is type checked), whereas your current code only parses the input data. The comparison is not meant to be fair; I just wanted to highlight that there isn't a huge performance difference between the two (in this measurement Pydantic takes roughly twice as long as your current implementation, 1.79 µs versus 861 ns).
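The figures above come from IPython's %timeit. For anyone who wants to re-run a comparison outside IPython, a rough sketch with the standard timeit module might look like this. PlainCity and ModelCity are hypothetical, reduced classes, so the absolute numbers will differ from those quoted above and will vary by machine.

```python
import timeit
from dataclasses import dataclass

from pydantic import BaseModel


@dataclass
class PlainCity:
    id: str
    name: str

    @classmethod
    def from_json(cls, json: dict):
        # Hand-written parsing, no type checking
        return cls(id=json["id"], name=json["name"])


class ModelCity(BaseModel):
    id: str
    name: str


city_json = {"id": "cit_nyc_us", "name": "New York"}
n = 100_000  # number of calls per measurement

plain = timeit.timeit(lambda: PlainCity.from_json(city_json), number=n)
model = timeit.timeit(lambda: ModelCity.model_validate(city_json), number=n)

print(f"dataclass from_json:     {plain / n * 1e9:.0f} ns per call")
print(f"pydantic model_validate: {model / n * 1e9:.0f} ns per call")
```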

Type errors

>>> airport_json["iata_country_code"] = 12
>>> Airport.from_json(airport_json)
# no error
>>> PydanticAirport.model_validate(airport_json)
ValidationError: 1 validation error for PydanticAirport
iata_country_code
  Input should be a valid string [type=string_type, input_value=12, input_type=int]
    For further information visit https://errors.pydantic.dev/2.0.3/v/string_type
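In calling code, the error shown above can be caught and inspected programmatically. A minimal sketch, assuming Pydantic V2 and a hypothetical one-field model:

```python
from pydantic import BaseModel, ValidationError


class CountryCodeSketch(BaseModel):
    # Hypothetical one-field model, just to trigger the same error type
    iata_country_code: str


errors = []
try:
    CountryCodeSketch.model_validate({"iata_country_code": 12})
except ValidationError as exc:
    # exc.errors() returns a list of dicts, one per failing field
    errors = exc.errors()

for error in errors:
    print(error["loc"], error["type"], error["msg"])
```

Each entry carries the field location, a machine-readable error type and a human-readable message, which makes it straightforward to log or re-raise as a library-specific exception.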

Pydantic dataclasses

from typing import Optional
from pydantic.dataclasses import dataclass


@dataclass
class PydanticCity:
    id: str
    name: str
    iata_code: str
    iata_country_code: str


@dataclass
class PydanticAirport:
    id: str
    name: str
    iata_code: Optional[str]
    icao_code: Optional[str]
    iata_country_code: str
    latitude: float
    longitude: float
    time_zone: str
    city: Optional[PydanticCity]

From Pydantic's dataclasses documentation:

Keep in mind that pydantic.dataclasses.dataclass is not a replacement for pydantic.BaseModel. pydantic.dataclasses.dataclass provides a similar functionality to dataclasses.dataclass with the addition of Pydantic validation. There are cases where subclassing pydantic.BaseModel is the better choice.

For more information and discussion see pydantic/pydantic#710.
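Pydantic dataclasses have no model_validate method, so parsing a JSON dict looks slightly different. The sketch below, assuming Pydantic V2, shows two ways to do it; the CitySketch and AirportSketch classes and the payload are hypothetical, reduced examples.

```python
from typing import Optional

from pydantic import TypeAdapter
from pydantic.dataclasses import dataclass


@dataclass
class CitySketch:
    id: str
    name: str


@dataclass
class AirportSketch:
    id: str
    city: Optional[CitySketch] = None


payload = {"id": "arp_swf_us", "city": {"id": "cit_nyc_us", "name": "New York"}}

# Option 1: unpack the dict into the constructor; the nested dict is
# validated and converted to a CitySketch instance.
airport = AirportSketch(**payload)

# Option 2: validate the dict as a whole through a TypeAdapter.
airport_via_adapter = TypeAdapter(AirportSketch).validate_python(payload)

print(airport.city.name)  # New York
print(airport == airport_via_adapter)  # True
```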

Disclaimer

I'm not a maintainer of Pydantic, nor do I have any sort of involvement in it (I don't think I've ever even raised an issue there). I just like it.

cglacet • Jul 20 '23 11:07