`schema` -> `msgspec.Struct` / `typing.TypedDict` converts `List[Any]` to `List`

Open jacopoabramo opened this issue 9 months ago • 2 comments

I'm currently generating a series of msgspec.Struct classes from JSON schemas that were themselves generated from custom-written pydantic models.

An example of such models is the following:

from typing import Any, Dict, List

from pydantic import BaseModel, ConfigDict, Field, RootModel
from typing_extensions import Annotated

class DataFrameForDatumPage(RootModel):
    root: List[str] = Field(alias="Dataframe")


class DatumPage(BaseModel):
    """Page of documents to reference a quanta of externally-stored data"""

    model_config = ConfigDict(extra="forbid")

    datum_id: Annotated[
        DataFrameForDatumPage,
        Field(
            description="Array unique identifiers for each Datum (akin to 'uid' for "
            "other Document types), typically formatted as '<resource>/<integer>'"
        ),
    ]
    datum_kwargs: Annotated[
        Dict[str, List[Any]],
        Field(
            description="Array of arguments to pass to the Handler to "
            "retrieve one quanta of data"
        ),
    ]
    resource: Annotated[
        str,
        Field(
            description="The UID of the Resource to which all Datums in the page belong"
        ),
    ]

The schema -> struct/typed_dict pipeline is the following:

import json
from pathlib import Path

import datamodel_code_generator as generator

# just a file I locally created with the model described above
from datamodel_sandbox.pydantic_model import DatumPage

with open("schema.json", "w") as f:
    # model_json_schema() takes no indent argument; indentation is handled by json.dumps
    f.write(json.dumps(DatumPage.model_json_schema(), indent=4))


output_paths = [
    Path("msgspec_struct.py"),
    Path("typed_dict.py"),
]

output_model_types = [
    generator.DataModelType.MsgspecStruct,
    generator.DataModelType.TypingTypedDict,
]


for output_path, output_model_type in zip(output_paths, output_model_types):
    generator.generate(
        input_=Path("schema.json"),
        input_file_type=generator.InputFileType.JsonSchema,
        output=output_path,
        output_model_type=output_model_type,
        target_python_version=generator.PythonVersion.PY_38,
        use_schema_description=True,
        use_field_description=True,
        use_annotated=True,
        field_constraints=True,
        wrap_string_literal=True,
        use_double_quotes=True,
        disable_timestamp=True,
    )

The problem is that in the generated Struct/TypedDict, the type of DatumPage.datum_kwargs is Dict[str, List], whereas it should be Dict[str, List[Any]].

The generated schema is the following ...

{
    "$defs": {
        "DataFrameForDatumPage": {
            "items": {
                "type": "string"
            },
            "title": "DataFrameForDatumPage",
            "type": "array"
        }
    },
    "additionalProperties": false,
    "description": "Page of documents to reference a quanta of externally-stored data",
    "properties": {
        "datum_id": {
            "$ref": "#/$defs/DataFrameForDatumPage",
            "description": "Array unique identifiers for each Datum (akin to 'uid' for other Document types), typically formatted as '<resource>/<integer>'"
        },
        "datum_kwargs": {
            "additionalProperties": {
                "items": {},
                "type": "array"
            },
            "description": "Array of arguments to pass to the Handler to retrieve one quanta of data",
            "title": "Datum Kwargs",
            "type": "object"
        },
        "resource": {
            "description": "The UID of the Resource to which all Datums in the page belong",
            "title": "Resource",
            "type": "string"
        }
    },
    "required": [
        "datum_id",
        "datum_kwargs",
        "resource"
    ],
    "title": "DatumPage",
    "type": "object"
}

... the generated msgspec.Struct ...

# generated by datamodel-codegen:
#   filename:  schema.json

from __future__ import annotations

from typing import Dict, List

from msgspec import Meta, Struct
from typing_extensions import Annotated

DataFrameForDatumPage = Annotated[List[str], Meta(title="DataFrameForDatumPage")]


class DatumPage(Struct):
    """
    Page of documents to reference a quanta of externally-stored data
    """

    datum_id: Annotated[
        DataFrameForDatumPage,
        Meta(
            description=(
                "Array unique identifiers for each Datum (akin to 'uid' for other"
                " Document types), typically formatted as '<resource>/<integer>'"
            )
        ),
    ]
    """
    Array unique identifiers for each Datum (akin to 'uid' for other Document types), typically formatted as '<resource>/<integer>'
    """
    datum_kwargs: Annotated[
        Dict[str, List], # this should be Dict[str, List[Any]]
        Meta(
            description=(
                "Array of arguments to pass to the Handler to retrieve one quanta of"
                " data"
            ),
            title="Datum Kwargs",
        ),
    ]
    """
    Array of arguments to pass to the Handler to retrieve one quanta of data
    """
    resource: Annotated[
        str,
        Meta(
            description=(
                "The UID of the Resource to which all Datums in the page belong"
            ),
            title="Resource",
        ),
    ]
    """
    The UID of the Resource to which all Datums in the page belong
    """

... and typing.TypedDict equivalent.

# generated by datamodel-codegen:
#   filename:  schema.json

from __future__ import annotations

from typing import Dict, List, TypedDict

DataFrameForDatumPage = List[str]


class DatumPage(TypedDict):
    """
    Page of documents to reference a quanta of externally-stored data
    """

    datum_id: DataFrameForDatumPage
    """
    Array unique identifiers for each Datum (akin to 'uid' for other Document types), typically formatted as '<resource>/<integer>'
    """
    datum_kwargs: Dict[str, List] # this should be Dict[str, List[Any]]
    """
    Array of arguments to pass to the Handler to retrieve one quanta of data
    """
    resource: str
    """
    The UID of the Resource to which all Datums in the page belong
    """

I don't know whether I'm missing an option in the generation script, whether this is intended behavior, or whether something is wrong with the pydantic.BaseModel used as the reference to generate the schema.
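For reference, an empty JSON Schema (`{}`) accepts any value, so the `"items": {}` entry in the schema above already encodes "list of anything". The sketch below shows the mapping one would expect from that; `annotation_for` is a hypothetical helper covering only the keywords used in this issue, not the generator's actual implementation.

```python
from typing import Any, Dict


def annotation_for(schema: Dict[str, Any]) -> str:
    """Hypothetical helper: map a tiny JSON Schema subset to an annotation string."""
    if not schema:
        # An empty schema places no constraint on the value.
        return "Any"
    kind = schema.get("type")
    if kind == "string":
        return "str"
    if kind == "array":
        # A missing or empty "items" means the elements are unconstrained.
        return f"List[{annotation_for(schema.get('items', {}))}]"
    if kind == "object":
        extra = schema.get("additionalProperties", {})
        if isinstance(extra, bool):
            # Simplification for this sketch: boolean forms are treated as Any.
            return "Dict[str, Any]"
        return f"Dict[str, {annotation_for(extra)}]"
    return "Any"


# The "datum_kwargs" property from the schema above, minus title/description.
datum_kwargs = {
    "additionalProperties": {"items": {}, "type": "array"},
    "type": "object",
}
print(annotation_for(datum_kwargs))  # Dict[str, List[Any]]
```

Under this reading, the schema produced by pydantic looks like a faithful encoding of Dict[str, List[Any]], which suggests the Any is dropped on the code generation side.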

EDIT: added a more detailed description following the comment from @evalott100

jacopoabramo, Feb 14 '25 13:02

DatumPage.datum_kwargs type is Dict[str, List], while it should instead be Dict[str, List[Any]]. I don't exactly know if I'm missing an option in the generation script or if it is an intended behavior.

Just to add a little context: we generate a series of TypedDicts from JSON Schema (and maybe soon msgspec Structs too).

We'd like these to pass mypy type-checking with --strict, which a bare x: List (or List[Unknown]) fails.
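For context, mypy's --strict enables --disallow-any-generics, which rejects generic types written without type parameters. mypy does this statically; the snippet below is only a rough runtime approximation of the same check, using a hypothetical `has_bare_generic` helper that looks for unparametrized `typing.List` / `typing.Dict` anywhere inside an annotation.

```python
from typing import Any, Dict, List, get_args


def has_bare_generic(tp: object) -> bool:
    """Return True if tp, or anything nested in it, is a bare List or Dict."""
    if tp is List or tp is Dict:
        return True
    # Recurse into the type arguments, e.g. (str, List) for Dict[str, List].
    return any(has_bare_generic(arg) for arg in get_args(tp))


print(has_bare_generic(Dict[str, List]))       # True: the inner List is bare
print(has_bare_generic(Dict[str, List[Any]]))  # False: every generic is parametrized
```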

It'd be very handy if there were an option that automatically used List[Any]/Dict[Any, Any] for converted fields whose schema places no constraints on the object/array elements.
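Until such an option exists, one possible stopgap is to post-process the generated file. The following is a naive sketch, not a feature of datamodel-code-generator: both function names are hypothetical, the rewrite is purely textual, and it assumes the generated code imports from `typing` on a single `from typing import ...` line. It would also rewrite a bare "List" or "Dict" that happens to appear inside a docstring.

```python
import re

# Match List or Dict as whole words that are not followed by "[".
_BARE_GENERIC = re.compile(r"\b(List|Dict)\b(?!\[)")
_EXPLICIT = {"List": "List[Any]", "Dict": "Dict[Any, Any]"}


def parametrize_bare_generics(text: str) -> str:
    """Rewrite bare List/Dict as List[Any]/Dict[Any, Any]."""
    return _BARE_GENERIC.sub(lambda m: _EXPLICIT[m.group(1)], text)


def patch_generated_source(source: str) -> str:
    """Apply the rewrite line by line, leaving imports and comment lines alone."""
    patched_lines = []
    for line in source.split("\n"):
        if line.lstrip().startswith(("from ", "import ", "#")):
            patched_lines.append(line)
        else:
            patched_lines.append(parametrize_bare_generics(line))
    patched = "\n".join(patched_lines)
    # Naive import fix-up: make Any importable if the rewrite introduced it.
    needs_any = patched != source and not re.search(
        r"^from typing import .*\bAny\b", patched, re.M
    )
    if needs_any:
        patched = patched.replace("from typing import ", "from typing import Any, ", 1)
    return patched


generated = (
    "from typing import Dict, List, TypedDict\n"
    "\n"
    "class DatumPage(TypedDict):\n"
    "    datum_kwargs: Dict[str, List]\n"
    "    resource: str"
)
print(patch_generated_source(generated))
```

For the sample input this prints the same module with `Any` added to the import and `datum_kwargs: Dict[str, List[Any]]`. A more robust version would work on the parsed AST rather than on text.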

evvaaaa, Feb 14 '25 13:02

This is still pending. Since there's been no reply yet, I may try to find some time to look into it myself. Could someone point me to where to start searching for the problem?

jacopoabramo, Jun 05 '25 14:06