datamodel-code-generator
Split output in individual .py files
Describe the solution you'd like I would like an option to output one `.py` file per model, instead of having them all in the same file. Ideally with an `__init__.py` file that imports them all.
Describe alternatives you've considered Post-generation treatment of the file (somehow) to split it, but that looks complex
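A hedged sketch of what the requested layout could look like: one module per model, plus an `__init__.py` re-exporting them. The package is built in a temp directory purely to demonstrate the import ergonomics; all module and class names here are invented.

```python
# Hypothetical layout: one module per model, plus an __init__.py that
# re-exports them all. Built in a temp dir only to show the imports work;
# module and class names are made up for illustration.
import sys
import tempfile
from pathlib import Path

pkg = Path(tempfile.mkdtemp()) / "models"
pkg.mkdir()
(pkg / "repository.py").write_text("class Repository: ...\n")
(pkg / "organization.py").write_text("class Organization: ...\n")
(pkg / "__init__.py").write_text(
    "from .repository import Repository\n"
    "from .organization import Organization\n"
)

# Make the temp package importable, then import as a consumer would.
sys.path.insert(0, str(pkg.parent))
from models import Repository, Organization  # noqa: E402

print(Repository.__name__, Organization.__name__)  # -> Repository Organization
```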
Second this.
For example, the GitHub API responses surface is absolutely enormous and very fragmented (repository is one of the core concepts of course, but different endpoints return slightly different versions):
$ python --version
Python 3.11.2
$ datamodel-codegen --version
0.21.4
$ datamodel-codegen --openapi-scope paths --url https://raw.githubusercontent.com/github/rest-api-description/v2.1.0/descriptions/api.github.com/dereferenced/api.github.com.deref.json > api.py
[warnings redacted]
$ wc -l api.py
94887 api.py
$ grep -P 'class Repository\d*\(BaseModel\)' api.py | wc -l
90
So we have Repository all the way through to Repository89. That is sadly too unwieldy. Having this many duplicates or near-duplicates would be fine if they were namespaced.
One workaround I am in the process of exploring is something along the lines of:
jq '.paths["/orgs/{org}"]["get"]["responses"]["200"]["content"]["application/json"]["schema"]' api.github.com.2022-11-28.deref.json > org.json
For this, first download the GitHub REST API OpenAPI description.
Now org.json contains a workable subset. If we mirror that to the local file system, replacing variables like {org} with, say ORG, we could get a workable solution:
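The `{org}` -> `ORG` replacement mentioned above can be sketched with a one-line regex substitution (the same idea the full script later in this thread uses):

```python
import re

# Uppercase each brace-enclosed path variable and strip the braces,
# turning an OpenAPI path into something filesystem/import-friendly.
path = "/orgs/{org}/repos"
sanitized = re.sub(r"\{(.*?)\}", lambda m: m.group(1).upper(), path)
print(sanitized)  # -> /orgs/ORG/repos
```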
$ datamodel-codegen --input org.json --output src/github/api/orgs/ORG/__init__.py
Now, inside the src Python package, one can issue:
from src.github.api.orgs.ORG import OrganizationFull as Organization
So a workaround can look like this (I haven't implemented it yet):
- Fetch the full OpenAPI spec from GitHub (41 MB)
- for all paths (view with `jq '.paths | keys' api.json`), extract its schema (if no `get` and `200` keys are available, skip it, I guess, if we're only querying)
- save it as `schema.json` to some new path mirroring the API one (which is nice as it makes it easy to match with the docs, and makes it google-able); make it filesystem- and, more importantly, Python-import-friendly (remove `{` etc.)
- run `find . -name 'schema.json' -print0 | xargs --null -I '~' datamodel-codegen --input '~' --output 'parent_dir(~)/__init__.py'` (`parent_dir` is pseudo-code; I didn't have the patience to figure this out; why is bash so hard here, file paths are its bread and butter...)
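The `parent_dir` pseudo-code can be realized in plain bash with `dirname`; a dry-run sketch (echoing the commands instead of executing them, against a made-up directory layout):

```shell
#!/usr/bin/env bash
set -o errexit -o nounset -o pipefail

# Made-up directory layout, for demonstration only.
demo=$(mktemp -d)
mkdir -p "$demo/orgs/ORG"
touch "$demo/orgs/ORG/schema.json"

# "parent_dir(~)" is just dirname. A while-read loop (instead of xargs) lets
# the command substitution run once per file. Dry run: echo the commands.
out=$(find "$demo" -name 'schema.json' -print0 |
  while IFS= read -r -d '' schema; do
    echo datamodel-codegen --input "$schema" --output "$(dirname "$schema")/__init__.py"
  done)
echo "$out"
```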
Now you've got a hierarchical, importable, structured tree of Python modules, with hopefully as few `ModelName\d+` hits as possible. It can be fully regenerated at will, and symbols can be renamed via `import X as Y`.
I implemented the above idea. It took more lines than expected, so it lives in a gist:
https://gist.github.com/alexpovel/00ab28e4815a905d4e0407c4932f9988
The module/script docstring explains everything, and there's also a usage-example shell script. The script is specific to the GitHub OpenAPI specification but can, I hope, be adjusted easily. It is a hot hack on top of datamodel-codegen and will hopefully be obsolete one day (it not only generates a single-path model file but also fixes a few bugs I happened to come across; your use case might be different and require other fixes, if any).
Pasting both below for convenience.
Main
#!/usr/bin/env python3
# https://peps.python.org/pep-0722/
# Script Dependencies:
#
# datamodel-code-generator==0.21.1 ; python_version >= "3.10" and python_version < "4.0"
# black==23.7.0 ; python_version >= "3.10" and python_version < "4.0"
# pydantic==2.1.0 ; python_version >= "3.10" and python_version < "4.0"
"""From an OpenAPI specification in JSON format, generate a pydantic data model for a
given URL path inside that spec (only).
For example, for a spec containing the `"paths"` key
`/repos/{owner}/{repo}/actions/runs`, it will place the generated pydantic models at
`$OUT_DIR/repos/OWNER/REPO/actions/runs/__init__.py`, ready to be imported. The output
root can be modified via `--out-dir`.
See https://github.com/koxudaxi/datamodel-code-generator/issues/1170 for why this script
can be useful (`datamodel-codegen` generates a *single* file out of a given spec, which
can get unwieldy for large specs).
This script contains special-cased code specific to the GitHub API that you might want
to delete.
Currently only works for GET queries returning HTTP 200 responses on success. Other HTTP
methods and status codes are not supported, but easy to add.
"""
import argparse
import ast
import json
import logging
import re
import typing as t
from functools import partial
from http import HTTPMethod, HTTPStatus
from pathlib import Path, PosixPath
from datamodel_code_generator import (
DataModelType,
InputFileType,
PythonVersion,
generate,
)
logging.basicConfig(level=logging.INFO)
def sanitize_for_python_import_use(path: PosixPath) -> Path:
"""Sanitize a path for use as a Python import.
Abusing file system paths here, assuming that any passed URL path is compatible. All
we want is type-safe splitting at `/` anyway.
Imported name parts *must* be valid Python identifiers, which this function affords.
See `dotted_name` of https://docs.python.org/3/reference/grammar.html .
"""
logging.info(f"Sanitizing path '{path}'")
assert path.is_absolute(), (
f"Path '{path}' not absolute, as required by OpenAPI spec"
+ " (https://github.com/OAI/OpenAPI-Specification/blob/9dff244e5708fbe16e768738f4f17cf3fddf4066/schemas/v3.0/schema.json#L793)"
)
if str(path) == path.root: # Terminating base case
return path
name = path.name
def replace(match: re.Match[str]) -> str:
"""Replace a match with its first group, uppercased.
Useful to not uppercase the *entire* string.
"""
return match.group(1).upper()
# Deal with `{foo}` -> `FOO`, common for variable paths in OpenAPI specs
name = re.sub(r"\{(.*)\}", replace, name)
# Any remaining non-word characters -> `_`
name = re.sub(r"[^\w]", "_", name)
# Leading $DIGITS -> `_$DIGITS`
name = re.sub(r"^(\d)", r"_\1", name)
logging.info(f"Sanitized path name to: '{name}'")
assert name.isidentifier(), f"Name '{name}' not a valid identifier after cleaning"
return sanitize_for_python_import_use(path.parent) / name
def general_fixup(tree: ast.Module, *, class_renames: dict[str, str]) -> ast.Module:
"""Fixes up code generated by `datamodel-codegen`.
This should only be necessary for a brief period of time, until the below is fixed
upstream.
This uses `ast` which drops comments, whitespace and other formatting. Use LibCST
(https://libcst.readthedocs.io/en/latest/) for concrete syntax trees able to
preserve these elements. `ast` was used for simplicity.
"""
import ast
from pydantic import Field
def convert_single_example_value_to_examples_list(tree: ast.Module) -> ast.Module:
"""Converts `example` values to `examples` lists.
Despite specifying `DataModelType.PydanticV2BaseModel`, `datamodel-codegen` will
generate calls like `Field(example="hello")`, which in pydantic v1 was allowed
(as `Any` `kwarg` was):
https://github.com/pydantic/pydantic/blob/v1.10.12/pydantic/fields.py#L249
but in pydantic v2 is fixed:
https://github.com/pydantic/pydantic/blob/v2.1.1/pydantic/fields.py#L672
aka calls should now look like `Field(examples=["hello"])`.
"""
for node in ast.walk(tree):
match node:
case ast.AnnAssign(
annotation=ast.Subscript(
slice=ast.Tuple(
dims=[
_,
ast.Call(
func=ast.Name(id=Field.__name__), keywords=keywords
),
]
)
)
):
for kw in keywords:
match kw:
case ast.keyword(arg="example", value=example):
kw.arg = "examples"
kw.value = ast.List(elts=[example], ctx=ast.Load())
case _:
pass
case _:
pass
return tree
def rename_classes(tree: ast.Module, renames: dict[str, str]) -> ast.Module:
"""Renames classes in the AST.
`datamodel-codegen` generates class names automatically from the OpenAPI
description, where they might not be desirable names (`OrganizationFull` instead
of just `Organization`).
"""
keys = set(renames.keys())
for node in ast.walk(tree):
match node:
# Class definition itself
case ast.ClassDef(name=name) if name in renames:
node.name = renames[name]
keys.remove(name)
# Class references
case ast.Name(id=name) if name in renames:
# Relevant for cases like:
#
# ```python
# import ast
#
# print(ast.dump(ast.parse('def x(a: SomeClass): pass'), indent=4))
# print(ast.dump(ast.parse('x: list[SomeClass] = [3]'), indent=4))
# print(ast.dump(ast.parse('SomeClass(x=3)'), indent=4))
# ```
node.id = renames[name]
case _:
pass
if keys:
logging.error(f"Class renames not applied (name not found) for: {keys}")
return tree
tree = convert_single_example_value_to_examples_list(tree)
tree = rename_classes(tree, class_renames)
return tree
def github_fixup(tree: ast.Module, *, path: PosixPath) -> ast.Module:
"""GitHub REST API-specific fixes.
Some ugly, hard-coded special cases. Mainly to iron out what are probably bugs in
`datamodel-codegen`'s OpenAPI parsing, so this should be temporary.
For example, a JSON schema entry:
```json
"archived_at": {
"type": "string", "format": "date-time", "nullable": true
}
```
ending up as `archived_at: datetime` (no `None`), leading to validation errors.
"""
from datetime import datetime
from pydantic import EmailStr
applied_fixes = {
"archived_at_datetime_not_nullable": False,
"description_string_not_nullable": False,
"emailstr_to_str": False,
}
for node in ast.walk(tree):
match node:
# `path` would be much better as part of the (in that case `tuple`) pattern
# natively instead of guards but mypy failed then, and type narrowing broke
# down.
case ast.AnnAssign(
target=ast.Name(id="archived_at"),
annotation=ast.Name(id=datetime.__name__) as annotation,
) if path == PosixPath(r"/orgs/{org}"):
logging.info("Fixing up `archived_at` for node: %s", ast.dump(node))
node.annotation = ast.BinOp(
left=annotation,
op=ast.BitOr(),
right=ast.Constant(value=None),
)
applied_fixes["archived_at_datetime_not_nullable"] = True
case ast.AnnAssign(
target=ast.Name(id="description"),
annotation=ast.Subscript(
slice=ast.Tuple(elts=[ast.Name(id=str.__name__) as first, *rest]),
),
):
# ANY PATH!
logging.info("Fixing up `description` for node: %s", ast.dump(node))
# Type-narrow manually, else we're not allowed to reach through the
# attributes beyond `node.annotation`. `mypy` should be strong enough to
# do this itself at a future date.
assert isinstance(node.annotation, ast.Subscript)
assert isinstance(node.annotation.slice, ast.Tuple)
# Additionally allowing `None` for the `description` field
# *unconditionally* is not fatal, as it's simply a more conservative
# choice, requiring some `None` checks even if the GitHub API would
# actually never return `None` for that field.
node.annotation.slice.elts = [
ast.BinOp(
left=first,
op=ast.BitOr(),
right=ast.Constant(value=None),
),
*rest,
]
applied_fixes["description_string_not_nullable"] = True
case ast.Name(id=EmailStr.__name__): # ANY PATH!
# `pydantic.EmailStr` uses `email-validator`, which (rightfully?)
# doesn't allow square brackets:
#
# https://github.com/JoshData/python-email-validator/blob/5abaa7b4ce6677e5a2217db2e52202a760de3c24/email_validator/rfc_constants.py#L7
#
# Let's change *all* these occurrences to `str` for now, as the exact
# email format isn't that important.
#
# Breakage was noticed due to GitHub dependabot commits, where the
# author email can be, for example:
#
# ```text
# `49699333+dependabot[bot]@users.noreply.github.com`
# ```
#
# Something something https://news.ycombinator.com/item?id=32671959
node.id = str.__name__
applied_fixes["emailstr_to_str"] = True
case _:
pass
for key, applied in applied_fixes.items():
if not applied:
logging.warning(f"Fix `{key}` not applied for path: {path}")
return tree
def black(code: str) -> str:
"""Format code with black.
`black` doesn't have an API (yet) so this is brittle! See
https://stackoverflow.com/a/76052629/11477374
"""
import black
BLACK_MODE = black.Mode( # type: ignore[attr-defined]
target_versions={black.TargetVersion.PY311}, # type: ignore[attr-defined]
preview=True, # Get experimental features like string formatting/wrapping
)
try:
code = black.format_file_contents(code, fast=False, mode=BLACK_MODE)
except black.NothingChanged: # type: ignore[attr-defined]
pass
finally:
if code and code[-1] != "\n":
code += "\n"
return code
def embellish(code: str, original_spec: dict[t.Any, t.Any]) -> str:
"""Embellish code with comments and docstrings."""
import sys
from datetime import datetime
from textwrap import dedent
tool_directives = [
# Add any required, file-scoped linter directives in here.
#
# Ignore line length:
"ruff: noqa: E501",
# Ignore `pydantic.RootModel` w/o generic args, which is occasionally generated
# by `datamodel-codegen`:
'mypy: disable-error-code="type-arg"',
]
header = dedent(
f"""\
# File generated by command: {' '.join(sys.argv)}
#
# Generated at: {datetime.utcnow().isoformat()}
#
# Do not edit manually.
#
# Original schema this was generated from attached at bottom of file.
"""
)
header = header + "\n" + "\n".join(f"# {line}" for line in tool_directives) + "\n"
footer = "\n".join(
f"# {line}" for line in json.dumps(original_spec, indent=2).split("\n")
)
return header + "\n" + code + "\n" + footer
def main() -> None:
parser = argparse.ArgumentParser(
formatter_class=argparse.ArgumentDefaultsHelpFormatter,
description=__doc__,
)
parser.add_argument(
"spec_file",
type=Path,
help="Path to file containing OpenAPI spec."
+ " See for example https://github.com/github/rest-api-description",
)
parser.add_argument(
"path",
type=str,
help=r"URL part from OpenAPI spec to generate code for, e.g. '/orgs/{org}'."
+ " See 'Path Objects' of https://swagger.io/specification/ for details.",
)
parser.add_argument(
"--out-dir",
type=Path,
help="Base directory under which to place generated code."
+ " An OpenAPI path like `/some/path` will be placed at `OUT_DIR/some/path`.",
default=Path("api"),
)
parser.add_argument(
"--rename-class",
help="Rename a class in the generated code. Can be specified multiple times."
+ " Specify as `OldName=NewName`."
+ " Example: `--rename-class Model=SomeMeaningfulName`",
action="append",
)
parser.add_argument(
"-f",
"--force",
action="store_true",
help="Overwrite existing files",
)
args = parser.parse_args()
spec_file = Path(args.spec_file)
path = PosixPath(args.path)
out_dir = Path(args.out_dir)
class_renames = (
{}
if args.rename_class is None
else {
old: new
for old, new in map(
partial(str.split, sep="="),
args.rename_class,
)
}
)
for old, new in class_renames.items():
assert old.isidentifier(), f"Invalid class name: {old}"
assert new.isidentifier(), f"Invalid class name: {new}"
force = args.force is True # Don't rely on truthiness
spec = json.loads(spec_file.read_text())
method = str(HTTPMethod.GET).lower()
status = str(HTTPStatus.OK)
# The following will produce helpful and beautiful enough error messages by itself
# from Python 3.11 on
# (https://docs.python.org/3/whatsnew/3.11.html#whatsnew311-pep657), so just let it
# fail without special key checks.
#
# If the hardcoded `GET`/`200` combo is ever refactored to be more flexible, maybe
# `jq` like query syntax works best ("industry standard" and most flexible for
# users).
schema = spec["paths"][str(path)][method]["responses"][status]["content"][
"application/json"
]["schema"]
output = sanitize_for_python_import_use(path) / "__init__.py"
assert output.is_absolute()
output = output.relative_to(output.root) # 'strip' leading slash
output = out_dir / output
if output.exists() and not force:
raise FileExistsError(
f"Output '{output}' already exists, refusing to overwrite."
+ " Use --force to overwrite."
)
output.parent.mkdir(parents=True, exist_ok=True)
generate(
input_=str(schema),
input_file_type=InputFileType.JsonSchema,
output=output,
output_model_type=DataModelType.PydanticV2BaseModel,
field_constraints=True,
use_field_description=False, # Comments dropped in AST parsing
use_annotated=True,
reuse_model=True,
target_python_version=PythonVersion.PY_311,
use_double_quotes=True,
use_standard_collections=True,
use_union_operator=True,
wrap_string_literal=True,
)
# Syntax-level fixups
tree = ast.parse(output.read_text())
tree = general_fixup(tree, class_renames=class_renames)
tree = github_fixup(tree, path=path)
code = ast.unparse(tree)
# Raw string code-level fixups
code = black(code)
code = embellish(code, schema)
with open(output, "w") as f:
f.write(code)
if __name__ == "__main__":
main()
Usage example
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
VENV_DIR=$(mktemp -d)
SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )"
echo "Using temporary virtual environment at $VENV_DIR"
setup() {
command -v curl || { echo "Need to install curl..." && sudo apt update && sudo apt install --yes curl; }
# Somehow install Python dependencies from `datamodel-codegen-path.py`, which uses
# PEP 722 which isn't supported anywhere yet... so do it manually for demonstration
# 🤷
python3 -m venv "$1"
"$1"/bin/python3 -m pip install datamodel-code-generator pydantic black
}
setup "$VENV_DIR"
[ -f github.json ] || \
curl \
--location \
--output github.json \
https://raw.githubusercontent.com/github/rest-api-description/v2.1.0/descriptions/api.github.com/dereferenced/api.github.com.deref.json
"$VENV_DIR"/bin/python3 \
"${SCRIPT_DIR}/datamodel-codegen-path.py" \
--force \
--rename-class 'Model=OrganizationRepository' \
github.json \
'/orgs/{org}/repos'
CODE=$(cat <<EOF
from api.orgs.ORG.repos import OrganizationRepository
print(OrganizationRepository)
print("Import successful, your setup worked!")
EOF
)
"$VENV_DIR"/bin/python3 -c "$CODE"
I am facing this issue as well.
One thing I found interesting is that months ago (maybe a year), I used datamodel-codegen and got a hierarchy of packages and modules. That run took a JSON file with `"openapi": "3.0.0"` as its input.
Now when I use it, I get a monofile, with many duplicate model names disambiguated by suffix integers. This run took a JSON file with `"swagger": "2.0"` as its input.
I am not sure if this is due to a change in datamodel-codegen or to the difference in input files.
@koxudaxi how does datamodel-codegen decide whether it is going to produce one large monofile or a structure of packages & modules?
@89465127 from my own testing, if schemas have periods/dots in the name (e.g. `foo.bar.Snap`), then it requires a folder output and uses separate files. If I replace the periods with underscores or use camel case, it uses a single output file.
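For illustration, a minimal, hypothetical schema fragment of the kind this comment describes as triggering the modular, multi-file output (all names made up):

```json
{
  "components": {
    "schemas": {
      "foo.bar.Snap": { "type": "object" },
      "foo.bar.Crackle": { "type": "object" }
    }
  }
}
```

Renaming these to, say, `foo_bar_Snap` and `foo_bar_Crackle` would, per the observation above, collapse the output back into a single file.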