dataclass-wizard
dataclass-wizard copied to clipboard
[wiz-cli] the `wiz` tool shouldn't use any duplicate dataclass names
- Dataclass Wizard version: 0.22.1
- Python version: 3.10
- Operating System: Mac OS
Description
I found an interesting case where the wrong dataclass schema is generated with the wiz
command-line tool. The schema is wrong insofar as that the same dataclass name - Ability
in this case - is reused between generated dataclasses in a nested dataclass structure. This obviously won't allow us to actually load JSON data to the main dataclass Data
, because the nested schema has a bug (duplicate name for dataclass) as indicated.
See output below for more details.
What I Did
I ran wiz gs
from my terminal, with the following JSON data as input:
echo '{"id": 1,
"height": 7,
"name": "bulbasur",
"abilities": [
{
"ability": {
"name": "overgrow",
"url": "https://pokeapi.co/api/v2/ability/65/"
},
"is_hidden": false,
"slot": 1
},
{
"ability": {
"name": "chlorophyll",
"url": "https://pokeapi.co/api/v2/ability/34/"
},
"is_hidden": true,
"slot": 3
}
]
}' | wiz gs -x
The result generated the following nested dataclass schema:
from __future__ import annotations
from dataclasses import dataclass
from dataclass_wizard import JSONWizard
@dataclass
class Data(JSONWizard):
"""
Data dataclass
"""
id: int
height: int
name: str
abilities: list[Ability]
@dataclass
class Ability:
"""
Ability dataclass
"""
ability: Ability
is_hidden: bool
slot: int
@dataclass
class Ability:
"""
Ability dataclass
"""
name: str
url: str
As you can already see, the Ability
dataclass name is duplicated. This results in the second class definition overwriting the previous one, which is what we would like to avoid in this case.
I've then tried the following Python code with the generated class schema above:
j = '{"id":1,"height":7,"name":"bulbasur","abilities":[{"ability":{"name":"overgrow","url":"https://pokeapi.co/api/v2/ability/65/"},"is_hidden":false,"slot":1},{"ability":{"name":"chlorophyll","url":"https://pokeapi.co/api/v2/ability/34/"},"is_hidden":true,"slot":3}]}'
instance = Data.from_json(j)
Which resulted in an error, as expected, since the second dataclass definition overwrites the first one:
error: Ability.__init__() missing 2 required positional arguments: 'name' and 'url'
Renaming the second class to something else, for example Ability2
, and then updating any references to it, then works as expected to load the JSON input data to a dataclass instance.
Resolution
One possible approach that comes to mind, is to keep a hashset of all generated class names - or better yet a dict
mapping of class name to a count of its occurrences, or how many times we've seen it - maybe at the module or global scope, or perhaps recursively pass it in through each function in the generation process.
Then check if the class name we want to add is already in the mapping, and if so we add some random suffix (for example increment by the count of how many times we've already seen the class name) to the generated class name; if not, we use the generated class name as-is. In any case, we then increment the count of the class name in the mapping, to indicate that we've seen it previously.
I've run into the same issue, with the added complication that the same key is reused to represent several objects:
{
"data": {
"citations": {
"data": [
{
"id": "10.1093/bioinformatics/btu153",
"type": "dois"
},
{
"id": "10.1038/nbt.4163",
"type": "dois"
}
]
},
"client": {
"data": {
"id": "doe.lbnl",
"type": "clients"
}
},
"media": {
"data": {
"id": "10.25982/26409.220/1820944",
"type": "media"
}
},
"partOf": {
"data": []
},
}
}
The data
key is being used here as a container for the whole API response, as an object with keys id
and type
, and as a container for anonymous objects with keys id
and type
. The data structure above is particularly tricky as it's not clear whether the value of the data
object automatically gets turned into a list if there are several (see data["citations"]["data"]
vs data["client"]["data"]
vs data["partOf"]["data"]
Possible approach:
- have two dataclass generation modes, conservative and DRY
- in conservative mode, a separate dataclass is generated for each instance of the duplicated key
- in DRY mode, the wizard will attempt to reuse class names
- when parsing, collect instances of the generated class names and the data structures that they represent
- if the data structures beneath the keys match, reuse the class name -- e.g. in the data structure above,
data["client"]["data"]
anddata["attributes"]["media"]["data"]
- otherwise, create a new class name
The definitive way to work out the class structures would be to have access to the schema (which, unfortunately, the API does not provide) but this would provide a reasonable estimation.