dataclass-wizard icon indicating copy to clipboard operation
dataclass-wizard copied to clipboard

[wiz-cli] the `wiz` tool shouldn't use any duplicate dataclass names

Open rnag opened this issue 2 years ago • 1 comments

  • Dataclass Wizard version: 0.22.1
  • Python version: 3.10
  • Operating System: Mac OS

Description

I found an interesting case where the wrong dataclass schema is generated with the wiz command-line tool. The schema is wrong insofar as that the same dataclass name - Ability in this case - is reused between generated dataclasses in a nested dataclass structure. This obviously won't allow us to actually load JSON data to the main dataclass Data, because the nested schema has a bug (duplicate name for dataclass) as indicated.

See output below for more details.

What I Did

I ran wiz gs from my terminal, with the following JSON data as input:

echo '{"id": 1,
"height": 7,
"name": "bulbasur",
"abilities": [
    {
      "ability": {
        "name": "overgrow",
        "url": "https://pokeapi.co/api/v2/ability/65/"
      },
      "is_hidden": false,
      "slot": 1
    },
    {
      "ability": {
        "name": "chlorophyll",
        "url": "https://pokeapi.co/api/v2/ability/34/"
      },
      "is_hidden": true,
      "slot": 3
    }
  ]
}' | wiz gs -x

The result generated the following nested dataclass schema:

from __future__ import annotations

from dataclasses import dataclass

from dataclass_wizard import JSONWizard


@dataclass
class Data(JSONWizard):
    """
    Data dataclass

    """
    id: int
    height: int
    name: str
    abilities: list[Ability]


@dataclass
class Ability:
    """
    Ability dataclass

    """
    ability: Ability
    is_hidden: bool
    slot: int


@dataclass
class Ability:
    """
    Ability dataclass

    """
    name: str
    url: str

As you can already see, the Ability dataclass name is duplicated. This results in the second class definition overwriting the previous one, which is what we would like to avoid in this case.

I've then tried the following Python code with the generated class schema above:

j = '{"id":1,"height":7,"name":"bulbasur","abilities":[{"ability":{"name":"overgrow","url":"https://pokeapi.co/api/v2/ability/65/"},"is_hidden":false,"slot":1},{"ability":{"name":"chlorophyll","url":"https://pokeapi.co/api/v2/ability/34/"},"is_hidden":true,"slot":3}]}'
instance = Data.from_json(j)

Which resulted in an error, as expected, since the second dataclass definition overwrites the first one:

  error: Ability.__init__() missing 2 required positional arguments: 'name' and 'url'

Renaming the second class to something else, for example Ability2, and then updating any references to it, then works as expected to load the JSON input data to a dataclass instance.

Resolution

One possible approach that comes to mind, is to keep a hashset of all generated class names - or better yet a dict mapping of class name to a count of its occurrences, or how many times we've seen it - maybe at the module or global scope, or perhaps recursively pass it in through each function in the generation process.

Then check if the class name we want to add is already in the mapping, and if so we add some random suffix (for example increment by the count of how many times we've already seen the class name) to the generated class name; if not, we use the generated class name as-is. In any case, we then increment the count of the class name in the mapping, to indicate that we've seen it previously.

rnag avatar May 13 '22 17:05 rnag

I've run into the same issue, with the added complication that the same key is reused to represent several objects:

{
   "data": {
      "citations": {
        "data": [
          {
            "id": "10.1093/bioinformatics/btu153",
            "type": "dois"
          },
          {
            "id": "10.1038/nbt.4163",
            "type": "dois"
          }
        ]
      },
      "client": {
        "data": {
          "id": "doe.lbnl",
          "type": "clients"
        }
      },
      "media": {
        "data": {
          "id": "10.25982/26409.220/1820944",
          "type": "media"
        }
      },
      "partOf": {
        "data": []
      },
   }
}

The data key is being used here as a container for the whole API response, as an object with keys id and type, and as a container for anonymous objects with keys id and type. The data structure above is particularly tricky as it's not clear whether the value of the data object automatically gets turned into a list if there are several (see data["citations"]["data"] vs data["client"]["data"] vs data["partOf"]["data"]

Possible approach:

  • have two dataclass generation modes, conservative and DRY
  • in conservative mode, a separate dataclass is generated for each instance of the duplicated key
  • in DRY mode, the wizard will attempt to reuse class names
  • when parsing, collect instances of the generated class names and the data structures that they represent
  • if the data structures beneath the keys match, reuse the class name -- e.g. in the data structure above, data["client"]["data"] and data["attributes"]["media"]["data"]
  • otherwise, create a new class name

The definitive way to work out the class structures would be to have access to the schema (which, unfortunately, the API does not provide) but this would provide a reasonable estimation.

ialarmedalien avatar Aug 11 '22 16:08 ialarmedalien