
Encoding and decoding objects such as `ProvenanceDoc`-s via e.g. `as_dict()` and `from_dict()`

Open sgbaird opened this issue 3 years ago • 9 comments

I've been having a hard time saving a DataFrame to a JSON file (or a jsonpickle JSON file) when it includes `ProvenanceDoc` objects. My workaround right now is to extract some minimal data from each document, such as `references` and `material_id`. Wondering if you have any suggestions.

I'm trying to follow the style of Matbench/Matminer in having my own benchmark dataset stored on figshare and encoding/decoding it. Maybe I'm too hung up on saving a ProvenanceDoc and should stick with extracting what I can easily/manually.

sgbaird avatar Jun 04 '22 05:06 sgbaird

https://pymatgen.org/usage.html#montyencoder-decoder
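
The encoder/decoder linked above works through the `as_dict()`/`from_dict()` contract. A minimal stdlib-only sketch of that contract (`Reference` is an illustrative stand-in, not an emmet class; real MSONable objects also embed `@module` and `@class` keys so `MontyDecoder` knows which class to rebuild):

```python
import json


class Reference:
    """Illustrative stand-in for an MSONable object."""

    def __init__(self, doi: str):
        self.doi = doi

    def as_dict(self) -> dict:
        # Real MSONable objects also include "@module" here.
        return {"@class": "Reference", "doi": self.doi}

    @classmethod
    def from_dict(cls, d: dict) -> "Reference":
        return cls(doi=d["doi"])


payload = json.dumps(Reference("10.1000/xyz123").as_dict())  # encode
restored = Reference.from_dict(json.loads(payload))          # decode
print(restored.doi)  # 10.1000/xyz123
```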

sgbaird avatar Jun 04 '22 05:06 sgbaird

Traceback (most recent call last):
  File "C:\Users\sterg\miniconda3\envs\mp-time-split\Lib\site-packages\monty\json.py", line 321, in default
    d = o.as_dict()

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\sterg\Documents\GitHub\sparks-baird\mp-time-split\scripts\data_snapshot.py", line 20, in <module>
    json.dumps(dummy_expt_df, cls=MontyEncoder)
  File "C:\Users\sterg\miniconda3\envs\mp-time-split\Lib\json\__init__.py", line 234, in dumps
    return cls(
  File "C:\Users\sterg\miniconda3\envs\mp-time-split\Lib\json\encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "C:\Users\sterg\miniconda3\envs\mp-time-split\Lib\json\encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "C:\Users\sterg\miniconda3\envs\mp-time-split\Lib\site-packages\monty\json.py", line 301, in default
    "data": o.to_json(default_handler=MontyEncoder().encode),
  File "C:\Users\sterg\miniconda3\envs\mp-time-split\Lib\site-packages\pandas\core\generic.py", line 2621, in to_json
    return json.to_json(
  File "C:\Users\sterg\miniconda3\envs\mp-time-split\Lib\site-packages\pandas\io\json\_json.py", line 110, in to_json
    s = writer(
  File "C:\Users\sterg\miniconda3\envs\mp-time-split\Lib\site-packages\pandas\io\json\_json.py", line 172, in write
    return dumps(
  File "C:\Users\sterg\miniconda3\envs\mp-time-split\Lib\json\encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "C:\Users\sterg\miniconda3\envs\mp-time-split\Lib\json\encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "C:\Users\sterg\miniconda3\envs\mp-time-split\Lib\site-packages\monty\json.py", line 336, in default
    return json.JSONEncoder.default(self, o)
  File "C:\Users\sterg\miniconda3\envs\mp-time-split\Lib\json\encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type CrystalSystem is not JSON serializable

sgbaird avatar Jun 04 '22 05:06 sgbaird

This is something I should be able to fix on my end in emmet-core. I'll report back when I have made the fix and patch released.

munrojm avatar Jun 04 '22 05:06 munrojm

@sgbaird You may know this already, but what I tend to do in this case is pass a custom handler as the `default_handler` kwarg to `DataFrame.to_json()`.

from __future__ import annotations

from typing import Any

from emmet.core.provenance import ProvenanceDoc


def as_dict_handler(obj: object) -> dict[str, Any] | None:
    """Use as default_handler kwarg to json.dump() or DataFrame.to_json()."""
    try:
        return obj.as_dict()  # all MSONable objects implement as_dict()
    except AttributeError:
        if isinstance(obj, ProvenanceDoc):
            # pydantic models don't support item access, so use getattr
            needed_attrs = ("foo", "bar", ...)
            return {k: getattr(obj, k) for k in needed_attrs}

        return None  # replace unhandled objects with None in serialized data


df.to_json("some-data.json.gz", default_handler=as_dict_handler)
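
The same pattern works with the standard library's encoder, where the kwarg is `default` rather than `default_handler`. A self-contained sketch with a hypothetical stand-in class, since `ProvenanceDoc` itself isn't needed to demonstrate it:

```python
import json


class FakeDoc:
    """Illustrative stand-in for an MSONable object; no emmet import needed."""

    def as_dict(self) -> dict:
        return {"material_id": "mp-149"}


def as_dict_handler(obj):
    """Same handler shape, passed as `default=` to the stdlib encoder."""
    try:
        return obj.as_dict()
    except AttributeError:
        return None  # unhandled objects become null in the output


out = json.dumps({"doc": FakeDoc(), "raw": object()}, default=as_dict_handler)
print(out)  # {"doc": {"material_id": "mp-149"}, "raw": null}
```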

janosh avatar Aug 13 '22 17:08 janosh

@janosh, interesting. That's new to me. Thanks for the tip!

sgbaird avatar Aug 13 '22 18:08 sgbaird

@munrojm just wondering if there was an update on this issue?

I believe monty's dumpfn/loadfn can serialize and deserialize both pandas DataFrames and pydantic models, though I haven't actually verified both simultaneously. It seems like it'd be a common use case, however.
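
For context, the round trip being described can be sketched with the standard library alone; monty's dumpfn/loadfn layer MontyEncoder/MontyDecoder on top of exactly this kind of dump/load pair, which is what would let MSONable objects and DataFrames pass through. The records below are dummy data, shaped like the output of `DataFrame.to_dict("records")`:

```python
import json
import os
import tempfile

# Dummy rows, shaped like DataFrame.to_dict("records") output.
records = [
    {"material_id": "mp-149", "references": ["doi-placeholder"]},
    {"material_id": "mp-13", "references": ["doi-placeholder"]},
]

# Write to a JSON file and read it back (dumpfn/loadfn-style round trip).
path = os.path.join(tempfile.mkdtemp(), "snapshot.json")
with open(path, "w") as f:
    json.dump(records, f)

with open(path) as f:
    loaded = json.load(f)

print(loaded == records)  # True
```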

mkhorton avatar Aug 26 '22 20:08 mkhorton