Encoding and decoding objects such as `ProvenanceDoc` via e.g. `as_dict()` and `from_dict()`
I've been having a hard time trying to save a DataFrame to a JSON file (or a jsonpickle JSON file) when it includes `ProvenanceDoc` objects. My workaround right now is just to extract some minimal data from each document, such as `references` and `material_id`. Wondering if you have any suggestions.

I'm trying to follow the style of Matbench/Matminer in keeping my own benchmark dataset on figshare and encoding/decoding it. Maybe I'm too hung up on saving a full `ProvenanceDoc` and should just stick with manually extracting the fields I need.
https://pymatgen.org/usage.html#montyencoder-decoder
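That round trip works fine for a plain MSONable object on its own. As a sanity check, a minimal sketch (my own, not from the linked page) using a pymatgen `Structure`:

```python
import json

from monty.json import MontyDecoder, MontyEncoder
from pymatgen.core import Lattice, Structure

struct = Structure(Lattice.cubic(4.2), ["Na", "Cl"], [[0, 0, 0], [0.5, 0.5, 0.5]])
serialized = json.dumps(struct, cls=MontyEncoder)  # MontyEncoder calls struct.as_dict()
roundtrip = json.loads(serialized, cls=MontyDecoder)  # rebuilt via the @module/@class keys
assert struct == roundtrip
```

But with a DataFrame that contains `ProvenanceDoc` objects I hit the following: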
```
Traceback (most recent call last):
  File "C:\Users\sterg\miniconda3\envs\mp-time-split\Lib\site-packages\monty\json.py", line 321, in default
    d = o.as_dict()

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\sterg\Documents\GitHub\sparks-baird\mp-time-split\scripts\data_snapshot.py", line 20, in <module>
    json.dumps(dummy_expt_df, cls=MontyEncoder)
  File "C:\Users\sterg\miniconda3\envs\mp-time-split\Lib\json\__init__.py", line 234, in dumps
    return cls(
  File "C:\Users\sterg\miniconda3\envs\mp-time-split\Lib\json\encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "C:\Users\sterg\miniconda3\envs\mp-time-split\Lib\json\encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "C:\Users\sterg\miniconda3\envs\mp-time-split\Lib\site-packages\monty\json.py", line 301, in default
    "data": o.to_json(default_handler=MontyEncoder().encode),
  File "C:\Users\sterg\miniconda3\envs\mp-time-split\Lib\site-packages\pandas\core\generic.py", line 2621, in to_json
    return json.to_json(
  File "C:\Users\sterg\miniconda3\envs\mp-time-split\Lib\site-packages\pandas\io\json\_json.py", line 110, in to_json
    s = writer(
  File "C:\Users\sterg\miniconda3\envs\mp-time-split\Lib\site-packages\pandas\io\json\_json.py", line 172, in write
    return dumps(
  File "C:\Users\sterg\miniconda3\envs\mp-time-split\Lib\json\encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "C:\Users\sterg\miniconda3\envs\mp-time-split\Lib\json\encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "C:\Users\sterg\miniconda3\envs\mp-time-split\Lib\site-packages\monty\json.py", line 336, in default
    return json.JSONEncoder.default(self, o)
  File "C:\Users\sterg\miniconda3\envs\mp-time-split\Lib\json\encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type CrystalSystem is not JSON serializable
```
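Reading the frames: monty hands the DataFrame off to pandas' `to_json` with `MontyEncoder().encode` as the `default_handler`, and the encoder then falls through to the stdlib JSON encoder on `CrystalSystem`, which has no `as_dict()`. A hypothetical stopgap, assuming the enum values sit directly in DataFrame cells rather than nested inside documents, would be to cast them to their plain values first:

```python
import json
from enum import Enum

from monty.json import MontyEncoder

# dummy_expt_df is the DataFrame from data_snapshot.py in the traceback above
dummy_expt_df = dummy_expt_df.applymap(
    lambda x: x.value if isinstance(x, Enum) else x  # e.g. CrystalSystem -> str
)
json.dumps(dummy_expt_df, cls=MontyEncoder)
```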
This is something I should be able to fix on my end in emmet-core. I'll report back once I've made the fix and released a patch.
@sgbaird You may know this already, but what I tend to do in this case is pass a custom `default_handler` to `DataFrame.to_json()`:
```python
from __future__ import annotations

from typing import Any

from emmet.core.provenance import ProvenanceDoc


def as_dict_handler(obj: object) -> dict[str, Any] | None:
    """Pass as the default_handler kwarg to DataFrame.to_json()
    (or as the default kwarg to json.dump())."""
    try:
        return obj.as_dict()  # all MSONable objects implement as_dict()
    except AttributeError:
        if isinstance(obj, ProvenanceDoc):
            # pydantic models aren't subscriptable, so pull attributes with getattr
            needed_attrs = ("foo", "bar", ...)  # placeholders: list the fields you need
            return {k: getattr(obj, k) for k in needed_attrs}
        return None  # unhandled objects become None in the serialized data


df.to_json("some-data.json.gz", default_handler=as_dict_handler)  # df: your DataFrame
```
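To get `ProvenanceDoc` objects back out on the read side, something like this should work, assuming the stored dicts contain all of the model's required fields (a sketch; the `provenance` column name is hypothetical):

```python
import pandas as pd
from emmet.core.provenance import ProvenanceDoc

df = pd.read_json("some-data.json.gz")  # gzip compression inferred from the extension
# rebuild the pydantic models from the stored dicts (parse_obj is the pydantic v1 API)
df["provenance"] = [ProvenanceDoc.parse_obj(d) for d in df["provenance"]]
```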
@janosh, interesting. That's new to me. Thanks for the tip!
@munrojm just wondering if there was an update on this issue?
I believe monty's `dumpfn`/`loadfn` can serialize and deserialize both pandas DataFrames and pydantic models, but I haven't actually verified the two together. It seems like it'd be a common use case, however.
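Untested, per the above, but a minimal sketch of that combined case might look like this (`docs` here stands for a list of `ProvenanceDoc` objects, e.g. pulled from the Materials Project API):

```python
import pandas as pd
from monty.serialization import dumpfn, loadfn

# docs: list[ProvenanceDoc], assumed to already exist
df = pd.DataFrame(
    {"material_id": [d.material_id for d in docs], "provenance": docs}
)
dumpfn(df, "snapshot.json.gz")  # dumpfn/loadfn infer gzip from the extension
df_roundtrip = loadfn("snapshot.json.gz")
```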