pyiron_base
Hashing of `DataContainer`
There was a pyiron meeting (where @pmrv was unfortunately not present) in which we talked about hashing `DataContainer`. This would allow the user to get a unique job name for each job with unique input data without having to think much about it, i.e. something like `job = pr.create.job.MyJob(('my_job', job.input.get_hash()))`. After a quick search I found this one, which is very simple and elegant. However, I'm not really sure how to handle the ordering of entries: it doesn't matter in most codes (e.g. VASP), but it does matter in some codes (e.g. LAMMPS), and things become even more complicated when it only partially matters (e.g. SPHInX). Let me know what you think.
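To make the ordering question concrete, here is a small illustration with plain dicts (not pyiron objects): an order-insensitive digest treats reordered input as identical, an order-sensitive one does not.

import hashlib
import json

# Purely illustrative dicts: same entries, different order.
a = {"ENCUT": 400, "ISMEAR": 0}
b = {"ISMEAR": 0, "ENCUT": 400}

# Order-insensitive digest (fine for VASP-like input):
print(hashlib.md5(json.dumps(a, sort_keys=True).encode()).hexdigest()
      == hashlib.md5(json.dumps(b, sort_keys=True).encode()).hexdigest())  # True

# Order-sensitive digest (what LAMMPS-like input would need):
print(hashlib.md5(json.dumps(a).encode()).hexdigest()
      == hashlib.md5(json.dumps(b).encode()).hexdigest())  # False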
As far as `DataContainer` is concerned, order matters and should be reflected in the "hash". Maybe we can add a little interface `HasDigest` that provides a hash, and classes where ordering doesn't matter can override it appropriately.
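Something along these lines, as a rough sketch (`HasDigest` and the example classes are just placeholders for this discussion, not existing pyiron classes):

from abc import ABC, abstractmethod


class HasDigest(ABC):
    @abstractmethod
    def digest(self) -> str:
        """Return a deterministic, content-based hash of the object."""


class OrderedThing(HasDigest):
    # placeholder for a DataContainer-like class where item order matters
    def __init__(self, items):
        self._items = list(items)

    def digest(self):
        return str(hash(tuple(self._items)))


class UnorderedThing(OrderedThing):
    # a class where ordering doesn't matter overrides with a sorted variant
    def digest(self):
        return str(hash(tuple(sorted(self._items))))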
Right now I'm thinking about hashing the content of what would be stored in HDF. For this, I'm trying to create a pseudo-HDF class to collect the HDF content, which is then converted to a `dict` and hashed. I will probably have to redefine a lot of functions already defined in `hdfio.py`. Do you have a better idea?
So, here's my first implementation, which works only if the HDF file already exists:
import numpy as np
from typing import Dict, Any
import hashlib
import json


def dict_hash(dictionary: Dict[str, Any]) -> str:
    """MD5 hash of a dictionary, independent of key order."""
    dhash = hashlib.md5()
    encoded = json.dumps(dictionary, sort_keys=True).encode()
    dhash.update(encoded)
    return dhash.hexdigest()


def serialize(v):
    # numpy arrays are not JSON serializable, so convert them to lists
    if isinstance(v, np.ndarray):
        return v.tolist()
    return v


def to_dict(s):
    # recursively convert an HDF group into a nested dict of its nodes
    results = {k: serialize(s[k]) for k in s.list_nodes()}
    for k in s.list_groups():
        results[k] = to_dict(s[k])
    return results
This gives:
from pyiron_atomistics import Project
pr = Project('HASH')
spx = pr.create.job.Sphinx('spx')
spx.structure = pr.create.structure.bulk('Al', cubic=True)
spx.run()
print(dict_hash(to_dict(spx['input'])))
Output: 382a062448b9052727d703074988e860
I started working on the implementation here.
That means only things that are already written can be hashed, yes? I also don't like dumping everything into JSON, because a) it might not work for all data that can be in a `DataContainer` and b) it can be memory intensive.
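For illustration, `json.dumps` already fails on a bare numpy array unless it is converted to a list first:

import json
import numpy as np

# json.dumps only understands basic Python types, so anything not converted
# beforehand (here a numpy array) raises a TypeError:
try:
    json.dumps({"positions": np.zeros((2, 3))})
except TypeError as err:
    print(err)  # Object of type ndarray is not JSON serializable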
Here's a sketch around the idea of Merkle trees.
from functools import singledispatch
from typing import Any


@singledispatch
def digest(value: Any) -> int:
    return hash(value)


# registered implementations are named _ so they don't shadow the dispatcher
@digest.register(DataContainer)
def _(value):
    # follows the recommendation here: https://docs.python.org/3/reference/datamodel.html?highlight=__hash__#object.__hash__
    return hash(tuple((k, digest(v)) for k, v in value.items()))


@digest.register(Atoms)
def _(value):
    ...
This has the advantage that hashes of sub-DataContainers could be cached. I've used `singledispatch` here just to not pollute `DataContainer`'s namespace and to allow overloading it for builtin types.
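A rough sketch of how that caching could look (`FakeContainer` is just a stand-in for `DataContainer`, and cache invalidation on mutation is ignored here):

from functools import singledispatch
from typing import Any


@singledispatch
def digest(value: Any) -> int:
    return hash(value)


class FakeContainer:
    # stand-in for DataContainer, just for this illustration
    def __init__(self, **items):
        self._items = items
        self._digest = None  # cached digest of this subtree

    def items(self):
        return self._items.items()


@digest.register(FakeContainer)
def _(value):
    # recompute only if no cached value exists; a real implementation would
    # also have to invalidate the cache whenever the container is modified
    if value._digest is None:
        value._digest = hash(tuple((k, digest(v)) for k, v in value.items()))
    return value._digest


nested = FakeContainer(a=1, b=FakeContainer(c=2, d=3))
print(digest(nested))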
You mean this should be defined independently for each class? I'm not really sure we want to do that. The great thing about HDF is that since `to_hdf` is already there, we can get the serialised data directly.

And yes, otherwise I like the idea of Merkle trees. From what I can see, `hash` cannot hash `list` (and probably not `np.ndarray` either), so maybe we should include a check for the data type here.
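For reference, a quick check of what the builtin `hash` accepts:

import numpy as np

hash((1, 2, 3))             # tuples are fine
try:
    hash([1, 2, 3])
except TypeError as err:
    print(err)              # unhashable type: 'list'
try:
    hash(np.array([1, 2, 3]))
except TypeError as err:
    print(err)              # unhashable type: 'numpy.ndarray'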
> You mean this should be defined independently for each class? I'm not really sure we want to do that. The great thing about HDF is that since `to_hdf` is already there, we can get the serialised data directly.
Yeah, I would just implement it for the important classes. We could think about using some of our more general interfaces to make it more convenient, like `digest.register(HasGroups)`, but I'm not sure this will always be helpful.
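A sketch of what that could look like (the import path is written from memory and should be treated as an assumption; it relies on `HasGroups` exposing the same `list_nodes()`/`list_groups()`/item access used in `to_dict()` above):

from functools import singledispatch
from typing import Any

# import path written from memory, treat it as an assumption
from pyiron_base.interfaces.has_groups import HasGroups


@singledispatch
def digest(value: Any) -> int:
    return hash(value)


@digest.register(HasGroups)
def _(value):
    # reuse the list_nodes()/list_groups()/item-access pattern from to_dict() above
    nodes = tuple((k, digest(value[k])) for k in value.list_nodes())
    groups = tuple((k, digest(value[k])) for k in value.list_groups())
    return hash((nodes, groups))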
> And yes, otherwise I like the idea of Merkle trees. From what I can see, `hash` cannot hash `list` (and probably not `np.ndarray` either), so maybe we should include a check for the data type here.
Yes, for mutable types `hash` doesn't work. That's one of the reasons I introduced a new function, so we can do `digest.register(list)` etc.
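For example, registrations for the mutable builtins that `hash()` rejects could look like this (a sketch; more types would be handled the same way):

from functools import singledispatch
from typing import Any

import numpy as np


@singledispatch
def digest(value: Any) -> int:
    return hash(value)


@digest.register(list)
def _(value):
    # hash the element digests instead of the (unhashable) list itself
    return hash(tuple(digest(v) for v in value))


@digest.register(np.ndarray)
def _(value):
    # shape, dtype and raw bytes together identify the array content
    return hash((value.shape, value.dtype.str, value.tobytes()))


print(digest([1, [2, 3], np.arange(4)]))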
The fact that `to_hdf` always saves serialised data in HDF can, from my point of view, be exploited for the hashing, i.e. we either refactor the functions inside `FileHDFIO` or create a pseudo-HDF class that mimics those functions to store the data in a dictionary, and then hash the content. This allows us to avoid writing all the extra digest functions.
I guess we can try to put together a mock class of `ProjectHDFio`, then call `to_hdf` with it and hash the output, but that might be tricky to get right, especially when the classes serializing themselves expect certain HDF features to be available. It would be worth a try though, since this would also lead the way to a general interface for storage backends, and we had already discussed something like this for e.g. S3 storage.
So I created this pseudo-HDF class:

class PseudoHDF(dict):
    # minimal stand-in for ProjectHDFio: stores everything in a nested dict
    def __enter__(self):
        return self

    def __exit__(self, *args):
        pass

    def open(self, group_name):
        # sub-groups become nested PseudoHDF instances
        self[group_name] = PseudoHDF()
        return self[group_name]
Together with the hashing function:

import numpy as np


def get_hash(h):
    if isinstance(h, dict):
        return hash(tuple((k, get_hash(v)) for k, v in h.items()))
    elif isinstance(h, (list, np.ndarray)):
        try:
            return hash(tuple(h))
        except TypeError:
            return hash(tuple(get_hash(hh) for hh in h))
    else:
        return h
BUT then I realized that when I use it, it gives me a different number every time. How is that possible?? @pmrv
from pyiron_atomistics import Project
structure = Project('.').create.structure.bulk('Al', cubic=True)
hdf = PseudoHDF()
structure.to_hdf(hdf)
get_hash(hdf)
IIRC the builtin hash function is basically the pointer cast to an int, so it should change every time you start a new process, but should be constant within one process.
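For the string keys involved here it is actually per-process hash randomization (PYTHONHASHSEED) rather than a pointer, but the effect is the same; a quick way to see it:

import subprocess
import sys

# the same string hashes to a different value in every new interpreter process
cmd = [sys.executable, "-c", "print(hash('NODE_NAME'))"]
print(subprocess.run(cmd, capture_output=True, text=True).stdout.strip())
print(subprocess.run(cmd, capture_output=True, text=True).stdout.strip())
# prints two different numbers, unless PYTHONHASHSEED is fixed in the environment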
OK, this one seems to work:

import numpy as np
import hashlib
import json


def digest(h):
    # deterministic (process-independent) hash via an MD5 of the JSON dump
    return hashlib.md5(json.dumps(h).encode('utf-8')).hexdigest()


def get_hash(h):
    if isinstance(h, dict):
        return digest({k: get_hash(v) for k, v in h.items()})
    elif isinstance(h, (list, np.ndarray)):
        try:
            return digest(np.array(h).tolist())
        except TypeError:
            return digest([get_hash(hh) for hh in h])
    else:
        return digest(h)
Example:
from pyiron_atomistics import Project
structure = Project('.').create.structure.bulk('Al', cubic=True)
hdf = PseudoHDF()
structure.to_hdf(hdf)
print(get_hash(hdf))
Output: 8e0cfdfde8b027ac49e87579f575aadd
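Tying this back to the original motivation, such a digest could then be used to derive a unique job name from the input, roughly like this (the name prefix and slicing are just for illustration, reusing `PseudoHDF` and `get_hash` from above):

from pyiron_atomistics import Project

pr = Project('HASH')
structure = pr.create.structure.bulk('Al', cubic=True)

# hash the serialised structure and build a deterministic job name from it
hdf = PseudoHDF()
structure.to_hdf(hdf)
job_name = 'spx_' + get_hash(hdf)[:8]
print(job_name)  # e.g. spx_8e0cfdf... -- the same name whenever the input is the same

spx = pr.create.job.Sphinx(job_name)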