pyiron_base

Hashing of `DataContainer`

Open · samwaseda opened this issue 2 years ago • 13 comments

There was a pyiron meeting (where @pmrv was unfortunately not present) where we talked about hashing DataContainer. This would allow the user to set a unique job name for each job with unique input data without having to think much about it, i.e. something like job = pr.create.job.MyJob(('my_job', job.input.get_hash())). After some quick research, I found this one very simple and elegant. However, I'm not sure how to handle the ordering of entries, which doesn't matter for most codes (e.g. VASP), but does matter for some (e.g. LAMMPS), and things become even more complicated when the order matters only partially (e.g. SPHInX). Let me know what you think.
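
To make the ordering question concrete, here is a tiny illustration with a hypothetical dict_digest helper (not part of pyiron): sorting the keys before hashing makes the digest order-insensitive, keeping the insertion order makes it order-sensitive.

import hashlib
import json

def dict_digest(d, order_matters=False):
    # hypothetical helper: hash a plain dict either ignoring or respecting
    # the insertion order of its entries
    encoded = json.dumps(d, sort_keys=not order_matters).encode()
    return hashlib.md5(encoded).hexdigest()

d1 = {"a": 1, "b": 2}
d2 = {"b": 2, "a": 1}
print(dict_digest(d1) == dict_digest(d2))                                          # True: order ignored
print(dict_digest(d1, order_matters=True) == dict_digest(d2, order_matters=True))  # False: order respected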

samwaseda avatar Oct 07 '22 07:10 samwaseda

As far as DataContainer is concerned, order matters and should be reflected in the "hash".

pmrv avatar Nov 08 '22 21:11 pmrv

Maybe we can add a little interface HasDigest that provides a hash, and classes where ordering doesn't matter can override it appropriately.
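
A minimal sketch of what that interface could look like (HasDigest and _digest_items are illustrative names, not existing pyiron code):

from abc import ABC, abstractmethod

class HasDigest(ABC):
    @abstractmethod
    def _digest_items(self):
        """Return the (key, value) pairs that define the object's content."""

    def digest(self):
        # order-sensitive by default; order-insensitive classes override this,
        # e.g. by sorting the items before hashing
        return hash(tuple((k, v) for k, v in self._digest_items()))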

pmrv avatar Nov 09 '22 10:11 pmrv

Right now I'm thinking about hashing the content of what would be stored in HDF. For this, I'm trying to create a pseudo-HDF class to capture the HDF content, which is then converted to a dict and hashed. I will probably have to redefine a lot of functions already defined in hdfio.py. Do you have a better idea?

samwaseda avatar Nov 17 '22 09:11 samwaseda

So, here's my first implementation, which only works if the HDF file already exists:

import hashlib
import json
from typing import Any, Dict

import numpy as np

def dict_hash(dictionary: Dict[str, Any]) -> str:
    """Return a reproducible MD5 hash of a JSON-serializable dictionary."""
    dhash = hashlib.md5()
    # sort_keys=True makes the hash independent of the insertion order
    encoded = json.dumps(dictionary, sort_keys=True).encode()
    dhash.update(encoded)
    return dhash.hexdigest()

def serialize(v):
    # numpy arrays are not JSON-serializable, so convert them to nested lists
    if isinstance(v, np.ndarray):
        return v.tolist()
    return v

def to_dict(s):
    # recursively convert an HDF group (nodes plus sub-groups) into a plain dict
    results = {k: serialize(s[k]) for k in s.list_nodes()}
    for k in s.list_groups():
        results[k] = to_dict(s[k])
    return results

This gives:

from pyiron_atomistics import Project

pr = Project('HASH')
spx = pr.create.job.Sphinx('spx')
spx.structure = pr.create.structure.bulk('Al', cubic=True)
spx.run()
print(dict_hash(to_dict(spx['input'])))

Output: 382a062448b9052727d703074988e860

samwaseda avatar Nov 17 '22 10:11 samwaseda

I started working on the implementation here.

samwaseda avatar Nov 17 '22 15:11 samwaseda

That means only things that have already been written can be hashed, yes? I also don't like dumping everything into JSON, because it a) might not work for all data that can be in a DataContainer and b) can be memory intensive.
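
For example, json.dumps fails on a numpy array unless a custom encoder is supplied:

import json
import numpy as np

try:
    json.dumps({"positions": np.zeros((2, 3))})
except TypeError as err:
    print(err)  # Object of type ndarray is not JSON serializable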

Here's a sketch around the idea of Merkle trees.


from functools import singledispatch
from typing import Any

@singledispatch
def digest(value: Any) -> int:
    # fallback: rely on the builtin hash for hashable types
    return hash(value)

@digest.register(DataContainer)
def _(value):
    # follows the recommendation here: https://docs.python.org/3/reference/datamodel.html?highlight=__hash__#object.__hash__
    return hash(tuple((k, digest(v)) for k, v in value.items()))

@digest.register(Atoms)
def _(value):
    ...

This has the advantage that hashes of sub-DataContainers could be cached. I've used singledispatch here just to avoid polluting DataContainer's namespace and to allow overloading it for builtin types.

pmrv avatar Nov 19 '22 10:11 pmrv

You mean this should be defined independently for each class? I'm not sure we really want to do that. The great thing about HDF is that, since to_hdf is already there, we can directly get the serialised data.

And yes, otherwise I like the idea of Merkle trees. From what I can see, hash cannot hash a list (and probably not np.ndarray either), so maybe we should include a check for the data type here.

samwaseda avatar Nov 19 '22 14:11 samwaseda

> You mean this should be defined independently for each class? I'm not sure we really want to do that. The great thing about HDF is that, since to_hdf is already there, we can directly get the serialised data.

Yeah, I would just implement it for the important classes. We could think about using some of our more general interfaces to make it more convenient, like digest.register(HasGroups), but I'm not sure this will always be helpful.

> And yes, otherwise I like the idea of Merkle trees. From what I can see, hash cannot hash a list (and probably not np.ndarray either), so maybe we should include a check for the data type here.

Yes, for mutable types hash doesn't work. That's one of the reasons I introduced a new function, so we can do digest.register(list) etc.
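
A minimal sketch of what such registrations could look like (the element-wise and bytes-based handling here is just one possible choice, not an existing pyiron API):

import numpy as np
from functools import singledispatch

@singledispatch
def digest(value) -> int:
    # fallback for hashable builtins
    return hash(value)

@digest.register(list)
@digest.register(tuple)
def _(value):
    # digest element-wise so nested mutable containers work too
    return hash(tuple(digest(v) for v in value))

@digest.register(np.ndarray)
def _(value):
    # shape, dtype and raw bytes together identify the array content
    return hash((value.shape, str(value.dtype), value.tobytes()))

print(digest([1, [2, 3], np.arange(4)]))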

pmrv avatar Nov 21 '22 15:11 pmrv

The fact that to_hdf always saves serialised data in HDF can, from my point of view, be exploited for the hashing: either we refactor the functions inside FileHDFIO, or we create a pseudo-HDF class that mimics those functions to store the data in a dictionary, and then hash the content. This would allow us to avoid writing all the extra digest functions.

samwaseda avatar Nov 22 '22 08:11 samwaseda

I guess we can try to write a mock class of ProjectHDFio, then call to_hdf with it and hash the output, but that might be tricky to get right, especially when the classes serializing themselves expect certain HDF features to be available. It would be worth a try though, since this would also lead the way to a general interface for storage backends, and we had already discussed something like this for e.g. S3 storage.

pmrv avatar Nov 22 '22 08:11 pmrv

So I created this pseudo-HDF class:

class PseudoHDF(dict):
    """Minimal stand-in for ProjectHDFio: collects what to_hdf would write into a nested dict."""

    def __enter__(self):
        return self

    def __exit__(self, *args):
        pass

    def open(self, group_name):
        # opening a group creates a nested PseudoHDF that the group's data is written into
        self[group_name] = PseudoHDF()
        return self[group_name]

Together with the hashing function:

import numpy as np
def get_hash(h):
    if isinstance(h, dict):
        return hash(tuple((k, get_hash(v)) for k, v in h.items()))
    elif isinstance(h, list) or isinstance(h, np.ndarray):
        try:
            return hash(tuple(h))
        except TypeError:
            return hash((get_hash(hh) for hh in h))
    else:
        return h

BUT then I realized that when I use it, it gives me a different number every time. How is that possible?? @pmrv

from pyiron_atomistics import Project
structure = Project('.').create.structure.bulk('Al', cubic=True)
hdf = PseudoHDF()
structure.to_hdf(hdf)
get_hash(hdf)

samwaseda avatar Jan 04 '23 17:01 samwaseda

IIRC the builtin hash function is just the pointer cast to an int, so it should change every time you start a new process, but should be constant within one process.
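
In CPython the per-process variation for str and bytes values comes from hash randomization (PYTHONHASHSEED); a quick way to see it:

import subprocess
import sys

# run the same hash() call in two fresh interpreter processes
cmd = [sys.executable, "-c", "print(hash('input/structure'))"]
for _ in range(2):
    print(subprocess.run(cmd, capture_output=True, text=True).stdout.strip())
# the two printed values differ unless PYTHONHASHSEED is fixed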

pmrv avatar Jan 04 '23 17:01 pmrv

Ok this one seems to work:

import hashlib
import json

import numpy as np

def digest(h):
    # reproducible across processes, unlike the builtin hash()
    return hashlib.md5(json.dumps(h).encode('utf-8')).hexdigest()

def get_hash(h):
    if isinstance(h, dict):
        # hash the values recursively (Merkle-tree style)
        return digest({k: get_hash(v) for k, v in h.items()})
    elif isinstance(h, (list, np.ndarray)):
        try:
            return digest(np.array(h).tolist())
        except TypeError:
            # fall back to element-wise hashing for entries that are not JSON-serializable
            return digest([get_hash(hh) for hh in h])
    else:
        return digest(h)

Example:

from pyiron_atomistics import Project
structure = Project('.').create.structure.bulk('Al', cubic=True)
hdf = PseudoHDF()
structure.to_hdf(hdf)

print(get_hash(hdf))

Output: 8e0cfdfde8b027ac49e87579f575aadd
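
Tying this back to the original motivation, the digest could then become part of the job name; a sketch reusing PseudoHDF and get_hash from above (the tuple-based naming mirrors the first comment and is only meant as an illustration):

from pyiron_atomistics import Project

pr = Project('HASH')
structure = pr.create.structure.bulk('Al', cubic=True)

hdf = PseudoHDF()
structure.to_hdf(hdf)

# hypothetical naming scheme: append part of the structure digest to the job name
job = pr.create.job.Sphinx(('spx', get_hash(hdf)[:8]))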

samwaseda avatar Jan 04 '23 21:01 samwaseda