New Generic Job interface to unify input & output AND make jobs hashable
Sketch from yesterday's pyiron meeting.
Two main ideas:
- Make `job.input` and `job.output` universal
- Create an extra step between job and HDF, i.e. job -> (serialised) data -> HDF, to make it hashable.
```python
class HasHDF:
    def to_hdf(self, hdf):
        data = self.serialize()
        data.to_hdf(hdf)

    def serialize(self):
        # return a serialisable representation of the object
        # ("cereal" is a placeholder for the actual data)
        return cereal

    def get_hash(self, **kwargs):
        try:
            # assumes the serialised data is hashable
            return hash(self.serialize())
        except NotSerializableError:
            raise
```
```python
class GenericJob(HasStorage):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.storage.create_group('input')
        self.storage.create_group('output')
        self.storage.input = self._create_default_input()

    @property
    def input(self):
        return self.storage.input

    @property
    def output(self):
        return self.storage.output

    @abstractmethod
    def _create_default_input(self):
        # to be overwritten in the child classes
        return DataContainer()
```
```python
class AtomisticsGenericJob(GenericJob):
    ...

    @property
    def structure(self):
        return self.input.structure

    @structure.setter
    def structure(self, new_structure):
        self.input.structure = new_structure
```
```python
class HeavyClass(GenericJob):
    ...

    def to_hdf(self, hdf):
        ...

    def get_hash(self, **kwargs):
        # implement `get_hash` directly if the data is too heavy for `serialize()`
        ...
```
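As a usage illustration of why hashability is attractive, here is a minimal sketch of result caching keyed on `get_hash()`. The `JobCache` class and the `run()`/`output` access pattern are assumptions made for the sake of the example, not part of the proposal above.

```python
# Hypothetical sketch: skip re-running jobs whose serialised input
# hashes to a value we have already seen. JobCache is illustrative,
# not part of the proposed interface.
class JobCache:
    def __init__(self):
        self._results = {}  # hash of serialised job -> output

    def run_or_lookup(self, job):
        key = job.get_hash()
        if key not in self._results:
            job.run()
            self._results[key] = job.output
        return self._results[key]
```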
@pmrv I cannot remember whether `serialise` should return a dictionary or a `DataContainer`. I think a `DataContainer` is more flexible, and you are already using `self.serialize().to_hdf` anyway.
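A minimal sketch of the difference, assuming `DataContainer` keeps its current `to_hdf` method and its current import path: with a `DataContainer` return value, `to_hdf` composes directly, while a plain dictionary needs an explicit wrapping step first.

```python
from pyiron_base import DataContainer  # assuming the current import path

class HasHDF:
    # sketch only: contrast the two possible return types of serialize()
    def to_hdf(self, hdf):
        data = self.serialize()
        if isinstance(data, DataContainer):
            data.to_hdf(hdf)  # a DataContainer already knows how to write itself
        else:
            DataContainer(data).to_hdf(hdf)  # a plain dict needs wrapping first
```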
Just like we discussed for the changes about not writing to the file system, I think these changes should also be made in `pyiron_contrib` first, in particular when changing the behaviour of such fundamental classes as the `SparseList` of the `Atoms` class in https://github.com/pyiron/pyiron_atomistics/pull/974.
> Just like we discussed for the changes about not writing to the file system, I think these changes should also be made in `pyiron_contrib` first, in particular when changing the behaviour of such fundamental classes as the `SparseList` of the `Atoms` class in pyiron/pyiron_atomistics#974.
Maybe I didn't make myself very clear in the last discussion, but the reason why we/I didn't want to have the LAMMPS feature in the main pyiron version was the lack of a concept, which could potentially lead to the following problems:
- Not maintainable by other people
- User interface might change
- Relations to other features become vague
- Multiple functionalities with the same purpose might coexist
- Not compatible with other features
On the other hand, in this issue we are presenting a new concept, i.e. one that would solve the exact problems named above. So the argument "because the changes about not writing to the file system had to be done elsewhere, this one should also be done elsewhere" doesn't apply.
This being said, if you have other arguments, I'm also ok with making these changes in `pyiron_contrib`. I decided to make the changes here while keeping backward compatibility throughout, so there shouldn't be any disruption, but I also don't have a strong feeling about doing it directly here.
> So the argument "because the changes about not writing to the file system had to be done elsewhere, this one should also be done elsewhere" doesn't apply.
To be more explicit: the risk I see with changing the HDF5 storage interface is that maintaining backwards compatibility is very challenging. So to me the change you propose here is orders of magnitude more drastic than the one I suggested. At the current stage you believe your proposed change is the final version; given my experience with the internal complexity of pyiron, I am strongly against this. So I can only advise to first implement a prototype in `pyiron_contrib`. This is what Liam did for the graph-based workflows in https://github.com/pyiron/pyiron_contrib/pull/577 and what I did for the pyiron interface without writing files:
- https://github.com/pyiron/pyiron_contrib/pull/584
- https://github.com/pyiron/pyiron_contrib/pull/589
- https://github.com/pyiron/pyiron_contrib/pull/590
- https://github.com/pyiron/pyiron_contrib/pull/591
- https://github.com/pyiron/pyiron_contrib/pull/599
Only after merging five different pull requests, once the implementation had reached a certain level of stability, did I even propose to change the interface in `pyiron_base`, and still I respected the feedback from the community of pyiron developers: instead of merging https://github.com/pyiron/pyiron_base/pull/1044 I moved it to a separate package, https://github.com/pyiron/pyiron_lammps. This is the typical procedure we followed for all recent developments, like
- https://github.com/pyiron/pyiron_ontology
- https://github.com/pyiron/ironflow
- https://github.com/pyiron/pylammpsmpi
Now I cannot see a single reason why we should make an exception for the discussion here.
Hm, you are failing to make your point, because we did not reject your proposal because of code stability. It was because of the lack of a concept that we rejected it. In this regard, it doesn't matter how many changes you had made in `pyiron_contrib` or how stable it would have been; we would have rejected it anyway, because we don't want to maintain lines and lines of code with no clear plan of what they are doing (or we would make other people suffer, like we did recently to @ligerzero-ai).
> maintaining backwards compatibility is very challenging
Yeah I know, so I'm gonna go a very long way. But whether I make all the changes first in `pyiron_contrib` or directly in `pyiron_base`, we must guarantee backward compatibility anyway. So I'm trying to make a lot of very small PRs to make sure that backward compatibility is guaranteed.
> This is the typical procedure we followed for all recent developments
The items you are naming are fundamentally different, because they are new features, while I'm here proposing to sort the code by principles, i.e. I'm only cleaning up the code. Critically, with the changes that I'm proposing here, the user won't see any difference (except maybe for `get_hash`, which I'm not even sure we're gonna implement anyway).
Anyway, in order to be on the safe side, here are the steps that I'm gonna take:
1. Make `DataContainer` compatible with external tags (either already done here or @pmrv will do it soon)
2. Refactor code wherever necessary to make sure that `to_hdf` does only storing (and `from_hdf` only loading), which was something @liamhuber and I also discussed here
3. Separate `to_hdf` and `serialise` to make it possible to retrieve data which is to be stored in HDF (see the sketch after this list).
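Steps 2 and 3 together could look like the following sketch; the `deserialize` counterpart and the exact signatures are my assumptions, not something that was agreed on.

```python
# Sketch of steps 2 and 3: to_hdf/from_hdf do only storing/loading,
# while serialize()/deserialize() handle the data conversion.
# deserialize() and the signatures are assumptions for illustration.
class HasHDF:
    def to_hdf(self, hdf):
        # storing only: data gathering is delegated to serialize()
        self.serialize().to_hdf(hdf)

    def from_hdf(self, hdf):
        # loading only: read the raw data, then rebuild the object state
        data = DataContainer()
        data.from_hdf(hdf)
        self.deserialize(data)
```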
So far, there shouldn't be any difference for the user. Then:
4. Unify `input` and `output` to `DataContainer` (and maybe also to `GenericParameter`, but in any case to things that do the storing and loading systematically)
5. Maybe implement `get_hash`
Except for 4., there is no big change, and probably even 4. is not a big change because we will break it down into small steps. And there will be nothing different in the user interface, except maybe for the appearance of `serialise` and potentially also `get_hash`.
By the way, it's a side story, but if this had been two years ago I would have accepted your LAMMPS changes, because at least for me pyiron was just a simulation tool at that time. But today we are offering pyiron as a workflow platform, so now it matters a lot that we have concepts on the code level.
> 2. Refactor code wherever necessary to make sure that `to_hdf` does only storing (and `from_hdf` only loading), which was something @liamhuber and I also discussed here
> 3. Separate `to_hdf` and `serialise` to make it possible to retrieve data which is to be stored in HDF.
To me 2 and 3 are big changes because they require a change of the storage interface and consequently a change of each job class which is not yet using the `DataContainer`. This is where I see the risk of breaking backwards compatibility. That is why I propose writing new classes based on the new hierarchy rather than trying to update the current interface. This gives us the freedom to migrate one user or workflow at a time.
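To illustrate the side-by-side migration, a sketch with hypothetical names (`NewGenericJob` and the `V2` suffix are placeholders, not an agreed naming scheme):

```python
# Hypothetical sketch of parallel hierarchies during migration;
# NewGenericJob and LammpsJobV2 are placeholder names.
class LammpsJob(GenericJob):
    """Current interface, left untouched for existing users and workflows."""
    ...

class LammpsJobV2(NewGenericJob):
    """New hierarchy with unified input/output; workflows opt in one at a time."""
    ...
```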
> But today we are offering pyiron as a workflow platform, so now it matters a lot that we have concepts on the code level.
I completely agree. But I disagree on the point that attaching a storage interface to each job object is the right way to move forward, just because this is how pyiron was initially developed.
During the development of pyiron I always aim at implementing conceptually clean code, so if there is a new concept that is superior to the one currently implemented in pyiron, I want to stay at least open-minded about implementing it. If we just do things the way we always did them, then I do not see any chance for pyiron to survive in the long term as the requirements are changing.
Coming back to our strategy discussion https://github.com/pyiron/team/discussions/156, I am not convinced that the job class should be derived from the storage interface. Rather, for me a job, or better a quantum engine, should just take an input dictionary and return an output dictionary, similar to what I have implemented in https://github.com/pyiron/pyiron_lammps. Based on these quantum engines, a minimal workflow could be just a quantum engine with a storage container attached to it. The storage container then saves the input as well as the output. So for an individual calculation this workflow is very similar to the `GenericJob` class we have currently. Consequently, a user would not submit a quantum engine to the queuing system but always a workflow, even if it is just a minimal workflow with a quantum engine and a storage container.

Now when the user wants to calculate an energy-volume curve with multiple calculations, they can use exactly the same workflow components. In this case the workflow is executed in serial and one structure is evaluated at a time, but by default the output of all structures is stored in one storage container, rather than one storage container per structure. If the user wants the current behaviour, they can also use a workflow of sub-workflows, where each sub-workflow just handles a single calculation. Alternatively, the user could attach two quantum engines to calculate the energy-volume curve, each doing half of the calculations, and most likely they would preferably save the output in the same storage container.
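A minimal sketch of this separation, with hypothetical names (`quantum_engine`, `MinimalWorkflow`, and the plain-list storage are placeholders, not the actual pyiron_lammps API):

```python
# Hypothetical sketch of the engine/storage separation described above.
def quantum_engine(input_dict):
    # a pure function: input dictionary in, output dictionary out
    return {"energy": ..., "forces": ...}

class MinimalWorkflow:
    def __init__(self, engine):
        self.engine = engine
        self.storage = []  # one container collects all calculations

    def run(self, input_dict):
        output = self.engine(input_dict)
        self.storage.append({"input": input_dict, "output": output})
        return output

# an energy-volume curve reuses the same components: all results
# end up in one storage container rather than one per structure
workflow = MinimalWorkflow(quantum_engine)
for strain in (0.95, 1.00, 1.05):
    workflow.run({"structure": ..., "strain": strain})
```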
I do not have a working implementation of this workflow concept, but from what I understand about Liam's work in https://github.com/pyiron/pyiron_contrib/pull/577 the concept is evolving constantly. This is why I do not want to limit myself to attaching the storage to the job just because that is what we did in the past.
Ok then, let's freeze this one until the pyiron meeting, ok? I have to admit that this issue was not well written anyway, because the title says "make jobs hashable", but what I really want to do is clean up the underlying reasons why it is not already possible (which were already discussed during the pyiron meeting last week, but you were not there anymore).