New Generic Job interface to unify input & output AND make jobs hashable
Sketch from yesterday's pyiron meeting.
Two main ideas:
- Make `job.input` and `job.output` universal
- Create an extra step between job and HDF, i.e. job -> (serialised) data -> HDF, to make it hashable.
```python
class HasHDF:
    def to_hdf(self, hdf):
        data = self.serialize()
        data.to_hdf(hdf)

    def serialize(self):
        # return a serialisable representation of the object
        # ("cereal" is a placeholder for the actual data)
        return cereal

    def get_hash(self, **kwargs):
        try:
            # assumes the serialised data is hashable
            return hash(self.serialize())
        except NotSerializableError:
            raise
```
```python
class GenericJob(HasStorage):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.storage.create_group('input')
        self.storage.create_group('output')
        self.storage.input = self._create_default_input()

    @property
    def input(self):
        return self.storage.input

    @property
    def output(self):
        return self.storage.output

    @abstractmethod
    def _create_default_input(self):
        # to be overwritten in the child classes
        return DataContainer()
```
```python
class AtomisticsGenericJob(GenericJob):
    ...

    @property
    def structure(self):
        return self.input.structure

    @structure.setter
    def structure(self, new_structure):
        self.input.structure = new_structure
```
```python
class HeavyClass(GenericJob):
    ...

    def to_hdf(self, hdf):
        ...

    def get_hash(self, **kwargs):
        # implement `get_hash` directly if the data is too heavy for `serialize()`
        ...
```
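As a usage illustration of why hashability is attractive, here is a minimal sketch of result caching keyed on `get_hash()`. The `JobCache` class and the `run()`/`output` access pattern are assumptions made for the sake of the example, not part of the proposal above.

```python
# Hypothetical sketch: skip re-running jobs whose serialised input
# hashes to a value we have already seen. JobCache is illustrative,
# not part of the proposed interface.
class JobCache:
    def __init__(self):
        self._results = {}  # hash of serialised job -> output

    def run_or_lookup(self, job):
        key = job.get_hash()
        if key not in self._results:
            job.run()
            self._results[key] = job.output
        return self._results[key]
```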
@pmrv I cannot remember whether `serialise` should return a dictionary or a `DataContainer`. I think a `DataContainer` is more flexible, and you are already using `self.serialize().to_hdf` anyway.
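A minimal sketch of the difference, assuming `DataContainer` keeps its current `to_hdf` method and its current import path: with a `DataContainer` return value, `to_hdf` composes directly, while a plain dictionary needs an explicit wrapping step first.

```python
from pyiron_base import DataContainer  # assuming the current import path

class HasHDF:
    # sketch only: contrast the two possible return types of serialize()
    def to_hdf(self, hdf):
        data = self.serialize()
        if isinstance(data, DataContainer):
            data.to_hdf(hdf)  # a DataContainer already knows how to write itself
        else:
            DataContainer(data).to_hdf(hdf)  # a plain dict needs wrapping first
```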
Just like we discussed for the changes about not writing to the file system, I think these changes should also be made in `pyiron_contrib` first, in particular when changing the behaviour of such fundamental classes as the `SparseList` of the `Atoms` class in https://github.com/pyiron/pyiron_atomistics/pull/974.
> Just like we discussed for the changes about not writing to the file system, I think these changes should also be made in `pyiron_contrib` first, in particular when changing the behaviour of such fundamental classes as the `SparseList` of the `Atoms` class in pyiron/pyiron_atomistics#974.
Maybe I didn't make myself very clear in the last discussion, but the reason why we/I didn't want to have the LAMMPS feature in the main pyiron version was the lack of a concept, which could potentially lead to the following problems:
- Not maintainable by other people
- User interface might change
- Relations to other features become vague
- Multiple functionalities with the same purpose might coexist
- Not compatible with other features
On the other hand, in this issue we are presenting a new concept, i.e. one that would solve the exact problems named above. So the argument "because the changes about not writing to the file system had to be done elsewhere, this one should also be done elsewhere" doesn't apply.
This being said, if you have other arguments, I'm also ok with making these changes in `pyiron_contrib`. I decided to make the changes here while keeping backward compatibility throughout, so there shouldn't be any disruption, but I also don't have a strong feeling about doing it directly here.
> So the argument "because the changes about not writing to the file system had to be done elsewhere, this one should also be done elsewhere" doesn't apply.
To be more explicit: the risk I see with changing the HDF5 storage interface is that maintaining backwards compatibility is very challenging. So to me the change you propose here is orders of magnitude more drastic than the one I suggested. At the current stage you believe your proposed change is the final version; given my experience with the internal complexity of pyiron, I am strongly against this. So I can only advise to first implement a prototype in `pyiron_contrib`. This is what Liam did for the graph-based workflows in https://github.com/pyiron/pyiron_contrib/pull/577 and what I did for the pyiron interface without writing files:
- https://github.com/pyiron/pyiron_contrib/pull/584
- https://github.com/pyiron/pyiron_contrib/pull/589
- https://github.com/pyiron/pyiron_contrib/pull/590
- https://github.com/pyiron/pyiron_contrib/pull/591
- https://github.com/pyiron/pyiron_contrib/pull/599
Only after merging five different pull requests, once the implementation had reached a certain level of stability, did I even propose to change the interface in `pyiron_base`, and still I respected the feedback from the community of pyiron developers: instead of merging https://github.com/pyiron/pyiron_base/pull/1044 I moved it to a separate package, https://github.com/pyiron/pyiron_lammps. This is the typical procedure we followed for all recent developments, like
- https://github.com/pyiron/pyiron_ontology
- https://github.com/pyiron/ironflow
- https://github.com/pyiron/pylammpsmpi
Now I cannot see a single reason why we should make an exception for the discussion here.
Hm, you are failing to make your point, because we did not reject your proposal because of code stability. It was because of the lack of a concept that we rejected it. In this regard, it doesn't matter how many changes you had made in `pyiron_contrib` or how stable it would have been; we would have rejected it anyway, because we don't want to maintain lines and lines of code with no clear plan of what they are doing (or we would make other people suffer, like we did recently to @ligerzero-ai).
> maintaining backwards compatibility is very challenging
Yeah I know, so I'm gonna go a very long way. But whether I make all the changes first in `pyiron_contrib` or directly in `pyiron_base`, we must guarantee backward compatibility anyway. So I'm trying to make a lot of very small PRs to make sure that backward compatibility is guaranteed.
> This is the typical procedure we followed for all recent developments
The items you are naming are fundamentally different, because they are new features, while I'm here proposing to sort the code by principles, i.e. I'm only cleaning up the code. Critically, with the changes that I'm proposing here, the user won't see any difference (except maybe for `get_hash`, which I'm not even sure we're gonna implement anyway).
Anyway, in order to be on the safe side, here are the steps that I'm gonna take:
1. Make `DataContainer` compatible with external tags (either already done here or @pmrv will do it soon)
2. Refactor code wherever necessary to make sure that `to_hdf` does only storing (and `from_hdf` only loading), which was something @liamhuber and I also discussed here
3. Separate `to_hdf` and `serialise` to make it possible to retrieve data which is to be stored in HDF (see the sketch after this list).
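Steps 2 and 3 together could look like the following sketch; the `deserialize` counterpart and the exact signatures are my assumptions, not something that was agreed on.

```python
# Sketch of steps 2 and 3: to_hdf/from_hdf do only storing/loading,
# while serialize()/deserialize() handle the data conversion.
# deserialize() and the signatures are assumptions for illustration.
class HasHDF:
    def to_hdf(self, hdf):
        # storing only: data gathering is delegated to serialize()
        self.serialize().to_hdf(hdf)

    def from_hdf(self, hdf):
        # loading only: read the raw data, then rebuild the object state
        data = DataContainer()
        data.from_hdf(hdf)
        self.deserialize(data)
```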
So far, there shouldn't be any difference for the user. Then:
4. Unify `input` and `output` to `DataContainer` (and maybe also to `GenericParameter`, but in any case to things that do the storing and loading systematically)
5. Maybe implement `get_hash`
Except for 4., there is no big change, and probably even 4. is not a big change because we will break it down into small steps. And there will be nothing different in the user interface, except maybe for the appearance of `serialise` and potentially also `get_hash`.
By the way, it's a side story, but if this had been two years ago I would have accepted your LAMMPS changes, because at least for me pyiron was just a simulation tool at that time. But today we are offering pyiron as a workflow platform, so now it matters a lot that we have concepts on the code level.
> 2. Refactor code wherever necessary to make sure that `to_hdf` does only storing (and `from_hdf` only loading), which was something @liamhuber and I also discussed here
> 3. Separate `to_hdf` and `serialise` to make it possible to retrieve data which is to be stored in HDF.
To me 2 and 3 are big changes because they require a change of the storage interface and consequently a change of each job class which is not yet using the `DataContainer`. This is where I see the risk of breaking backwards compatibility. That is why I propose writing new classes based on the new hierarchy rather than trying to update the current interface. This gives us the freedom to migrate one user or workflow at a time.
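To illustrate the side-by-side migration, a sketch with hypothetical names (`NewGenericJob` and the `V2` suffix are placeholders, not an agreed naming scheme):

```python
# Hypothetical sketch of parallel hierarchies during migration;
# NewGenericJob and LammpsJobV2 are placeholder names.
class LammpsJob(GenericJob):
    """Current interface, left untouched for existing users and workflows."""
    ...

class LammpsJobV2(NewGenericJob):
    """New hierarchy with unified input/output; workflows opt in one at a time."""
    ...
```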
> But today we are offering pyiron as a workflow platform, so now it matters a lot that we have concepts on the code level.
I completely agree. But I disagree on the point that attaching a storage interface to each job object is the right way to move forward, just because this is how pyiron was initially developed.
During the development of pyiron I always aim at implementing conceptually clean code, so if there is a new concept that is superior to the one currently implemented in pyiron, I want to stay at least open-minded about implementing it. If we just do things the way we always did them, then I do not see any chance for pyiron to survive in the long term as the requirements are changing.
Coming back to our strategy discussion https://github.com/pyiron/team/discussions/156, I am not convinced that the job class should be derived from the storage interface. Rather, for me a job, or better a quantum engine, should just take an input dictionary and return an output dictionary, similar to what I have implemented in https://github.com/pyiron/pyiron_lammps. Based on these quantum engines, a minimal workflow could be just a quantum engine with a storage container attached to it. The storage container then saves the input as well as the output. So for an individual calculation this workflow is very similar to the `GenericJob` class we have currently. Consequently, a user would not submit a quantum engine to the queuing system but always a workflow, even if it is just a minimal workflow with a quantum engine and a storage container.

Now when the user wants to calculate an energy-volume curve with multiple calculations, they can use exactly the same workflow components. In this case the workflow is executed in serial and one structure is evaluated at a time, but by default the output of all structures is stored in one storage container, rather than one storage container per structure. If the user wants the current behaviour, they can also use a workflow of sub-workflows, where each sub-workflow just handles a single calculation. Alternatively, the user could attach two quantum engines to calculate the energy-volume curve, each doing half of the calculations, and most likely they would preferably save the output in the same storage container.
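A minimal sketch of this separation, with hypothetical names (`quantum_engine`, `MinimalWorkflow`, and the plain-list storage are placeholders, not the actual pyiron_lammps API):

```python
# Hypothetical sketch of the engine/storage separation described above.
def quantum_engine(input_dict):
    # a pure function: input dictionary in, output dictionary out
    return {"energy": ..., "forces": ...}

class MinimalWorkflow:
    def __init__(self, engine):
        self.engine = engine
        self.storage = []  # one container collects all calculations

    def run(self, input_dict):
        output = self.engine(input_dict)
        self.storage.append({"input": input_dict, "output": output})
        return output

# an energy-volume curve reuses the same components: all results
# end up in one storage container rather than one per structure
workflow = MinimalWorkflow(quantum_engine)
for strain in (0.95, 1.00, 1.05):
    workflow.run({"structure": ..., "strain": strain})
```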
I do not have a working implementation of this workflow concept, but from what I understand about Liam's work in https://github.com/pyiron/pyiron_contrib/pull/577 the concept is evolving constantly. This is why I do not want to limit myself to attaching the storage to the job just because that is what we did in the past.
Ok then, let's freeze this one until the pyiron meeting, ok? I have to admit that this issue was not well written anyway, because the title says "make jobs hashable", but what I really want to do is clean up the underlying reasons why it is not already possible (which were already discussed during the pyiron meeting last week, but you were not there anymore).