pyiron_atomistics
Create a method to write atomistics data summary in DataFrame format, which can be output as a file for sharing
Somewhat closely related: #831
TL;DR: Create a method to generate a file containing a lightweight structure/energy/forces summary, similar to `TrainingContainer`, to allow lightweight sharing of data.
As a lightweight way of transferring/sharing data between different users in the workshop (and arguably just a really good idea in general), we found that it would be really nice to be able to share dataframes containing training data, which already comes in a standardised form in `TrainingContainer`.
It is generally desirable to be able to write df objects that contain atomic structure information (stored Atoms objects) to a file, but this is unfortunately not possible, as ASE Atoms objects are not json-able (#643; #831 is related, as pickling is also not possible). There is a workaround for pickling, but it relies on the really buggy pack/unpack functionality (#831). I would argue that pack/unpack is just unnecessarily heavy for this use-case, in which all one really needs is a dataframe that should be json-able or writable to some other normal file format.
It should "just work" for users and not require user intervention or workarounds that are only known through prior debugging efforts.
Therefore, I propose the inclusion of a `write_TrainingContainer_df` and `read_TrainingContainer_df` pair of functions. I would even argue that this information is useful beyond just training containers. E.g. if I want to share data with a collaborator, I could just send a file (e.g. a data.json), which they can read using the paired function, or even without a pyiron install.
`write_TrainingContainer_df` should generate a training container for ACE-fitting (which also needs a conversion of stresses from what is read from pyiron (3x3) to what the ACE routine requires for training (6x1)). It contains a hack in the form of creating a `Structure` column which holds json-able structures, and writes the df to a `json` file.
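The 3x3 → 6x1 stress conversion mentioned above is the usual Voigt reduction of a symmetric tensor. A minimal numpy sketch; note that the exact component ordering expected by a given ACE routine is an assumption here and should be checked against its documentation:

```python
import numpy as np

def full_to_voigt(stress):
    """Convert a symmetric 3x3 stress tensor to a 6-component Voigt
    vector, here ordered (xx, yy, zz, yz, xz, xy)."""
    s = np.asarray(stress)
    return np.array([s[0, 0], s[1, 1], s[2, 2], s[1, 2], s[0, 2], s[0, 1]])

stress_3x3 = np.array([[1.0, 0.6, 0.5],
                       [0.6, 2.0, 0.4],
                       [0.5, 0.4, 3.0]])
voigt = full_to_voigt(stress_3x3)  # -> [1.0, 2.0, 3.0, 0.4, 0.5, 0.6]
```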
`read_TrainingContainer_df` should read the `json` file container for ACE-fitting and return a df that is directly compatible/concat-able/append-able with an existing TrainingContainer's dataframe, which can then be fed to the fitting job.
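Since ASE Atoms objects are not json-able, the hack boils down to storing each structure as a plain dict of lists, after which a pandas round trip through `to_json`/`read_json` works. A minimal sketch; the dict layout below is illustrative, not the actual TrainingContainer schema:

```python
import pandas as pd

def atoms_to_dict(symbols, positions, cell):
    """Represent a structure as plain json-able types instead of an
    ASE Atoms object (illustrative layout)."""
    return {"symbols": list(symbols),
            "positions": [list(p) for p in positions],
            "cell": [list(c) for c in cell]}

df = pd.DataFrame({
    "job_name": ["Al_fcc"],
    "structure": [atoms_to_dict(["Al"], [[0.0, 0.0, 0.0]],
                                [[4.05, 0, 0], [0, 4.05, 0], [0, 0, 4.05]])],
    "energy": [-3.36],
})
df.to_json("data.json", orient="records")        # write side
df_back = pd.read_json("data.json", orient="records")  # read side
```

On the read side, the nested structure dicts come back as plain dicts in an object column, ready to be converted back to Atoms by the paired function.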
If we make the method more generic, I would argue that a method like `pr.create_data_df`, with a submethod `to_json` for writing to a json file, would be really nice. It would contain the basic standard atomistics data that most people are interested in (job_name, structure, energy, forces, stresses). One could also imagine allowing other columns to be added, parsed upon explicit request from the user (e.g. magmom lists, etc.).
So essentially it would work like
```python
df = pr.create_data_df(default_cols=["job_name", "structure", "energy", "forces", "stresses"])
```
and querying the docstring would return a list of additional raw data outputs which can also be parsed upon request to build the table.
Let me know if this already exists. I am aware of the pyiron tables functionality, but it seems very clunky in comparison, and I think this style of function would be way more convenient for pyiron users by default. For projects with huge numbers of calculations, pyiron tables would be the way to go, but I think most projects fall well below the scale at which pyiron tables adds more value (e.g. not iterating over millions of jobs over and over again) than it costs in clunkiness compared to the functionality proposed here.
One can also imagine its use in standardised generation of data summary files in accompaniment to publications, for example.
I am in principle on board with export types other than our hdf for specific classes like the `TrainingContainer`, which could probably have a method to export e.g. json (once we fixed the nasty "I cannot pickle this" behavior... Could we not use the dict produced by the `to_hdf` routine to make a json, for which we would then also know how to read it back in?). Maybe this could be done more generically: use the `to_hdf` functionality as we do now, but get a json out of it?
What I do not like that much is having a different `pyiron_table` which only provides the data frame. If this is desirable at the project level, I could think about a convenience function that runs a `pyiron_table` under the hood and returns its data frame directly. This combines the "easy access" with the functionality of the `pyiron_table`, which might be extended if needed.
Is the `TrainingContainer`-related functionality not covered by `train.to_pandas().to_pickle(...)` and `train.include_data(pd.load_pickle(...))` (or whatever the pandas function to load a pickled dataframe is called again)?
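For reference, the pandas loader alluded to here is `pd.read_pickle` (a module-level function, not a DataFrame method). Pickle happily stores arbitrary Python objects such as Atoms, but is Python-specific and unsafe to load from untrusted sources, which is why json was proposed for sharing. The round trip looks like:

```python
import pandas as pd

# any dataframe, e.g. one produced by train.to_pandas()
df = pd.DataFrame({"job_name": ["a", "b"], "energy": [-3.36, -3.35]})

df.to_pickle("train_df.pkl")            # DataFrame method for writing
df_back = pd.read_pickle("train_df.pkl")  # module-level function for reading
```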
For the more generic request, I would rather modify `project.create_table` to take the list of columns that you propose and transparently create the table and return the output. We have a lot of default functions defined already for the pyiron table, and I don't see a point in reimplementing them.
So just to make sure we're on the same page: modify the current `pr.create_table()` method to generate the table that I'm describing here by default (job name, structure, energy, forces, stresses)?
> Is the `TrainingContainer`-related functionality not covered by `train.to_pandas().to_pickle(...)` and `train.include_data(pd.load_pickle(...))` (or whatever the pandas function to load a pickled dataframe is called again)?
Yes, going to and from pandas should be sufficient for most cases. The only nice addition for sharing would be directly from/to disk, but that is only nice to have :)
> So just to make sure we're on the same page; modify the current `pr.create_table()` method to generate the table that I'm describing here by default? (job name, structure, energy, forces, stresses)?
Yeah, it could also do `table.run(); return table.get_dataframe()` imo.
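The convenience wrapper discussed in this thread could look roughly like the sketch below. This is hypothetical: `_StubTable` merely stands in for a real pyiron table job (obtained via `pr.create_table()`) so the example is self-contained, and only the `run()`/`get_dataframe()` pair mirrors the actual pyiron API:

```python
import pandas as pd

class _StubTable:
    """Stand-in for a pyiron TableJob; real code would use pr.create_table()."""
    def __init__(self, records):
        self._records = records
        self._df = None

    def run(self):
        # a real table job would iterate over the project's jobs here
        self._df = pd.DataFrame(self._records)

    def get_dataframe(self):
        return self._df

def table_to_df(table):
    # the proposed one-liner: run the table, hand back its dataframe
    table.run()
    return table.get_dataframe()

df = table_to_df(_StubTable([{"job_name": "j1", "energy": -3.36}]))
```

The benefit of this shape is that the "easy access" path and the full `pyiron_table` machinery stay one and the same object, so extended column definitions keep working.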