
Create a method to write atomistics data summary in DataFrame format, which can be output as a file for sharing

ligerzero-ai opened this issue 2 years ago · 6 comments

Somewhat closely related: #831

TL;DR: Create a method to generate a file containing a lightweight structure/energy/forces summary, similar to TrainingContainer, enabling easy sharing of data.

As a lightweight way of transferring/sharing data between different users in the workshop (and arguably just a really good idea in general), we found that it would be really nice to be able to share dataframes containing training data, which already comes in a standardised form in TrainingContainer.

It is generally desirable to be able to write df objects that contain atomic structure information (in the form of stored Atoms objects) to a file, but this is unfortunately not possible, as ASE Atoms objects are not json-able. #643 and #831 are related, since pickling is also not possible. There is a workaround for pickling, but it unfortunately relies on the really buggy pack/unpack functionality (#831). I would argue that packing/unpacking is just unnecessarily heavy for this use-case, in which all one really needs is a dataframe that should be json-able or writable to some normal file format.
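The "json-able structure" workaround can be as simple as flattening the few fields a structure needs into plain Python lists. A minimal sketch (`structure_to_dict` is a hypothetical helper, not a pyiron or ASE function; note that ASE's own `Atoms.todict()` still returns numpy arrays, which json cannot serialize directly):

```python
import json

def structure_to_dict(symbols, positions, cell, pbc):
    """Flatten the minimal fields of an atomic structure into plain
    lists so that json.dumps accepts the result.
    (Hypothetical helper, not part of pyiron or ASE.)"""
    return {
        "symbols": list(symbols),
        "positions": [list(map(float, p)) for p in positions],
        "cell": [list(map(float, v)) for v in cell],
        "pbc": [bool(p) for p in pbc],
    }

# A two-atom toy cell serializes without complaint:
d = structure_to_dict(
    symbols=["Fe", "Fe"],
    positions=[[0.0, 0.0, 0.0], [1.4, 1.4, 1.4]],
    cell=[[2.8, 0, 0], [0, 2.8, 0], [0, 0, 2.8]],
    pbc=[True, True, True],
)
blob = json.dumps(d)
```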

It should "just work" for users and not require user intervention or hackarounds that are only known through prior debugging efforts.

Therefore, I propose adding a `write_TrainingContainer_df` and `read_TrainingContainer_df` pair. I would even argue that this information is useful beyond just training containers: e.g. if I want to share data, I could just send a file (e.g. a data.json), which another user can read with the paired function, or even without a pyiron install.

`write_TrainingContainer_df` should generate a training container suitable for ACE fitting (which also requires converting the stresses from the 3x3 tensors that pyiron reads to the 6x1 form the ACE routine expects for training). It contains a hack in the form of a Structure column holding json-able structures, and writes the df to a json file.
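The 3x3-to-6x1 stress conversion is a small Voigt flattening; a sketch, where the function name and the component order [xx, yy, zz, yz, xz, xy] are assumptions that should be checked against the ACE fitting code:

```python
import numpy as np

def stress_to_voigt(stress_3x3):
    """Flatten a symmetric 3x3 stress tensor (pyiron's stored format)
    into a 6-component Voigt vector [xx, yy, zz, yz, xz, xy].
    (Component order is assumed; verify against the fitting routine.)"""
    s = np.asarray(stress_3x3)
    return np.array([s[0, 0], s[1, 1], s[2, 2], s[1, 2], s[0, 2], s[0, 1]])

# Diagonal entries become the first three components,
# off-diagonal shear entries the last three:
voigt = stress_to_voigt([[1.0, 6.0, 5.0],
                         [6.0, 2.0, 4.0],
                         [5.0, 4.0, 3.0]])
```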

`read_TrainingContainer_df` should read the json file for ACE fitting and return a df that is directly compatible/concat-able/append-able with an existing TrainingContainer's dataframe, which can then be fed to the fitting job.

If we make the method more generic, I would argue that a method like `pr.create_data_df` with a `to_json` submethod for writing to a json file would be really nice. It would contain the basic standard atomistics data that most people are interested in (job_name, structure, energy, forces, stresses). One could also imagine allowing other columns to be added, parsed upon explicit request from the user (e.g. magmom lists, etc.).

So essentially it would work like

```python
df = pr.create_data_df(default_cols=["job_name", "structure", "energy", "forces", "stresses"])
```

and querying the docstring returns a list of added raw data outputs which can also be parsed upon request to create the table.
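A minimal sketch of what such a method could build internally, with pyiron job access replaced by plain attribute lookups so it runs standalone (`create_data_df` and its column handling are the proposal here, not an existing pyiron API):

```python
from types import SimpleNamespace
import pandas as pd

DEFAULT_COLS = ["job_name", "structure", "energy", "forces", "stresses"]

def create_data_df(jobs, default_cols=DEFAULT_COLS, extra_cols=()):
    """Collect the requested per-job fields into one DataFrame.
    Extra columns (e.g. 'magmoms') are parsed only on explicit request.
    (Hypothetical sketch of the proposed pr.create_data_df.)"""
    cols = list(default_cols) + list(extra_cols)
    rows = [{c: getattr(job, c) for c in cols} for job in jobs]
    return pd.DataFrame(rows, columns=cols)

# Two fake "jobs" stand in for parsed pyiron output:
jobs = [
    SimpleNamespace(job_name="job_0", structure="<struct>", energy=-8.1,
                    forces=[[0.0, 0.0, 0.0]], stresses=[0.0] * 6),
    SimpleNamespace(job_name="job_1", structure="<struct>", energy=-8.3,
                    forces=[[0.1, 0.0, 0.0]], stresses=[0.0] * 6),
]
df = create_data_df(jobs)
```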

Let me know if it already exists. I am aware of the pyiron tables functionality, but it seems very clunky in comparison, and I think this style of function would be way more convenient for pyiron users by default. For projects with huge numbers of calculations, pyiron tables would be the way to go, but I think most projects fall well below the scale at which the advantages of pyiron tables (e.g. not iterating over millions of jobs over and over again) outweigh its clunkiness compared to the functionality proposed here.

One can also imagine its use in the standardised generation of data summary files to accompany publications, for example.

ligerzero-ai avatar Oct 22 '22 22:10 ligerzero-ai

I am in principle on board with export types other than our hdf for specific classes like the TrainingContainer, which probably could have a method to export e.g. json (once we have fixed the nasty "I cannot pickle this" behavior; could we not use the dict produced by the to_hdf routine to make a json, for which we would then also know how to read it back in?). Maybe this could be done more generically: use the to_hdf machinery as we do now, but to get a json out of it?
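Serializing a to_hdf-style nested dict to json mostly comes down to converting numpy values on the way out; a sketch (the dict layout here is illustrative, not pyiron's actual to_hdf schema):

```python
import json
import numpy as np

def hdf_dict_to_json(data):
    """Dump a nested dict of scalars/arrays to a json string,
    converting numpy types to plain Python along the way.
    (Illustrative helper, not pyiron API.)"""
    def default(obj):
        if isinstance(obj, np.ndarray):
            return obj.tolist()
        if isinstance(obj, np.generic):
            return obj.item()
        raise TypeError(f"not json-able: {type(obj)}")
    return json.dumps(data, default=default)

# A toy payload mimicking per-job output:
payload = {"job_name": "job_0",
           "energy": np.float64(-8.1),
           "forces": np.zeros((2, 3))}
blob = hdf_dict_to_json(payload)
```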

What I do not like that much is having a different pyiron_table which only provides the data frame. If this is desirable at the project level, I could imagine a convenience function that runs a pyiron_table under the hood and returns its data frame directly. This combines the 'easy access' with the functionality of the pyiron_table, which can be extended if needed.

niklassiemer avatar Oct 23 '22 09:10 niklassiemer

Is the TrainingContainer related functionality not covered by train.to_pandas().to_pickle(...) and train.include_data(pd.load_pickle(...)) (or whatever the pandas function to load a pickle dataframe is called again)?
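The pandas side of that round-trip (`pd.read_pickle` is the loader being reached for here) works like this on any DataFrame:

```python
import os
import tempfile
import pandas as pd

# Stand-in for what train.to_pandas() would return:
df = pd.DataFrame({"job_name": ["job_0"], "energy": [-8.1]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train.pkl")
    df.to_pickle(path)               # the write half of the round-trip
    restored = pd.read_pickle(path)  # what would feed train.include_data(...)
```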

pmrv avatar Oct 23 '22 12:10 pmrv

For the more generic request, I would rather modify project.create_table to take the list of columns that you propose and transparently create the table and return the output. We have a lot of default functions defined already for the pyiron table and I don't see the point in reimplementing them.

pmrv avatar Oct 23 '22 12:10 pmrv

So just to make sure we're on the same page; modify the current pr.create_table() method to generate the table that I'm describing here by default? (job name, structure, energy, forces, stresses)?

ligerzero-ai avatar Oct 23 '22 12:10 ligerzero-ai

> Is the TrainingContainer related functionality not covered by train.to_pandas().to_pickle(...) and train.include_data(pd.load_pickle(...)) (or whatever the pandas function to load a pickle dataframe is called again)?

Yes, going to and from pandas should be sufficient for most cases. The only nice addition for sharing would be directly from/to disk, but that is only nice to have :)

niklassiemer avatar Oct 23 '22 12:10 niklassiemer

> So just to make sure we're on the same page; modify the current pr.create_table() method to generate the table that I'm describing here by default? (job name, structure, energy, forces, stresses)?

Yeah, it could also do `table.run(); return table.get_dataframe()` imo.
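That wrapper is only a few lines; here with a stub table object so the sketch runs without pyiron (the real version would call pr.create_table() and configure the requested columns first):

```python
import pandas as pd

def table_to_df(table):
    """Run the table job and hand its dataframe straight back,
    combining creation and access in one call."""
    table.run()
    return table.get_dataframe()

class StubTable:
    """Minimal stand-in for a pyiron table job, for demonstration only."""
    def run(self):
        self._df = pd.DataFrame({"job_name": ["job_0"], "energy": [-8.1]})
    def get_dataframe(self):
        return self._df

df = table_to_df(StubTable())
```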

pmrv avatar Oct 23 '22 12:10 pmrv