emmet [Feature Request]: Better handling of parsed trajectory in VASP calculations

Problem

The calculation model currently used for MD simulations in atomate2 relies on the store_trajectory option of the Calculation model. I think there are some problems with how the trajectory is currently handled. Those related to emmet concern the amount of data that is stored: for each frame of the Trajectory object, the whole IonicStep is stored as well: https://github.com/materialsproject/emmet/blob/ebbe8d5124c9e2bd5b743698836c7a17c6d693a5/emmet-core/emmet/core/vasp/calculation.py#L732-L734

Considering that this feature is used to store the output of (potentially long) MD simulations, this causes some issues:

- Each ionic step contains the Structure of that step, so each Structure is essentially stored twice in the Trajectory. The structure stored in the IonicStep does not seem to include any additional property, so it is entirely redundant.
- For each ionic step, all the SCF steps are stored as well, which amounts to a considerable amount of data in the DB.
- Some additional information that is only available for MD simulations is not considered. In particular, I am referring to the temperature, which would be useful for post-processing analysis.

The first two points are not blocking issues, but when storing the output of long simulations they will probably have a big impact on the size of the stored data. Combined, they will probably take roughly 3x the space that would be needed if only the structure and the final properties were stored for each step.

Parsing a large MD trajectory may also be demanding in terms of resources, and in general I believe this could optionally be handled by extracting the data from the new VASP HDF5 output file rather than from the vasprun.xml. So this issue might be linked to the one I opened in atomate2: https://github.com/materialsproject/atomate2/issues/515

Also tagging @utf and @mjwen as this may impact the MD flow in atomate2.

Proposed Solution

I am aware that, if the trajectory is stored, its content also covers that of the ionic_steps in the CalculationOutput model. But I think it would be reasonable to make the following changes:

Remove the Structure from the frame properties to avoid redundancy. This should be easy to do and I don't see any downside. The structure attribute is even optional in IonicStep.
I don't see a strict need for the electronic steps to be present in a long MD simulation, but there are probably cases where they might be needed. I would thus propose making the store_trajectory attribute a multi-valued flag that allows storing either the full IonicStep (except for the Structure) or a subset of the data. To preserve backward compatibility this could be defined as
```
store_trajectory: bool | Literal["partial"] = False,
```
where the bool values have the same meaning as the current option, while the "partial" value removes the electronic steps from the dictionary. I would make this the default for the atomate2 MD workflows.
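As a rough sketch of how the three values of the flag could behave: this is not the emmet implementation, and the function name `select_frame_properties` and the dict keys in the usage example are hypothetical, with each ionic step assumed to be a plain dict.

```python
from __future__ import annotations

from typing import Any, Literal, Optional, Union

# Hypothetical alias mirroring the proposed signature.
StoreTrajectory = Union[bool, Literal["partial"]]


def select_frame_properties(
    ionic_steps: list[dict[str, Any]],
    store_trajectory: StoreTrajectory = False,
) -> Optional[list[dict[str, Any]]]:
    """Return the per-frame properties, or None if no trajectory is stored."""
    if store_trajectory is False:
        return None

    # The structure is already held by the trajectory frame itself,
    # so it is dropped in every mode that stores a trajectory.
    drop = {"structure"}
    if store_trajectory == "partial":
        # "partial" additionally drops the SCF (electronic) steps.
        drop.add("electronic_steps")
    elif store_trajectory is not True:
        raise ValueError(f"Unsupported store_trajectory value: {store_trajectory!r}")

    return [
        {key: value for key, value in step.items() if key not in drop}
        for step in ionic_steps
    ]


if __name__ == "__main__":
    # Illustrative data only; the keys are assumptions, not a real parsed run.
    steps = [
        {
            "e_fr_energy": -10.0,
            "forces": [[0.0, 0.0, 0.1]],
            "structure": {"sites": []},
            "electronic_steps": [{"e_fr_energy": -9.9}],
        }
    ]
    print(select_frame_properties(steps, "partial"))
```

Keeping `False` and `True` as valid values means existing callers would not need to change, which is the backward-compatibility point made above.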

Concerning the temperature, this could be retrieved from the OSZICAR and is already available in the Oszicar object in pymatgen. This would require parsing an additional file, but, as far as I can see, the values are not present in the vasprun.xml. I am not sure whether this would fit better in the frame_properties of the trajectory or as an attribute of the model on its own.
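To illustrate the kind of data involved, here is a minimal sketch that pulls the per-step temperature out of OSZICAR text with a regular expression. It assumes the usual MD summary line of the form `<step> T= ... E= ... F= ...`; the helper name `md_temperatures` and the sample text are made up for illustration, and in practice the pymatgen Oszicar object mentioned above would be the natural source rather than a hand-written regex.

```python
import re

# Assumed layout of the per-ionic-step summary line written for MD runs:
#   "    1 T=   300. E= -.13655511E+03 F= ..."
# Electronic-step lines start with a label such as "DAV:" and do not match.
MD_LINE = re.compile(r"^\s*(\d+)\s+T=\s*([-+.\dEe]+)")


def md_temperatures(oszicar_text: str) -> list:
    """Return the temperature reported for each ionic step, in file order."""
    temperatures = []
    for line in oszicar_text.splitlines():
        match = MD_LINE.match(line)
        if match:
            temperatures.append(float(match.group(2)))
    return temperatures


if __name__ == "__main__":
    # Illustrative text only, not output from a real calculation.
    sample = (
        "       N       E                     dE\n"
        "DAV:   1    -0.136E+03   -0.136E+03\n"
        "    1 T=   300. E= -.13655511E+03 F= -.13759165E+03 E0= -.13759027E+03\n"
        "DAV:   1    -0.137E+03   -0.100E+01\n"
        "    2 T=   295. E= -.13655490E+03 F= -.13757403E+03 E0= -.13757317E+03\n"
    )
    print(md_temperatures(sample))
```

The resulting list has one entry per ionic step, so it could be attached either to the frame properties or to a separate attribute of the model, which is the open question raised above.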

If these suggestions are fine I can open a PR with their implementation. Otherwise, do you have other suggestions on how to modify the current model?

Alternatives

No response

Oct 26 '23 11:10 gpetretto

The Trajectory object itself also has a very large file size compared to other file formats which store trajectories (anecdotal), I think maybe 3-5x. I wonder if it could be stored more compactly, since the benefit of storing it as JSON isn’t really realised: individual keys within the Trajectory are rarely used for searching etc.
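One possible direction, sketched here only as an illustration, is a column-oriented layout where all coordinates go into a single float64 buffer instead of one JSON document per frame. This is not how pymatgen's Trajectory is serialized; the helper names `pack_positions` and `unpack_positions` are hypothetical, and whether it actually saves space would need to be measured on real trajectories.

```python
from array import array


def pack_positions(frames):
    """Flatten per-frame [n_atoms][3] coordinates into one float64 buffer."""
    n_frames = len(frames)
    n_atoms = len(frames[0]) if frames else 0
    flat = array("d", (x for frame in frames for atom in frame for x in atom))
    return {"n_frames": n_frames, "n_atoms": n_atoms, "data": flat.tobytes()}


def unpack_positions(packed):
    """Rebuild the nested [n_frames][n_atoms][3] lists from the buffer."""
    flat = array("d")
    flat.frombytes(packed["data"])
    values = iter(flat)
    return [
        [[next(values), next(values), next(values)] for _ in range(packed["n_atoms"])]
        for _ in range(packed["n_frames"])
    ]


if __name__ == "__main__":
    # Two frames of a two-atom system; the numbers are illustrative only.
    frames = [
        [[0.0, 0.0, 0.0], [1.5, 1.5, 1.5]],
        [[0.01, 0.0, 0.0], [1.49, 1.5, 1.52]],
    ]
    packed = pack_positions(frames)
    assert unpack_positions(packed) == frames
```

The buffer costs a fixed 8 bytes per coordinate, and the shape metadata is stored once rather than repeated for every frame.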

Oct 26 '23 16:10 mkhorton

@gpetretto thanks for your post on this. I have no objections on the pure emmet side. Happy to push through whatever the atomate2 folks agree with.

Oct 31 '23 22:10 munrojm

Thanks @gpetretto! I've seen your implementation in issue #886 and I agree with your proposed approach. Definitely, we need to reduce the redundancy.

Nov 14 '23 16:11 mjwen

> The Trajectory object itself also has a very large file size compared to other file formats which store trajectories (anecdotal), I think maybe 3-5x.

@mkhorton Interesting observation! Would be great if we could further reduce the storage need.

Can we easily serialize an MSONable object into other formats like HDF5? Or, by "other file formats which store trajectories", do you mean file formats specifically defined for this, like LAMMPS .dump or ASE trajectory?

Nov 14 '23 16:11 mjwen
