kinisi icon indicating copy to clipboard operation
kinisi copied to clipboard

Confusing inputs for the time between steps in the file

Open arm61 opened this issue 1 year ago • 29 comments

kinisi uses the same terminology as pymatgen to define the time between the steps in an input. That is a time_step (the simulation time step for the integrator) and the step_skip (the output frequency to the file).

This is confusing and a stumbling block when new to kinisi.

Therefore, I propose that the meaning of time_step is changed to the time step between trajectory steps in the file and that step_skip is removed entirely. I welcome any thoughts on this.

arm61 avatar Jul 04 '24 14:07 arm61

Yeah, this can be a source of confusion, and the change is a good move. Still, I would introduce a new terminology perhaps time_gap or similar to prevent confusion with people working from notebooks given to them by others as examples without anyone being aware of the change!

user200000 avatar Jul 04 '24 14:07 user200000

Yeah, this can be a source of confusion, and the change is a good move. Still, I would introduce a new terminology perhaps time_gap or similar to prevent confusion with people working from notebooks given to them by others as examples without anyone being aware of the change!

I agree. I think it is best to only take one value from the user, that being the time between frames in the trajectory. This may be future proofing for introducing something like pint.

How will this play with subsampling frames?

jd15489 avatar Jul 04 '24 14:07 jd15489

Still, I would introduce a new terminology perhaps time_gap or similar to prevent confusion

My preference would be to keep the name time_step but if the user provides a step_skip then an error will be raised. Hopefully, removing the confusion.

How will this play with subsampling frames?

I think this would not effect the operation of the subsampling frames.

arm61 avatar Jul 04 '24 14:07 arm61

I think the error plan would help a lot, but I worry that time_step will have a particular MD-based meaning for the user base, and may cause confusion. I could see less experienced scientists getting tangled up with this and not realising it (some trajectory formats include a timestep number and they could assume this is read somehow). There are edge cases with aimd where people could end up publishing incorrect MSDs and diffusion coefficients because they're not so far off as to be preposterous and so the error doesn't get caught.

It wouldn't be Kinisi's fault in this circumstance but I think we owe it to the broader community to guard against such errors.

user200000 avatar Jul 04 '24 14:07 user200000

time_step and step_skip are good names IMO. They both mean what they say. And they have the advantage of being the same as in pymatgen.

bjmorgan avatar Jul 04 '24 14:07 bjmorgan

And they have the advantage of being the same as in pymatgen.

Not everyone is using pymatgen, and hopefully eventually everyone is using kinisi before using pymatgen.

arm61 avatar Jul 04 '24 14:07 arm61

I'm not sure having the same names as another package is advantageous. It only helps a subset of the user base (those familiar with pymatgen), and if pymatgen changes the names in the future this will be moot. Further, I agree that time_step could be confused for the MD integration timestep. timestep is also used by some packages (pymatgen included) to refer to the current timestep, this could cause further confusion.

jd15489 avatar Jul 04 '24 14:07 jd15489

I am coming round to time_step not being a good choice. I don’t really like time_gap, any other ideas?

arm61 avatar Jul 04 '24 14:07 arm61

time_spacing temporal_spacing stride_length (inspired by i-pi)

user200000 avatar Jul 04 '24 15:07 user200000

Potentially step_write_frequency? It's a little long winded, but i think it gets the point across.

osoulas avatar Jul 04 '24 15:07 osoulas

dt frame_step frame_time frame_time_step frame_gap

jd15489 avatar Jul 04 '24 15:07 jd15489

frame_step is my favourite so far. Followed by stride_length and stride_step.

arm61 avatar Jul 04 '24 15:07 arm61

frame_interval

osoulas avatar Jul 04 '24 15:07 osoulas

I am not a fan of the *_interval options. This is because I often talk about $\Delta t$ as the "time interval", and in the past, some people have been confused and think that the ballistic regime is times before the simulation has equilibrated (this is not the case). So, I want to keep simulation time and MSD time descriptors separate.

arm61 avatar Jul 04 '24 15:07 arm61

My issue with frame_step is that, to me, it reads as a number of steps (basically step_skip) rather than a timeframe. Maybe @jd15489's suggestion of frame_time would make more sense?

osoulas avatar Jul 04 '24 16:07 osoulas

  1. timestep meaning time per integration step is standard MD terminology.
  2. step skip (number of steps skipped per output frame) seems clear.

Maybe we are rediscovering why pymatgen splits these into two arguments?

bjmorgan avatar Jul 04 '24 16:07 bjmorgan

I would argue for keeping these separate, but potentially renaming step_skip. If you run a MD simulation, these are separate inputs with different units. As a user, I always think about my output frequency in units of timesteps, not in units of simulation time ( = timestep × output frequency)

bjmorgan avatar Jul 04 '24 17:07 bjmorgan

Based on the number of confused user questions about this problem (95 % of user problems have been around this), I think you might be alone in that @bjmorgan.

arm61 avatar Jul 04 '24 17:07 arm61

From @bjmorgan:

I agree with the principle that we just multiply these numbers together, but, this means the user needs to remember to multiply together two numbers from their MD input (which they have already shown they don’t understand) and put that number into kinisi. So the current approach you just use the numbers straight from the MD setup instead of doing the multiplication yourself as an extra step.

I think this is a good argument.

arm61 avatar Jul 04 '24 17:07 arm61

I agree that this is a good argument.

What do MD packages call these things? LAMMPS calls these: timestep and N. DLPOLY calls them: timestep and timestep interval. VASP calls them: POTIM (or time step) and NBLOCK.

jd15489 avatar Jul 04 '24 18:07 jd15489

Returning to this … an additional comment would be that it is not uncommon to have datasets where the MD timestep is consistent, but the output frequency changes (maybe you started with a shorter output frequency, increased it later, and do not want to rerun or post-process your original trajectory data). With the existing separation of these two inputs, it is transparent how changes in MD set-up map to changes in kinisi input parameters. Combining these parameters obfuscates how the combined parameter is linked to the MD data. (output files may not report output frequency in time units, but instead will report step frequency, or leave the user to pull this information from the MD input).

bjmorgan avatar Jul 30 '24 12:07 bjmorgan

I have an implementation of kinisi that uses scipp and therefore requires the user to set the values as:

time_step = 2.0 * sc.Unit('fs')
step_skip = 50 * sc.Unit('dimensionless')

I think that this would help with this confusion. But this won't be ready until kinisi version 2.

arm61 avatar Jul 30 '24 12:07 arm61

Documentation for scipp: https://scipp.github.io (xarray with units and variances).

arm61 avatar Jul 30 '24 12:07 arm61

In my opinion, requiring scipp makes usage more complicated, and only users who already understand the difference between MD timestep and write frequency will understand what this is doing.

bjmorgan avatar Jul 30 '24 12:07 bjmorgan

The logic is that declaring the units forces the user to think more about what they are doing and, therefore, understand it.

arm61 avatar Jul 30 '24 12:07 arm61

We want everyone to understand the difference between MD timestep and write frequency.

arm61 avatar Jul 30 '24 12:07 arm61

And forcing units is aligned with the general principle for scientific code that you want your code to be explicit rather than implicit.

bjmorgan avatar Jul 30 '24 12:07 bjmorgan

Yup. There are additional benefits to the use of scipp (e.g., generally explicit unit handling, objects that hold variances alongside the values, etc.)

arm61 avatar Jul 30 '24 12:07 arm61

This is being fixed in the scipp branch.

arm61 avatar Sep 21 '24 11:09 arm61