[WIP] Revision to Thermo target

Open leeping opened this issue 10 years ago • 1 comments

Progress and notes:

  • Parser for new data format (100% done)
    • Multiple files will be read into a single DataFrame.
    • The "system index" specifies an experimental data set and corresponding simulations (topology, initial conditions, simulation settings and thermodynamic ensemble).
  • Create Observable and Simulation objects from user input (100% done)
    • A single system index may correspond to multiple simulations to be executed.
    • For example, simulating the heat of vaporization would require gas and liquid simulations.
    • Simulating the density would also require running the liquid simulation.
    • Parallelize across system indices and independent initial conditions.
    • Can also parallelize across multiple simulations within a system index if desired (currently performed as a chain).
    • Some observables are not uniquely mapped to simulations (e.g. density can come from a liquid or a solid).
    • Furthermore, the required simulations are not determined automatically from the user-specified observables, because the method for calculating the observable depends on the type of simulation (e.g. compressibility may be calculated for liquids, solids, and bilayers).
    • Thus, the input file must specify both the observables to be calculated and the simulations to be run.
    • Restriction: An error is raised if more than one of the specified simulations can calculate a given observable. Thus, if the density is specified as an observable, either the liquid or the solid simulation must be specified, but not both.
    • How to make this more flexible in the future? Perhaps the column heading can contain the system name, such as solid_density or liquid_density.
    • In order to calculate some timeseries (e.g. deuterium order parameter), the Observable class needs to pass some information to the Simulation. Need to figure out how to do this right.
  • Specify all simulation options in input file parser (50% done)
    • Default settings may apply to all simulations (e.g. eq_steps, md_steps, timestep).
    • If initial conditions are specified in the input file, they should override the default search path for initial coordinate files.
  • Time series class; Split get_timeseries() from molecular_dynamics() (60% done)
    • Represents a time series of instantaneous observables; possibly subclass DataFrame.
    • OpenMM saves observables to memory as the simulation is run, so the names of needed timeseries must be saved as Engine attributes.
    • On the other hand, GROMACS generates all observables in a post-processing step, so the names of needed timeseries don't need to be stored.
    • New observables may require new timeseries to be implemented here.
    • Certain timeseries may only be available for some engines (e.g. quantum kinetic energy estimator from OpenMM).
  • Run simulations and save time series to disk, using either md_chain.py (a chain of simulations for a particular system index) or md_one.py (independent simulations) (50% done)
    • Replacement for npt.py and npt_lipid.py.
    • Should md_chain.py and md_one.py use the same file and directory structure? Need to make sure output from md_one.py is properly named - or put results from md_one.py into different folders.
    • Energy / dipole derivatives are calculated here, also as a time series.
  • Apply MBAR estimator for grouped system indices
    • Applying MBAR estimator across system indices with different molecules makes no sense.
  • Calculate observables from time series. (25% done)
    • Store a dictionary of time series, keyed by the system index and the simulation name.
    • Formulas for calculating observables and their derivatives from time series are implemented here.
    • Observables may require time series from multiple simulations (e.g. heat of vaporization).
    • Observables will still be calculated if experimental data is missing (because it's nice to have a full table of predicted values), but they won't go into the objective function.
    • If experimental data is very sparse, then we shouldn't put it in the same Target anyway.
  • Multiple independent initial conditions (50% done)
    • How to organize? I propose targets/target_name/system_index/simulation_name_#.[gro|pdb|xyz] numbered from 1. Multiple files are best because PDB format often doesn't update the periodic box across different structures.
    • If only one initial condition, then _# not needed.
  • The remote scripts md_one.py and md_chain.py should have ways to calculate all observables that they are able to calculate (as an additional way to check consistency)
  • Map abbreviated units to full units
  • XML format parser
  • Added unit tests
    • Read multiple ways of specifying lipid data and check that the data tables are the same.
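
The restriction on observables and simulations described above could be checked roughly as follows. This is only a minimal sketch: the table `ALTERNATIVES`, the function `resolve_simulations`, and the short names (`hvap`, `liquid`, `gas`, `solid`, `bilayer`) are hypothetical and do not come from the ForceBalance source. Each observable lists the alternative sets of simulations that can produce it, and exactly one alternative must be satisfied by the simulations named in the input file.

```python
# Hypothetical table (not the ForceBalance API): for each observable, the
# alternative sets of simulations that are able to calculate it.
ALTERNATIVES = {
    "density": [frozenset({"liquid"}), frozenset({"solid"})],
    "hvap": [frozenset({"gas", "liquid"})],  # needs both simulations
    "compressibility": [
        frozenset({"liquid"}),
        frozenset({"solid"}),
        frozenset({"bilayer"}),
    ],
}


def resolve_simulations(observables, simulations):
    """Map each observable to the one set of simulations that calculates it.

    Raises ValueError if no specified simulation can calculate an observable,
    or if more than one alternative is satisfied (the ambiguous case).
    """
    requested = set(simulations)
    resolved = {}
    for obs in observables:
        if obs not in ALTERNATIVES:
            raise KeyError(f"unknown observable: {obs}")
        # Keep only the alternatives fully covered by the requested simulations.
        matches = [alt for alt in ALTERNATIVES[obs] if alt <= requested]
        if not matches:
            raise ValueError(f"no specified simulation can calculate {obs}")
        if len(matches) > 1:
            names = sorted("+".join(sorted(m)) for m in matches)
            raise ValueError(f"{obs} is ambiguous between: {', '.join(names)}")
        resolved[obs] = matches[0]
    return resolved


if __name__ == "__main__":
    # Density comes from the liquid here because no solid simulation is listed.
    print(resolve_simulations(["density", "hvap"], ["liquid", "gas"]))
```

Encoding the simulation name in the column heading (solid_density, liquid_density) would remove the need for the ambiguity check, since each column would then name its own alternative.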
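
The proposed layout for multiple initial conditions, targets/target_name/system_index/simulation_name_#.[gro|pdb|xyz], can be illustrated with a small helper. This is a sketch under assumptions: the function name `initial_condition_paths` and its arguments are hypothetical, and the example target and system names are made up. It numbers files from 1 and drops the _# suffix when there is only one initial condition.

```python
from pathlib import PurePosixPath

# Coordinate file extensions mentioned in the proposal.
ALLOWED_EXTENSIONS = ("gro", "pdb", "xyz")


def initial_condition_paths(target_name, system_index, simulation_name,
                            n_conditions, ext="gro"):
    """Return the expected coordinate file paths for one simulation.

    Hypothetical helper: files are numbered from 1, and the numeric suffix
    is omitted when only a single initial condition exists.
    """
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported extension: {ext}")
    if n_conditions < 1:
        raise ValueError("at least one initial condition is required")
    base = PurePosixPath("targets") / target_name / str(system_index)
    if n_conditions == 1:
        return [str(base / f"{simulation_name}.{ext}")]
    return [
        str(base / f"{simulation_name}_{i}.{ext}")
        for i in range(1, n_conditions + 1)
    ]


if __name__ == "__main__":
    for path in initial_condition_paths("water", "298K", "liquid", 3, "pdb"):
        print(path)
```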

leeping, Apr 04 '14 08:04

Hi Lee-Ping,

This looks very promising! I will be travelling for a week, but I will take a close look when I am back at work.

Best, Erik

ebran, Apr 05 '14 11:04