how to implement data object

Open yuanqing-wang opened this issue 4 years ago • 1 comments

Let's discuss here how we should design the dataset object to host training data that contains:

  • molecular graph
  • coordinates

Ideally, we want to be able to load large datasets while supporting shuffling, batching, and sampling. Note that there are many tricks we can use to sample the dataset, since it's naturally partitioned by molecular graph.

(A subclass of torch.utils.data.DataLoader would be nice, although we would need to hack it to make it compatible with dgl.DGLGraph.)

Also, I think we need to consider the possibility of distributing it across machines. (We can ignore this for now, but if we can make it compatible with torch.nn.DataParallel, large-scale training would be faster.)
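As a starting point for the discussion, here is a minimal sketch of a container that exploits the partitioning by molecular graph: every batch holds snapshots of a single graph, so the graph only needs to be processed once per batch. The names `MoleculeRecord` and `GraphPartitionedDataset` are hypothetical, the graph is an opaque placeholder rather than a real DGL object, and nothing here depends on torch.

```python
import random
from dataclasses import dataclass, field
from typing import Any, Iterator, List, Tuple


@dataclass
class MoleculeRecord:
    """One molecular graph plus every coordinate snapshot recorded for it."""
    graph: Any  # placeholder; in practice this might be a DGL graph
    snapshots: List[Any] = field(default_factory=list)


class GraphPartitionedDataset:
    """Hypothetical container: snapshots are grouped by the graph they belong to."""

    def __init__(self, records: List[MoleculeRecord]):
        self.records = records

    def __len__(self) -> int:
        # Total number of snapshots across all graphs.
        return sum(len(r.snapshots) for r in self.records)

    def batches(
        self, batch_size: int, shuffle: bool = True, seed: int = 0
    ) -> Iterator[Tuple[Any, List[Any]]]:
        """Yield (graph, snapshots) pairs; every batch shares a single graph."""
        rng = random.Random(seed)
        chunks = []
        for record in self.records:
            xs = list(record.snapshots)
            if shuffle:
                rng.shuffle(xs)  # shuffle snapshots within one graph
            for i in range(0, len(xs), batch_size):
                chunks.append((record.graph, xs[i:i + batch_size]))
        if shuffle:
            rng.shuffle(chunks)  # shuffle the order in which graphs are visited
        yield from chunks
```

Shuffling happens at two levels (within a graph, then across batches), which is one of the sampling tricks the partitioning makes possible. Whether this should instead be expressed as a custom sampler or collate function for a DataLoader is left open.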

Regarding the energy terms in the dataset, for QM datasets, we would need to include QM energies.

@maxentile argued that we should also include terms in the factor graph, namely bond, angle, and non-bonded energy.

I'd suggest that, for MM datasets, for the sake of simplicity, we omit these terms, since we would have MM energy functions anyway to compute:

u_g = u(g, x)

and thus we can have, in the training stage, depending on what target we're fitting, either

loss = loss_fn(g_ref, g_hat)

or

loss = loss_fn(u(g_ref, x), u(g_hat, x))

which I think is easier to debug.
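To make the two options concrete, here is a rough sketch that assumes `g` carries only harmonic bond parameters and that `loss_fn` is a mean squared error. The `BondParams` layout, `mse`, `parameter_loss`, and `energy_loss` are illustrative names, not anything that exists in the codebase.

```python
import math
from typing import Dict, Sequence, Tuple

# Hypothetical parameter container: one (k, r0) pair per bond, keyed by atom indices.
BondParams = Dict[Tuple[int, int], Tuple[float, float]]
Coords = Sequence[Tuple[float, float, float]]


def u(g: BondParams, x: Coords) -> float:
    """Harmonic bond energy: sum over bonds of 0.5 * k * (r - r0)**2."""
    energy = 0.0
    for (i, j), (k, r0) in g.items():
        r = math.dist(x[i], x[j])
        energy += 0.5 * k * (r - r0) ** 2
    return energy


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error, standing in for loss_fn."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def parameter_loss(g_ref: BondParams, g_hat: BondParams) -> float:
    """loss_fn(g_ref, g_hat): compare parameters directly."""
    keys = sorted(g_ref)
    ref = [v for key in keys for v in g_ref[key]]
    hat = [v for key in keys for v in g_hat[key]]
    return mse(ref, hat)


def energy_loss(g_ref: BondParams, g_hat: BondParams, xs: Sequence[Coords]) -> float:
    """loss_fn(u(g_ref, x), u(g_hat, x)): compare energies over snapshots."""
    return mse([u(g_ref, x) for x in xs], [u(g_hat, x) for x in xs])
```

Both losses take the same `g_ref` and `g_hat`, so the dataset only has to store reference parameters and snapshots; the energies are recomputed on the fly.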

yuanqing-wang, May 18 '20 19:05

I think we need at least a couple different dataset types, for the different tasks:

  • Interaction-type classification tasks (control)
  • Energy regression tasks
    • Fitting to QM
    • Fitting to MM (control)

@maxentile argued that we should also include terms in the factor graph, namely bond, angle, and non-bonded energy.

Sorry for the lack of clarity: for the "recovering MM model from snapshots and energies" task, I proposed including 4 components in total per snapshot (sum over bond terms, sum over angle terms, sum over torsion terms, sum over nonbonded terms), not one component per term in the factor graph (https://github.com/choderalab/espaloma/issues/2#issuecomment-629440754). This doesn't preclude using the total energy as a regression target: sum these 4 components and use the result as the target. Splitting it into a few components was intended to let us construct simpler and more diagnostic tasks if needed. (Can we recover HarmonicBondForce from (snapshot, energy) pairs when that's really the only force present in the system? This should be easier than recovering the whole force field at once. Is our implementation of the potential energies consistent with OpenMM for valence terms, but not for nonbonded terms? Here are a bunch of test cases.) I'm not proposing that we try an analogous decomposition for QM tasks.
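A sketch of what the per-snapshot record could look like, assuming one float per component. The class name `EnergyComponents` and the `valence` helper are hypothetical, and units are left to whoever fills in the values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnergyComponents:
    """Per-snapshot MM energy split into four sums (hypothetical layout)."""
    bond: float
    angle: float
    torsion: float
    nonbonded: float

    @property
    def total(self) -> float:
        """Total energy, usable directly as a regression target."""
        return self.bond + self.angle + self.torsion + self.nonbonded

    @property
    def valence(self) -> float:
        """Bond + angle + torsion, for checking valence terms separately."""
        return self.bond + self.angle + self.torsion
```

Storing the four sums keeps the total recoverable while still allowing narrower diagnostic targets, such as fitting only the bond component.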

For the interaction-type classification task (no coordinates), I think we need a discrete label for each individual interaction term. For example, in SMIRNOFF it is not the case that each torsion type is determined by the vdW atom types of its component atoms. (I may have missed it if we decided not to try to reproduce bond/angle/torsion types from SMIRNOFF.)
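One way to picture the per-term labels is a mapping from atom-index tuples to a discrete type, with a small check for whether torsion labels could have been derived from atom types alone. This is only a sketch: `InteractionLabels` and `torsions_determined_by_atom_types` are hypothetical names, and the labels in any usage (such as "a1" or "t1") would be made-up placeholders, not real SMIRNOFF parameter IDs.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class InteractionLabels:
    """Discrete type labels, one per interaction term (hypothetical layout)."""
    atoms: Dict[int, str] = field(default_factory=dict)
    bonds: Dict[Tuple[int, int], str] = field(default_factory=dict)
    angles: Dict[Tuple[int, int, int], str] = field(default_factory=dict)
    torsions: Dict[Tuple[int, int, int, int], str] = field(default_factory=dict)


def torsions_determined_by_atom_types(labels: InteractionLabels) -> bool:
    """Return False if two torsions share atom types but carry different labels."""
    seen: Dict[Tuple[str, ...], str] = {}
    for idxs, torsion_type in labels.torsions.items():
        key = tuple(labels.atoms[i] for i in idxs)
        key = min(key, key[::-1])  # a torsion read backwards is the same torsion
        if seen.setdefault(key, torsion_type) != torsion_type:
            return False
    return True
```

If this check returns False for a labeled molecule, atom-type labels alone are not a sufficient target, and the dataset needs the per-term labels.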

maxentile, May 18 '20 19:05