
AlkEthOH interaction-typing task

Open maxentile opened this issue 4 years ago • 4 comments

Add datasets for the tasks of classifying atom/bond/angle/torsion types for molecules in the AlkEthOH rings set, and provide a simple rule-based baseline for each task.

Done:

  • [x] Add script to download AlkEthOH rings dataset, label atom- and interaction-types using OpenFF 1.0.0 Parsley forcefield, and save discrete labels. (Tracked using Git LFS.)
  • [x] Add PyTorch Dataset interfaces to AlkEthOH{Atom|Bond|Angle|Torsion}TypesDatasets

Todo:

  • [x] Update paths from hfgp to espaloma
  • [x] Update tests, make sure resources can be found (currently using relative paths, should use pkg_resources.resource_filename)
  • [ ] Discuss with @yuanqing-wang whether PyTorch Dataset interface is suitable, make adjustments

maxentile, May 28 '20 22:05

Add script to download AlkEthOH rings dataset

What's the rationale behind just using the rings? Limiting the number of compounds?

jchodera, May 28 '20 22:05

What's the rationale behind just using the rings? Limiting the number of compounds?

No rationale, just a starting point. The intention is still to also use the AlkEthOH chains set and the other sets listed in https://github.com/choderalab/espaloma/issues/2#issuecomment-629440754 and https://github.com/choderalab/espaloma/issues/2#issuecomment-629519708.

maxentile, May 28 '20 22:05

Hmm, although the PyTorch views in https://github.com/choderalab/espaloma/blob/973d5e1de00b60390b93a054c4277db632569b04/espaloma/data/alkethoh/pytorch_datasets.py satisfy the PyTorch Dataset interface, they don't yet play nicely with DataLoader.

For example,

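import torch
from espaloma.data.alkethoh.pytorch_datasets import AlkEthOHAtomTypesDataset

dataset = AlkEthOHAtomTypesDataset()
loader = torch.utils.data.DataLoader(dataset, batch_size=10, shuffle=True)
batch = next(iter(loader))  # collation runs when the first batch is fetched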

runs into this error:

TypeError: default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found <class 'openforcefield.topology.molecule.Molecule'>

Possible workarounds:

  • define a non-default collate function that handles OpenFF Molecules
  • replace the OpenFF Molecule with something that default_collate knows how to handle, such as a dict or a DGL graph containing similar information
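
A minimal sketch of the first workaround, under the assumption that each dataset item is a (molecule, labels) pair; the actual item layout in pytorch_datasets.py may differ. The function name collate_molecules and the stand-in batch are hypothetical. Because molecules differ in size, the per-molecule labels cannot be stacked into one tensor anyway, so the sketch returns two plain lists rather than attempting to stack them.

```python
from typing import Any, List, Sequence, Tuple


def collate_molecules(
    batch: Sequence[Tuple[Any, Any]]
) -> Tuple[List[Any], List[Any]]:
    """Collate (molecule, labels) pairs without stacking them.

    Assumption: each item is a (molecule, labels) pair. The molecule
    objects are passed through untouched, so default_collate never
    has to interpret them.
    """
    molecules = [molecule for molecule, _ in batch]
    labels = [label for _, label in batch]
    return molecules, labels


# Stand-in items (hypothetical): SMILES strings play the role of
# OpenFF Molecule objects, and the integer lists play the role of
# per-atom type labels of varying length.
example_batch = [("CCO", [1, 1, 2]), ("C1CCOC1", [1, 1, 1, 2, 1])]
molecules, labels = collate_molecules(example_batch)
```

Such a function would be passed to the loader as torch.utils.data.DataLoader(dataset, batch_size=10, shuffle=True, collate_fn=collate_molecules), after which each batch is a pair of lists. The downstream model is then responsible for turning each molecule into tensors.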

maxentile, Jun 01 '20 19:06

cc @jaimergp @t-kimber on the dataset issue above.

jchodera, Jun 01 '20 21:06