espaloma
AlkEthOH interaction-typing task
Add datasets for tasks of classifying atom/bond/angle/torsion types for molecules in AlkEthOH rings set, and provide simple rule-based baselines for each task.
Done:
- [x] Add script to download AlkEthOH rings dataset, label atom- and interaction-types using OpenFF 1.0.0 Parsley forcefield, and save discrete labels. (Tracked using Git LFS.)
- [x] Add PyTorch Dataset interfaces to `AlkEthOH{Atom|Bond|Angle|Torsion}TypesDataset`s
Todo:
- [x] Update paths from `hfgp` to `espaloma`
- [x] Update tests, make sure resources can be found (currently using relative paths, should use `pkg_resources.resource_filename`s)
- [ ] Discuss with @yuanqing-wang whether PyTorch Dataset interface is suitable, make adjustments
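A minimal sketch of the resource lookup mentioned in the todo list, assuming setuptools' `pkg_resources` is importable. The helper name `packaged_path`, the `importlib.resources` fallback, and the use of the standard-library `json` package as a stand-in are assumptions for illustration only; in espaloma the call would name the package and data file that are actually shipped.

```python
import importlib.resources
import os

try:
    # setuptools helper named in the todo item
    from pkg_resources import resource_filename
except Exception:  # pkg_resources missing or unusable on this interpreter
    resource_filename = None


def packaged_path(package, relative_path):
    """Return a filesystem path to a file shipped inside an importable package."""
    if resource_filename is not None:
        return resource_filename(package, relative_path)
    # Fallback (an assumption, Python 3.9+): resolve against the package directory.
    return str(importlib.resources.files(package).joinpath(relative_path))


# Stand-in lookup: a file that ships with the standard-library json package.
path = packaged_path("json", "decoder.py")
print(os.path.isfile(path))
```

Unlike a relative path, this lookup does not depend on the directory the tests are launched from.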
> Add script to download AlkEthOH rings dataset
What's the rationale behind just using the rings? Limiting the number of compounds?
> What's the rationale behind just using the rings? Limiting the number of compounds?
No particular rationale, just a starting point. The intention is still to also use the AlkEthOH chains set and the other sets listed in https://github.com/choderalab/espaloma/issues/2#issuecomment-629440754 and https://github.com/choderalab/espaloma/issues/2#issuecomment-629519708.
Hmm, although the PyTorch views in https://github.com/choderalab/espaloma/blob/973d5e1de00b60390b93a054c4277db632569b04/espaloma/data/alkethoh/pytorch_datasets.py satisfy the PyTorch `Dataset` interface, they don't yet play nicely with `DataLoader`.
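For example,
```python
import torch
from espaloma.data.alkethoh.pytorch_datasets import AlkEthOHAtomTypesDataset

dataset = AlkEthOHAtomTypesDataset()
loader = torch.utils.data.DataLoader(dataset, batch_size=10, shuffle=True)
next(iter(loader))  # collation happens here, when the first batch is drawn
```
runs into this error:
```
TypeError: default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found <class 'openforcefield.topology.molecule.Molecule'>
```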
Possible workarounds:
- define a non-default collate function that handles OpenFF `Molecule`s
- replace OpenFF `Molecule` with something that `default_collate` knows what to do with, such as a dict or a DGL graph containing similar information
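The first workaround could look roughly like the sketch below. It assumes each dataset item is a tuple whose fields (for example a molecule and its labels) only need to be grouped per field. That item layout, the names `collate_as_lists` and `FakeMolecule`, and the label values are hypothetical and not taken from the espaloma code.

```python
def collate_as_lists(batch):
    """Group a batch field by field without converting anything to tensors.

    `batch` is the list of items a DataLoader hands to its collate_fn.
    Assumption: every item is a tuple with the same number of fields.
    """
    # zip(*batch) transposes [(mol_a, y_a), (mol_b, y_b)] into
    # [(mol_a, mol_b), (y_a, y_b)]; each field is then kept as a plain list.
    return tuple(list(field) for field in zip(*batch))


class FakeMolecule:
    """Hypothetical stand-in for an OpenFF Molecule, used only in this demo."""

    def __init__(self, name):
        self.name = name


# Made-up items: (molecule, per-atom type labels of varying length).
demo_batch = [
    (FakeMolecule("ethanol"), [0, 1, 1]),
    (FakeMolecule("oxetane"), [2, 0]),
]
molecules, labels = collate_as_lists(demo_batch)
print([m.name for m in molecules])  # ['ethanol', 'oxetane']
print(labels)                       # [[0, 1, 1], [2, 0]]
```

Passing `collate_fn=collate_as_lists` to `torch.utils.data.DataLoader` would replace `default_collate`, so the molecules are never inspected by PyTorch. Any padding or tensor conversion of the variable-length labels would still have to happen downstream.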
cc @jaimergp @t-kimber on the dataset issue above.