Multiple Training / Validation Datasets 🌟 [FEATURE]

Open tgmaxson opened this issue 7 months ago • 3 comments

Is your feature request related to a problem? Please describe.

We often need to train models on partial datasets, each kept in a separate file, as well as on combinations of those datasets. For example, consider a simple case with three files:

  • water-only.traj
  • water-NaCl.traj
  • water-KCl.traj

Ideally, we should be able to read these files independently in NequIP and sample from them as if they were one file. Creating a merged file for each combination quickly becomes unwieldy and expensive in terms of disk space. The cached datasets also have to be regenerated and stored for each merged file.

Describe the solution you'd like

A simple extension to the dataloader syntax that accepts a list of filenames, not just a single filename. The data would then be lumped together and used as normal. From the perspective of the "ase" dataloader, this just involves concatenating the frames read from multiple ASE files. Alternatively, ASE could be extended to read multiple files, potentially from a specialized filename, but I suspect that would get pushback from the devs (and would not result in proper caching on nequip's end).

dataset_file_name: /mnt/public/tgmaxson/datasets/7-4-24/train.traj # Single filename

or

dataset_file_name: # Multiple filenames
  - /mnt/public/tgmaxson/datasets/7-4-24/train.traj
  - /mnt/public/tgmaxson/datasets/7-2-24/train.traj
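
To make the request concrete, here is a minimal sketch of how a loader could accept either form of `dataset_file_name` and concatenate the frames. This is not NequIP's implementation: the helper names `normalize_file_names`, `read_with_ase`, and `load_frames` are hypothetical and exist only for illustration. The only real library call is `ase.io.read(file_name, index=":")`, which returns all frames of a file as a list.

```python
from typing import Callable, List, Sequence, Union


def normalize_file_names(value: Union[str, Sequence[str]]) -> List[str]:
    """Accept a single filename or a list of filenames; always return a list."""
    if isinstance(value, str):
        return [value]
    return list(value)


def read_with_ase(file_name: str) -> list:
    """Read every frame of one file with ASE (imported lazily so the
    rest of the sketch works without ASE installed)."""
    import ase.io

    return ase.io.read(file_name, index=":")


def load_frames(
    dataset_file_name: Union[str, Sequence[str]],
    read_fn: Callable[[str], list] = read_with_ase,
) -> list:
    """Concatenate the frames of all listed files, in the order given."""
    frames = []
    for file_name in normalize_file_names(dataset_file_name):
        frames.extend(read_fn(file_name))
    return frames
```

With this shape, the single-filename config keeps working unchanged, and the combined frames can be sampled as if they came from one file. Caching is not addressed here; a real implementation would presumably need to key the cache on the full list of files.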

tgmaxson, Jul 08 '24 18:07