dpgen
[Feature Request] Use HDF5 instead of numpy files
Summary
For DeePMD-kit training, when there are large numbers of systems, use HDF5 instead of NumPy files.
Detailed Description
When there are large numbers of systems, transferring many small NumPy files to a supercomputer cluster with poor I/O performance takes a long time. An HDF5 file can store multiple arrays, so it is faster to transfer. Test results confirm this behavior.
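To illustrate the idea, here is a minimal sketch of packing the arrays of several systems into one HDF5 file with `h5py`. The helper `pack_systems`, the system names, and the array names `coord` and `energy` are hypothetical and chosen only for the example; this is not how dpgen or DeePMD-kit lays out its data.

```python
import os
import tempfile

import h5py
import numpy as np


def pack_systems(systems, h5_path):
    """Write every array of every system into a single HDF5 file.

    `systems` maps a system name to a dict of array name -> ndarray.
    Each system becomes an HDF5 group; each array becomes a dataset.
    (Hypothetical helper, for illustration only.)
    """
    with h5py.File(h5_path, "w") as f:
        for sys_name, arrays in systems.items():
            group = f.create_group(sys_name)
            for arr_name, arr in arrays.items():
                group.create_dataset(arr_name, data=arr)


# Toy data (assumption): 3 systems, 5 frames each, 2 atoms per frame.
rng = np.random.default_rng(0)
systems = {
    f"system_{i}": {
        "coord": rng.random((5, 6)),
        "energy": rng.random(5),
    }
    for i in range(3)
}

h5_path = os.path.join(tempfile.mkdtemp(), "data.hdf5")
pack_systems(systems, h5_path)

# One file now holds what would otherwise be six separate .npy files.
with h5py.File(h5_path, "r") as f:
    names = sorted(f.keys())
    coord_shape = f["system_1/coord"].shape
```

With this layout only one file has to be copied to the cluster, regardless of how many systems there are.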
Further Information, Files, and Links
deepmodeling/deepmd-kit#1163
Do you assume that we can load all systems into memory? How should we handle the case where the data size is larger than the available memory?
HDF5 does not read the whole file into memory. See https://stackoverflow.com/a/40460400
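A small sketch of that behavior with `h5py`: opening a dataset returns a handle, and only the slice that is indexed gets read into a NumPy array. The file name, dataset name `coord`, and chunk shape are assumptions made up for this example.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "frames.hdf5")

# Write 1000 frames of 6 values each, stored in chunks of 100 frames.
with h5py.File(path, "w") as f:
    data = np.arange(1000 * 6, dtype=np.float64).reshape(1000, 6)
    f.create_dataset("coord", data=data, chunks=(100, 6))

with h5py.File(path, "r") as f:
    dset = f["coord"]  # a dataset handle; the array data is not loaded yet
    is_handle = isinstance(dset, h5py.Dataset)

    # Slicing reads only the requested frames from disk into memory.
    batch = dset[200:210]
    batch_is_array = isinstance(batch, np.ndarray)
```

A training loop could read one batch at a time this way, so the full data set never has to fit in memory.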