[Feature Request] Use HDF5 instead of numpy files

Open njzjz opened this issue 3 years ago • 2 comments

Summary

For DeePMD-kit training with a large number of systems, use HDF5 files instead of NumPy files.

Detailed Description

When there are many systems, transferring a large number of small NumPy files to a supercomputer cluster with poor I/O performance takes a long time. An HDF5 file can store multiple arrays, so it is faster to transfer. Test results reproduce this behavior.
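
To illustrate the idea, here is a minimal sketch of packing a directory tree of `.npy` files into a single HDF5 file with `h5py`. The function name `pack_npy_to_hdf5` and the one-dataset-per-file layout are assumptions for illustration only, not the format DeePMD-kit or dpgen actually uses.

```python
import os

import h5py
import numpy as np


def pack_npy_to_hdf5(root_dir, h5_path):
    """Copy every .npy file under root_dir into one HDF5 file.

    Hypothetical helper: each file becomes a dataset whose name is the
    file's relative path without the .npy suffix. Returns the number of
    datasets written.
    """
    count = 0
    with h5py.File(h5_path, "w") as f:
        for dirpath, _dirnames, filenames in os.walk(root_dir):
            for name in sorted(filenames):
                if not name.endswith(".npy"):
                    continue
                full_path = os.path.join(dirpath, name)
                rel_path = os.path.relpath(full_path, root_dir)
                # HDF5 uses "/" as the group separator on every platform.
                key = rel_path[: -len(".npy")].replace(os.sep, "/")
                # Intermediate groups are created automatically.
                f.create_dataset(key, data=np.load(full_path))
                count += 1
    return count
```

With this layout, many small files collapse into one file to transfer, while each array stays addressable by its original relative path.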

Further Information, Files, and Links

deepmodeling/deepmd-kit#1163

njzjz avatar Dec 12 '21 08:12 njzjz

Do you assume that we can load all systems into memory? How should we handle the case where the data size is larger than memory?

wanghan-iapcm avatar Jan 25 '22 01:01 wanghan-iapcm

HDF5 does not read the whole file into memory. See https://stackoverflow.com/a/40460400
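
As a rough sketch of that point: with `h5py`, opening a dataset returns a handle, and only the slices that are indexed get read from disk. The generator name `iter_batches` and its parameters are hypothetical and are not part of DeePMD-kit.

```python
import h5py
import numpy as np


def iter_batches(h5_path, dataset_name, batch_size):
    """Yield consecutive row batches of a dataset as NumPy arrays.

    Hypothetical helper: only one batch is held in memory at a time.
    """
    with h5py.File(h5_path, "r") as f:
        dset = f[dataset_name]  # a handle; no array data is loaded here
        for start in range(0, dset.shape[0], batch_size):
            # Slicing reads just these rows from the file.
            yield dset[start : start + batch_size]
```

Iterating over `iter_batches(path, name, batch_size)` keeps peak memory near one batch rather than the whole dataset, so the data size can exceed the available memory.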

njzjz avatar Jan 25 '22 11:01 njzjz