dpgen

add fp_async_check_ratio param to enable async check for fp jobs

Open shazj99 opened this issue 4 years ago • 0 comments

While running dpgen jobs, we found that the FP phase takes up a large part of each iteration. This is because the DFT calculations for some candidates are very hard and time-consuming, and we have to wait for this long tail to finish before moving on to the next iteration. We found that the proportion of such candidates is very small, possibly less than 1%.

So we add a new param fp_async_check_ratio to optimize the execution: if 99% of the candidates have finished the FP step and that has been the case for 600 seconds, we hand the job of checking the remaining candidates to an async process. The data from those remaining candidates will be skipped in the upcoming training step, but we add it back in the next iteration. Those candidates may be lost for some reason, such as an unexpected error or the lack of retries in the async process, but we think this is acceptable because the ratio is very small.
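A minimal sketch of the decision described above, for illustration only. It is not the code from this change: the function name `should_go_async`, its arguments, and the way the 600 seconds are tracked (a timestamp recorded when the ratio was first reached) are all assumptions.

```python
from typing import Optional

GRACE_SECONDS = 600  # value taken from the description; assumed fixed here


def should_go_async(n_finished: int, n_total: int, ratio: float,
                    reached_at: Optional[float], now: float,
                    grace_seconds: float = GRACE_SECONDS) -> bool:
    """Hypothetical helper: decide whether the remaining FP checks
    can be handed to an async process.

    reached_at is the time at which the finished ratio first reached
    `ratio`, or None if it has not been reached yet.
    """
    if n_total == 0 or n_finished >= n_total:
        return False  # nothing is left to defer
    if n_finished / n_total < ratio:
        return False  # too many candidates are still running
    if reached_at is None:
        return False  # the grace period has not started yet
    # Defer only after the ratio has held for the whole grace period.
    return now - reached_at >= grace_seconds
```

With `ratio=0.99`, 995 finished candidates out of 1000 would be deferred once 600 seconds have passed since the threshold was reached, while 980 out of 1000 would keep the iteration waiting.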

The following are our test numbers, which show a significant time saving: image

Usage (add to param.json):
"fp_async_check_ratio": 0.99,
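
As a rough illustration of how such a key could be read, here is a sketch that parses the param.json text and returns the ratio. The helper name `read_async_check_ratio`, the "missing key means the async check is off" behaviour, and the range check are assumptions, not something this change confirms.

```python
import json
from typing import Optional


def read_async_check_ratio(param_text: str) -> Optional[float]:
    """Hypothetical helper: return fp_async_check_ratio from the
    param.json text, or None when the key is absent."""
    jdata = json.loads(param_text)
    ratio = jdata.get("fp_async_check_ratio")
    if ratio is None:
        return None  # assumed: async check stays disabled
    ratio = float(ratio)
    # Assumed sanity check: a ratio only makes sense in (0, 1].
    if not 0.0 < ratio <= 1.0:
        raise ValueError(f"fp_async_check_ratio out of range: {ratio}")
    return ratio
```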

Change-Id: I448708b6b29635d172f40bf61b7c0a6397832b5b

shazj99 Oct 26 '21 10:10