fdb-joshua

Collect all test results and handle cancelled tests properly

Open sfc-gh-kmakino opened this issue 4 years ago • 4 comments

This PR addresses 3 issues:

  • When multiple tests start simultaneously, the number of started tests can exceed max_runs. In this scenario, the agent should wait for all tests to complete, rather than stopping at max_runs and ignoring the jobs that are still running
  • When agents die, or are considered dead because they have stopped sending heartbeats, we should ignore the cancelled tests, since we don't know their results
  • When enough agents are running to serve the available ensembles, the other agents should time out
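
The first two points can be illustrated with a minimal sketch. This is not the code from this PR: `Outcome`, `Tally`, `tally_results`, and `ensemble_done` are hypothetical names chosen for illustration, and the completion rule is an assumption based only on the description above.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    """Hypothetical per-test states, for illustration only."""
    PASSED = "passed"
    FAILED = "failed"
    CANCELLED = "cancelled"  # agent died or stopped heartbeating
    RUNNING = "running"


@dataclass
class Tally:
    passed: int = 0
    failed: int = 0
    ignored: int = 0  # cancelled tests: result unknown, so not counted

    @property
    def ended(self) -> int:
        # Only tests with a known result count as ended.
        return self.passed + self.failed


def tally_results(outcomes):
    """Count known results and set cancelled tests aside."""
    tally = Tally()
    for outcome in outcomes:
        if outcome is Outcome.PASSED:
            tally.passed += 1
        elif outcome is Outcome.FAILED:
            tally.failed += 1
        elif outcome is Outcome.CANCELLED:
            tally.ignored += 1
        # RUNNING tests are not tallied yet.
    return tally


def ensemble_done(started, tally, max_runs):
    """Assumed rule: done only once max_runs tests have started AND
    every started test has either ended or been cancelled."""
    accounted = tally.ended + tally.ignored
    return started >= max_runs and accounted >= started
```

With `max_runs=3` and four tests started at once, this sketch keeps the ensemble open until the fourth test either reports a result or is cancelled.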

sfc-gh-kmakino, Oct 06 '21 23:10

Given that the unit tests failed (in the GitHub Actions build), I think this needs another look.

ammolitor, Oct 07 '21 16:10

@ammolitor Can you tell what failed? build.sh works totally fine locally here.

sfc-gh-kmakino, Oct 07 '21 16:10

@ammolitor CI passed. It would be great if you could take another quick look. Thanks!

sfc-gh-kmakino, Oct 08 '21 06:10

I just realized the scaler needs to be aware of this change. Converting this to a draft for now.

sfc-gh-kmakino, Oct 13 '21 15:10