fdb-joshua Collect all test results and handle cancelled tests properly

Open sfc-gh-kmakino opened this issue 4 years ago • 4 comments

This PR addresses 3 issues:

- When multiple tests start simultaneously, the started count can go beyond max_runs. In this scenario, the agent should wait for all tests to complete, rather than stopping at max_runs and ignoring the jobs that are still running.
- When agents die, or are considered dead because they stopped heartbeating, we should ignore the cancelled tests, since we don't know their results.
- When enough agents are running to serve the available ensembles, the other agents should time out.
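
The three rules above can be illustrated with a minimal, standalone sketch. It is not the code from this PR and does not use the joshua API: `EnsembleCounts`, `in_flight`, `can_start_more`, `is_complete`, and `agent_should_idle` are hypothetical names, and the sketch assumes that a cancelled run is simply dropped from the tally so that another run may replace it.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class EnsembleCounts:
    """Hypothetical snapshot of one ensemble's counters (not the real joshua schema)."""
    max_runs: int
    started: int = 0
    passed: int = 0
    failed: int = 0
    cancelled: int = 0  # runs whose agent died or stopped heartbeating

    @property
    def ended(self) -> int:
        # Only runs with a known result count as ended; cancelled runs are ignored.
        return self.passed + self.failed

    @property
    def in_flight(self) -> int:
        # Runs that were started and may still report a result.
        return self.started - self.cancelled - self.ended


def can_start_more(c: EnsembleCounts) -> bool:
    # Stop handing out work once finished plus running jobs cover max_runs.
    return c.ended + c.in_flight < c.max_runs


def is_complete(c: EnsembleCounts) -> bool:
    # Wait for every running job, even if started went beyond max_runs.
    return c.ended >= c.max_runs and c.in_flight == 0


def agent_should_idle(ensembles: Iterable[EnsembleCounts]) -> bool:
    # Assumption: an extra agent has nothing to do when no ensemble can
    # accept another run, so it lets its idle timeout expire.
    return not any(can_start_more(c) for c in ensembles)


# Two extra runs started concurrently; one earlier run was cancelled.
racing = EnsembleCounts(max_runs=10, started=12, passed=8, failed=1, cancelled=1)
print(racing.ended, racing.in_flight, is_complete(racing))
```

In this sketch the ensemble with `started=12` is not complete while two jobs are still running, and an agent that sees only such ensembles would idle instead of starting more work.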

Oct 06 '21 23:10 sfc-gh-kmakino

Given that the unit tests failed (in the GitHub Actions build), I think this needs another look.

Oct 07 '21 16:10 ammolitor

@ammolitor Can you tell what failed? build.sh works totally fine locally here.

Oct 07 '21 16:10 sfc-gh-kmakino

@ammolitor CI passed. It would be great if you could take another quick look. Thanks!

Oct 08 '21 06:10 sfc-gh-kmakino

I just realized that the scaler needs to be aware of this change. Converting this to a draft for now.

Oct 13 '21 15:10 sfc-gh-kmakino