fdb-joshua
Collect all test results and handle cancelled tests properly
This PR addresses 3 issues:
- When multiple tests start simultaneously, `started` can go beyond `max_runs`. In this scenario, the agent should wait for all tests to complete, rather than stop at `max_runs` and ignore the still-running jobs.
- When agents die, or are considered dead because they stopped heartbeating, we should ignore the cancelled tests, since we don't know their results.
- When enough agents are running to serve the available ensembles, the other agents should time out.
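
To make the first two points concrete, here is a minimal sketch of the intended accounting. It is not the code in this PR: the `EnsembleCounters` class, its field names other than `started` and `max_runs`, and both helper methods are hypothetical, chosen only to illustrate the idea that overshooting runs are waited for and cancelled runs are excluded.

```python
from dataclasses import dataclass


@dataclass
class EnsembleCounters:
    """Hypothetical per-ensemble counters; only `started` and `max_runs` are named in the PR."""
    max_runs: int
    started: int = 0
    passed: int = 0
    failed: int = 0
    cancelled: int = 0  # runs whose agent died or stopped heartbeating

    @property
    def ended(self) -> int:
        # Only runs that reported a result count as ended.
        return self.passed + self.failed

    def should_start_more(self) -> bool:
        # Cancelled runs produced no result, so they do not count toward max_runs.
        return self.started - self.cancelled < self.max_runs

    def is_complete(self) -> bool:
        # Wait for every run that is still in flight, even if `started`
        # overshot `max_runs` because several tests began at once.
        outstanding = self.started - self.cancelled - self.ended
        return self.ended >= self.max_runs and outstanding == 0


# Two tests overshot max_runs; the ensemble is not done until they report.
overshoot = EnsembleCounters(max_runs=10, started=12, passed=9, failed=1)
print(overshoot.is_complete())
overshoot.passed += 2
print(overshoot.is_complete())
```

Under these assumptions, an ensemble with `started=12` and `max_runs=10` is complete only once all twelve runs have either reported a result or been marked cancelled.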
Given that the unit tests failed (in the GitHub Actions build), I think this needs another look.
@ammolitor Can you tell what failed? build.sh works totally fine locally here.
@ammolitor CI passed. It would be great if you could take another quick look. Thanks!
I just realized the scaler needs to be aware of this change. Converting this to a draft for now.