satellite
Limit number of hosts that can be disabled via black hole detector
The concern here is that once a host is considered a black hole, there is no way for it to be automatically re-added to the whitelist: once removed from the whitelist, it is no longer assigned tasks, so it never gets a chance to prove that it can complete tasks again. Thus, if something happens that blackholes a very large number of hosts, re-enabling them would require extensive manual intervention.
Suggested safeguards:
- configurable limit on the number of hosts that can be removed via black hole detection per minute
- configurable maximum percentage of total hosts that can be considered black holes at any given time
If a host would be "black holed", but doing so would violate one of the safeguards above, log the issue and move on (take no other action). If the host continues to fail a lot of tasks, it will be re-evaluated as a black hole soon enough (and maybe by then there will be enough room in the black hole dungeon).
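
To make the proposal concrete, here is a rough Python sketch of how the two safeguards could gate a removal. It is only an illustration, not the project's actual code: the `BlackHoleGuard` class, its method names, and the parameters `max_removals_per_minute` and `max_blackholed_percent` are all hypothetical, and the sketch assumes the detector calls `try_blackhole` each time it decides a host looks like a black hole.

```python
import logging
import time
from collections import deque

logger = logging.getLogger("blackhole_guard")


class BlackHoleGuard:
    """Hypothetical gate in front of the black hole detector's removals."""

    def __init__(self, max_removals_per_minute, max_blackholed_percent,
                 clock=time.monotonic):
        self.max_removals_per_minute = max_removals_per_minute
        self.max_blackholed_percent = max_blackholed_percent
        self.clock = clock               # injectable so it can be tested
        self.recent_removals = deque()   # timestamps of accepted removals
        self.blackholed = set()          # hosts currently off the whitelist

    def try_blackhole(self, host, total_hosts):
        """Return True if the host may be removed, False if a limit blocks it."""
        now = self.clock()
        # Forget removals that are older than the 60-second window.
        while self.recent_removals and now - self.recent_removals[0] >= 60:
            self.recent_removals.popleft()

        if host in self.blackholed:
            return False  # already removed, nothing to do

        # Safeguard 1: removals per minute.
        if len(self.recent_removals) >= self.max_removals_per_minute:
            logger.warning("not blackholing %s: per-minute limit reached", host)
            return False

        # Safeguard 2: percentage of all hosts blackholed at once.
        # Integer arithmetic avoids floating-point rounding at the boundary.
        would_be = len(self.blackholed) + 1
        if total_hosts <= 0 or would_be * 100 > self.max_blackholed_percent * total_hosts:
            logger.warning("not blackholing %s: percentage limit reached", host)
            return False

        self.blackholed.add(host)
        self.recent_removals.append(now)
        return True

    def restore(self, host):
        """Called when a host is manually re-added to the whitelist."""
        self.blackholed.discard(host)
```

A rejected host is only logged; no state is kept about the rejection, so if the host keeps failing tasks the detector will simply call `try_blackhole` for it again later.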