
Poll: polling is not attempted if the platform has run out of hosts

Open wxtim opened this issue 3 months ago • 6 comments

Description

  • If we experience issues contacting a host, the host is added to bad hosts.
  • Polling will purposefully exclude bad hosts.
  • So if there are no good hosts left, each poll attempt will in effect be skipped.
  • The bad hosts set is reset every PT30M (default), so it can take a long time before polling resumes.

This also applies to other IHS (intelligent host select) operations such as job log retrieval.
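
For illustration, a minimal sketch of the selection behaviour described above (the function name and module-level set are assumptions, not the actual cylc-flow code):

import random

bad_hosts = set()  # global state shared by all operations on all platforms

def select_host(platform_hosts):
    """Return a usable host, or None if every host is currently bad."""
    good = [host for host in platform_hosts if host not in bad_hosts]
    if not good:
        # Nothing left to try: the poll (or log retrieval) is effectively skipped.
        return None
    return random.choice(good)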

Reproducible Example

Using this workflow

[scheduling]
    cycling mode = integer
    [[graph]]
        R1 = pbs:started => breaker
[runtime]
    [[pbs]]
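        # assumes a platform called "remote" is defined in global.cylc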
        platform = remote
        script = sleep 1h
        execution polling intervals = PT15S
    [[breaker]]
        # Breaks your ssh config
        script = mv ~/.ssh/config ~/.ssh/config.bk
  • Wait for polling to report your platform unavailable.
  • Fix your ssh config.
  • Monitor as polling continues, without success.
  • Try killing the task (it doesn't have any effect).

Similarly, if one starts the workflow, waits for the task submission, runs cylc stop --now, breaks one's SSH config and then restarts the workflow, the initial polling doesn't reset the list of bad hosts after the poll attempt finishes...

wxtim avatar Sep 22 '25 14:09 wxtim

(have edited the OP to change it into an issue report rather than advocating for the clearing of bad hosts)

oliver-sanders avatar Oct 03 '25 14:10 oliver-sanders

If a host is detected as uncontactable, it is added to bad hosts and then excluded from use by subsequent commands.

The bad hosts list is reset:

  • Every PT30M (configurable).
  • On manual submission if there are no good hosts to submit to.

(as per prior proposals)

The idea behind clearing the bad hosts every PT30M is to allow us to detect resurrected hosts and avoid the situation this bug reports. However, PT30M is a long time to wait, and this approach is somewhat naive.
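
For context, a minimal sketch of what the interval-based reset amounts to (names assumed, not the real plugin code):

import time

RESET_INTERVAL = 30 * 60  # PT30M default, configurable

def reset_bad_hosts_if_due(bad_hosts, last_reset):
    """Wipe the whole set once the interval has elapsed, dead and recovered hosts alike."""
    now = time.time()
    if now - last_reset >= RESET_INTERVAL:
        bad_hosts.clear()
        return now
    return last_reset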

It has been suggested that other commands (e.g. poll) might also reset bad hosts if there are no good hosts left. However, this can cause resonance between operations since this state is global.


Proposal (a rethink of the "bad host" logic)

The "bad hosts" set contains valuable information, so just blindly wiping it every PT30M seems silly.

Rather than resetting bad hosts at a configured time interval, why not test them instead?

This proposal is to refactor reset bad hosts into check bad hosts:

  • Every so many minutes, the plugin would "check" the first "N" bad hosts.
    • Only check the first N hosts to avoid flooding the system with SSH calls.
    • The "first" N hosts would be chosen at random (since we don't preserve the order in which hosts were discovered to be bad), to avoid the situation where there is one good host but it never gets checked.
  • It would do this by running an SSH command; if the command returns 0, the host would be removed from bad hosts.
    • ssh true should be enough.
  • The SSH command would be run via the subprocess pool.
    • This will limit the number of SSH check commands run in parallel and prevent the Cylc servers from being overloaded in the event that a platform goes down.
  • The SSH commands will be run in the background from the perspective of the main loop plugin.
    • The main loop plugin cannot be blocking, so one iteration of the plugin would start the SSH processes; a subsequent call would collect the results and remove from bad hosts any hosts we can now confirm to be good.
  • We should run this plugin more frequently than PT30M.
    • Which is fine since it is no longer wiping valuable information.

With this approach, poll operations will still do nothing if there are no good hosts left (we should log that polling is not possible in this eventuality); however, with more frequent host checking this won't be a problem.
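
A minimal sketch of the proposed check cycle (the function names, the N constant and the direct use of subprocess.Popen are assumptions; in practice the SSH calls would go through the scheduler's subprocess pool):

import random
import subprocess

MAX_CHECKS_PER_CYCLE = 3  # "N": cap the number of SSH checks started per plugin run

def start_checks(bad_hosts):
    """Start background `ssh <host> true` checks for up to N randomly chosen bad hosts."""
    sample = random.sample(sorted(bad_hosts), min(MAX_CHECKS_PER_CYCLE, len(bad_hosts)))
    return {
        host: subprocess.Popen(
            ['ssh', '-oBatchMode=yes', host, 'true'],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        for host in sample
    }

def collect_checks(checks, bad_hosts):
    """On a later plugin iteration, clear any host whose check returned 0."""
    for host, proc in list(checks.items()):
        if proc.poll() is None:
            continue  # still running, collect on the next iteration
        if proc.returncode == 0:
            bad_hosts.discard(host)  # host has recovered
        del checks[host]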

oliver-sanders avatar Oct 03 '25 14:10 oliver-sanders

I like the idea!

hjoliver avatar Oct 09 '25 02:10 hjoliver

Dave is worried about the impact of the extra SSH'es that this will create if a platform is truly down. It shouldn't be too bad, as this mechanism would provide a per-workflow throttle on the number of these SSH'es created; however, it does mean that idle workflows will repeatedly poll bad hosts, which is a new behaviour.

Ideally, perhaps, we would only perform this checking on the next remote operation (rather than just doing it periodically). This is possible, but would be very tricky until #5017. I.e. use a real operation to perform the check. If the host is found to be working, then remove it from the bad hosts list. We would then, presumably, need to check the remainder of the hosts, or just wipe them from the bad hosts set; otherwise they would remain excluded indefinitely.
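
A minimal sketch of that alternative, taking the simpler "just wipe them" option (the hook name is hypothetical):

def on_remote_command_success(host, bad_hosts):
    """Called when a real remote operation succeeds on a host."""
    if host in bad_hosts:
        # The host has evidently recovered; rather than scheduling checks for the
        # remaining hosts, wipe the set so they aren't excluded indefinitely.
        bad_hosts.clear()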

oliver-sanders avatar Oct 10 '25 12:10 oliver-sanders

What if we did bad_hosts = {'hostname': <seconds_epoch>} and biased the random testing towards more recent bad_hosts?

Or even bad_hosts = {'hostname': [time, time2, time3]}, such that we can test hosts with fewer and more recent failures first?
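
A minimal sketch of the biased sampling using the single-timestamp form (the helper name is an assumption):

import random
import time

def pick_hosts_to_check(bad_hosts, n):
    """Weighted sample without replacement, favouring more recently failed hosts."""
    now = time.time()

    def sort_key(host):
        weight = 1.0 / (now - bad_hosts[host] + 1.0)  # newer failure => larger weight
        # Efraimidis-Spirakis key: taking the largest keys gives a weighted sample.
        return random.random() ** (1.0 / weight)

    return sorted(bad_hosts, key=sort_key, reverse=True)[:n]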

wxtim avatar Oct 13 '25 10:10 wxtim

#7055 reduces the urgency on this, punting the milestone back to 8.x.

oliver-sanders avatar Oct 22 '25 14:10 oliver-sanders