
contact: detect crashed workflows

Open oliver-sanders opened this issue 2 years ago • 2 comments

Idea of @dpmatthews

At the moment, if a client connection (ZMQ/TCP) fails, we SSH to the scheduler server where the workflow was running and perform a process listing. If the process is found not to be running, we delete the contact file. This permits the workflow to be rerun, whereas previously users had to hunt these files down manually.
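For anyone picking this up, here is a rough sketch of the check-then-delete logic described above. It is not the actual cylc-flow code: the `PID` key, the `KEY=VALUE` parsing, and the function names are assumptions for illustration, and the remote SSH process listing is replaced by a local `os.kill(pid, 0)` probe that can be swapped out.

```python
import os
from pathlib import Path
from typing import Callable, Dict


def parse_contact(path: Path) -> Dict[str, str]:
    """Parse KEY=VALUE lines from a contact file (format assumed)."""
    info = {}
    for line in path.read_text().splitlines():
        key, sep, value = line.partition("=")
        if sep:
            info[key.strip()] = value.strip()
    return info


def pid_is_running(pid: int) -> bool:
    """Local stand-in for the remote process listing (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 probes for existence, sends nothing
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def clean_up_if_dead(
    contact: Path,
    is_running: Callable[[int], bool] = pid_is_running,
) -> bool:
    """Delete the contact file if the recorded process has gone.

    Returns True if the file was removed, False if the process is alive.
    "PID" is a hypothetical key name used only in this sketch.
    """
    info = parse_contact(contact)
    if is_running(int(info["PID"])):
        return False
    contact.unlink()
    return True
```

Passing `is_running` as a parameter keeps the decision logic separate from how the process is looked up, so an SSH-based check could be dropped in without touching the rest.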

Instead of just deleting these contact files, we could provide a command to list crashed workflows, e.g.:

  • cylc scan --state=crashed
  • cylc play $(cylc scan --state=crashed)

The UIS could use this information and alert users to crashes. Sysadmins could potentially scan for crashed workflows.

This needs a little thought. For example, if we don't remove the contact file, then any client connections (e.g. cylc message commands from orphaned jobs) will keep attempting to connect to the workflow, which could cause additional load. Perhaps we would want to mv contact contact.crashed or something like that.
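To make the rename idea concrete, here is a minimal sketch of marking a workflow as crashed and then listing crashed workflows by looking for the renamed file. The `.crashed` suffix comes from the suggestion above; the function names and the `<run_root>/<workflow>/.service/contact` layout are assumptions of the sketch, not confirmed behaviour.

```python
from pathlib import Path
from typing import List

CRASHED_SUFFIX = ".crashed"  # naming taken from the suggestion, not final


def mark_crashed(contact: Path) -> Path:
    """Rename `contact` to `contact.crashed` and return the new path.

    Clients looking for `contact` will no longer find it, so they stop
    trying to connect, but the evidence of the crash is preserved.
    """
    crashed = contact.with_name(contact.name + CRASHED_SUFFIX)
    contact.rename(crashed)
    return crashed


def list_crashed(run_root: Path) -> List[str]:
    """List workflow IDs under run_root that have a crashed marker.

    Assumes the layout <run_root>/<workflow id>/.service/contact.crashed.
    """
    marker = "contact" + CRASHED_SUFFIX
    found = []
    for path in run_root.rglob(marker):
        if path.parent.name == ".service":
            workflow_dir = path.parent.parent
            found.append(workflow_dir.relative_to(run_root).as_posix())
    return sorted(found)
```

Something like `list_crashed` is what a `--state=crashed` filter would need underneath. A restart would then have to tolerate or clear the `.crashed` file.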

Probably a fairly straightforward feature to implement.

Pull requests welcome!

oliver-sanders, May 05 '22 14:05

Good idea.

Instead of either removing the contact file (which prevents connection attempts and allows restart, but gets rid of the crash evidence) or leaving it as-is (which results in useless connection attempts), maybe there's a middle ground: add a line to the contact file to indicate that the server is down? (After which, deliberate removal of the file would be required.)
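Roughly what I have in mind, as a sketch only: the `SERVER_DOWN` key, the `KEY=VALUE` format, and the function names are made up for illustration. Clients would call something like `server_is_down` before attempting a connection.

```python
from pathlib import Path

DOWN_KEY = "SERVER_DOWN"  # hypothetical key, not an existing contact field


def server_is_down(contact: Path) -> bool:
    """Return True if the contact file carries the server-down marker."""
    for line in contact.read_text().splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == DOWN_KEY and value.strip() == "true":
            return True
    return False


def flag_server_down(contact: Path) -> None:
    """Append the marker line to the contact file, at most once."""
    if server_is_down(contact):
        return
    text = contact.read_text()
    if text and not text.endswith("\n"):
        text += "\n"  # keep the marker on its own line
    contact.write_text(text + f"{DOWN_KEY}=true\n")
```

The original fields stay in place, so the host and process details of the crashed server remain available for diagnosis.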

hjoliver, May 06 '22 06:05

See also https://github.com/cylc/cylc-uiserver/issues/257

oliver-sanders, May 06 '22 12:05