cylc-flow
`cylc scan --ping` not removing contact files
Describe the bug
According to its documentation, `cylc scan --ping` should remove contact files for suites it's not able to connect to.
--ping Test the connection to the flow. Scan normally just
reads flow contact files, but --ping forces a
connection to the scheduler and removes the contact
file if it is not found to be running (this can happen
if the scheduler gets killed and can't clean up after
itself).
If the server node cannot be contacted, however, only a warning is printed and the contact file remains:
# Server node cannot be contacted - session has ended
$ cylc scan --ping --verbose
2022-08-10T09:53:53+10:00 DEBUG - zmq:send {'command': 'graphql', 'args': {'request_string': 'query { workflows(ids: ["u-cp519/run10"]) { \nstatus\n } }',
'variables': {}}, 'meta': {'prog': 'scan', 'host': 'ood-vn26', 'comms_method': 'zmq'}}
2022-08-10T09:53:58+10:00 DEBUG - $ ssh -oBatchMode=yes -oConnectTimeout=10 ood-vn3 env CYLC_VERSION=8.0.0 bash --login -c 'exec "$0" "$@"'
/g/data/access/ngm/miniconda3/envs/cylc-8.0/ubin/cylc psutil # returned 255
Access denied by pam_slurm_adopt: you have no active jobs on this node
Connection closed by 10.0.128.131 port 22
2022-08-10T09:53:58+10:00 WARNING - Cannot determine whether workflow is running on ood-vn3.
/g/data/access/ngm/miniconda3/envs/cylc-8.0rc3/bin/python /g/data/access/ngm/miniconda3/envs/cylc-8.0rc3/bin/cylc play u-cp519
# Expect this file to be removed after `cylc scan --ping` failed to connect
$ ls ~/cylc-run/u-cp519/run10/.service/contact
/home/562/saw562/cylc-run/u-cp519/run10/.service/contact
Release version(s) and/or repository branch(es) affected?
$ cylc --version
8.0.0
Steps to reproduce the bug
- Have two Cylc servers on different computers sharing the same filesystem
- Start a Suite on server 1, then terminate server 1 while the suite is still running so the contact file remains
- On server 2 run `cylc scan --ping` to remove the contact file. Currently this does not remove the contact file.
- On server 2 run `cylc clean` to remove files from the failed run. Currently this does not remove the files, as the contact file is still present and server 1 cannot be contacted.
Expected behavior
The contact file should be removed after Cylc fails to connect to a running server
Additional context
My main goal is to be able to run `cylc clean` to clean up the failed run. The clean fails with an error if the contact file remains, so I'm trying to use `cylc scan --ping` to remove it.
We are running Cylc servers on a system that assigns you a specific node for your session, and doesn't allow you to connect to nodes you don't have a running session on. We know that if the ssh connection fails then there is not an active session and hence no cylc server can be running.
Pull requests welcome!
This is an Open Source project - please consider contributing a bug fix
yourself (please read CONTRIBUTING.md
before starting any work though).
The current behaviour is deliberate - ssh could fail for other reasons so we don't remove the contact file unless we can connect to the server to check whether the workflow is still running.
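To illustrate the reasoning, here is a minimal sketch (not Cylc's actual code) of a three-way check in which a failed ssh connection is treated as "unknown" rather than "not running". The names `HostCheck`, `classify` and `may_remove_contact_file` are hypothetical; the only facts relied on are that ssh itself exits with 255 when the connection fails, as in the log above, and that the contact file is only removed when the host could actually be queried.

```python
from enum import Enum


class HostCheck(Enum):
    """Hypothetical outcome of asking the run host about the scheduler."""
    RUNNING = "running"
    NOT_RUNNING = "not running"
    UNKNOWN = "unknown"


# ssh exits with 255 when ssh itself fails (e.g. connection refused or denied).
SSH_ERROR_EXIT = 255


def classify(ssh_returncode: int, process_found: bool) -> HostCheck:
    """Classify the result of a remote process check run over ssh."""
    if ssh_returncode != 0:
        # Includes SSH_ERROR_EXIT: the host could not be queried, so we
        # cannot tell whether the scheduler is still alive.
        return HostCheck.UNKNOWN
    return HostCheck.RUNNING if process_found else HostCheck.NOT_RUNNING


def may_remove_contact_file(result: HostCheck) -> bool:
    """Only a confirmed 'not running' justifies removing the contact file."""
    return result is HostCheck.NOT_RUNNING
```

With this logic the `returned 255` case from the report falls into `UNKNOWN`, which is why only a warning is printed.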
We need to think about the best way to address your requirement.
> We are running Cylc servers on a system that assigns you a specific node for your session, and doesn't allow you to connect to nodes you don't have a running session on. We know that if the ssh connection fails then there is not an active session and hence no cylc server can be running.
Ha, I've run into the same problem for NeSI HPC users in NZ who come in via JupyterHub to an interactive Slurm session. Once the session ends, you don't have access to the node that it ran on.
Would a potential solution be to add a site config option that treats a session as not running if it's not contactable by SSH? That could then apply to all tools, allowing `cylc clean` to be used directly without needing to remove the contact files.
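As a rough sketch of how such an option might behave (purely illustrative: the option name `unreachable_means_stopped` and the function `treat_as_stopped` are hypothetical and do not exist in Cylc):

```python
def treat_as_stopped(
    ssh_returncode: int,
    process_found: bool,
    unreachable_means_stopped: bool = False,
) -> bool:
    """Decide whether a workflow should be considered stopped.

    unreachable_means_stopped stands in for a hypothetical site config
    option; it defaults to False, which keeps the cautious behaviour.
    """
    if ssh_returncode == 0:
        # The host answered, so trust the process check.
        return not process_found
    if ssh_returncode == 255:
        # ssh could not connect: only sites that opt in treat this as stopped.
        return unreachable_means_stopped
    # Any other failure of the remote command stays inconclusive.
    return False
```

Keeping the default at `False` would preserve the existing behaviour for sites where an ssh failure does not imply that the session has ended.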
Yeah, that could be a solution. I'm not sure we could do anything else really, given inability to access the original host.
I've added this to the agenda for tonight's project meeting (8pm NZ time, I can forward Teams invite if you'd like to attend - but no pressure!)
Actually @ScottWales - on my system, if this happens we get an error message saying something like "access denied because you don't have any processes running on this node". Do you get anything similar? In the unlikely event that this is a standard response, we could presumably parse it and infer that nothing is running there, and so delete the contact file.
The ssh command returns
Access denied by pam_slurm_adopt: you have no active jobs on this node
Connection closed by 10.0.128.131 port 22
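For illustration, parsing that message could look something like the sketch below. It assumes the wording shown above is stable across sites, which has not been confirmed in this thread; the function name `denied_for_no_active_jobs` is hypothetical.

```python
import re

# Message as it appears in the ssh output quoted above. Whether this wording
# is standard across pam_slurm_adopt installations is an assumption.
NO_ACTIVE_JOBS = re.compile(
    r"Access denied by pam_slurm_adopt: you have no active jobs on this node"
)


def denied_for_no_active_jobs(ssh_stderr: str) -> bool:
    """Return True if ssh was refused because the user has no jobs on the node."""
    return NO_ACTIVE_JOBS.search(ssh_stderr) is not None


example = (
    "Access denied by pam_slurm_adopt: you have no active jobs on this node\n"
    "Connection closed by 10.0.128.131 port 22\n"
)
print(denied_for_no_active_jobs(example))
```

A generic failure such as a timeout would not match, so it would still be treated as inconclusive.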
Interesting, I'll compare my result later ...
And sure, I'd like to come along to the meeting, it would be good to check we've got our installation set up properly. My email's now scott.wales at bom.gov.au
Invite forwarded. (Also an invite to the "Cylc General" Element chat room, in case the Teams invite borks for some reason).
See also #5013