cylc-flow
gethostbyname_ex error when running Cylc 8 server behind SSH bastion
Describe the bug
We are testing a new environment for our Cylc servers, where the Cylc servers run on a small cluster, OOD, separate from the main HPC. A user requests a node on the OOD system, e.g. ood-vn17, then connects to that node by ssh via a bastion server in order to run Cylc. Individual OOD nodes are not externally accessible without going through the bastion server.
I have set up Cylc to use communication method = ssh for the hpc platform, and have SSH configured on the HPC so that ssh ood-vn17 will automatically tunnel through the bastion server using ProxyJump.
When running a workflow, however, communications from the HPC to the Cylc server fail. Setting debug=true in a job suite gives the error message:
Traceback (most recent call last):
  File "/g/data/access/ngm/miniconda3/envs/cylc-8.0rc3/lib/python3.9/site-packages/cylc/flow/task_message.py", line 107, in send_messages
    pclient = get_client(workflow)
  File "/g/data/access/ngm/miniconda3/envs/cylc-8.0rc3/lib/python3.9/site-packages/cylc/flow/network/client_factory.py", line 55, in get_client
    return get_runtime_client(get_comms_method(), workflow, timeout=timeout)
  File "/g/data/access/ngm/miniconda3/envs/cylc-8.0rc3/lib/python3.9/site-packages/cylc/flow/network/client_factory.py", line 49, in get_runtime_client
    return WorkflowRuntimeClient(workflow, timeout=timeout)
  File "/g/data/access/ngm/miniconda3/envs/cylc-8.0rc3/lib/python3.9/site-packages/cylc/flow/network/ssh_client.py", line 52, in __init__
    self.host, _, _ = get_location(workflow)
  File "/g/data/access/ngm/miniconda3/envs/cylc-8.0rc3/lib/python3.9/site-packages/cylc/flow/network/__init__.py", line 83, in get_location
    host = get_fqdn_by_host(host)
  File "/g/data/access/ngm/miniconda3/envs/cylc-8.0rc3/lib/python3.9/site-packages/cylc/flow/hostuserutil.py", line 265, in get_fqdn_by_host
    return HostUtil.get_inst().get_fqdn_by_host(target)
  File "/g/data/access/ngm/miniconda3/envs/cylc-8.0rc3/lib/python3.9/site-packages/cylc/flow/hostuserutil.py", line 171, in get_fqdn_by_host
    return self._get_host_info(target)[0]
  File "/g/data/access/ngm/miniconda3/envs/cylc-8.0rc3/lib/python3.9/site-packages/cylc/flow/hostuserutil.py", line 135, in _get_host_info
    self._host_exs[target] = socket.gethostbyname_ex(target)
socket.gaierror: [Errno -2] Name or service not known: 'ood-vn17.z48'
gethostbyname_ex is expected to fail here, as ood-vn17 is not network accessible from the HPC.
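For anyone wanting to confirm the same failure mode on their own job host, a minimal sketch along these lines may help. It only wraps the standard-library call that appears at the bottom of the traceback; `describe_lookup` is a name made up for this illustration and is not part of Cylc.

```python
import socket


def describe_lookup(name):
    """Report what socket.gethostbyname_ex makes of a host name."""
    try:
        canonical, _aliases, addresses = socket.gethostbyname_ex(name)
    except socket.gaierror as exc:
        # Raised when the resolver cannot find the name, as in the traceback.
        return f"{name}: lookup failed ({exc})"
    return f"{name}: {canonical} {addresses}"
```

Calling `describe_lookup("ood-vn17.z48")` from the HPC side should report a failed lookup if the node name really is unresolvable there, independently of whether `ssh ood-vn17` works through the bastion.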
Release version(s) and/or repository branch(es) affected?
$ cylc --version
8.0rc3
Steps to reproduce the bug
Configure the network so that the Cylc server can't be seen from the platform, e.g. by using an SSH bastion, and set the platform communication method to ssh.
Expected behavior
If I submit a Cylc 8 task from my OOD node with communication method = ssh and appropriate SSH configuration for ProxyJump, I expect communications to work by tunnelling through the bastion server, without giving an error from other network connection types.
Screenshots
Additional context
Pull requests welcome!
This is an Open Source project - please consider contributing a bug fix yourself (please read CONTRIBUTING.md before starting any work though).
This isn't a use case we currently support but it should be simple to get it working.
If you are able to patch your Cylc installation I think this should permit your use case:
diff --git a/cylc/flow/network/__init__.py b/cylc/flow/network/__init__.py
index e456de3fb..3f4e91e70 100644
--- a/cylc/flow/network/__init__.py
+++ b/cylc/flow/network/__init__.py
@@ -16,8 +16,10 @@
 """Package for network interfaces to Cylc scheduler objects."""
 
 import asyncio
+from contextlib import suppress
 import getpass
 import json
+import socket
 
 import zmq
 import zmq.asyncio
@@ -78,7 +80,8 @@ def get_location(workflow: str):
         raise WorkflowStopped(workflow)
 
     host = contact[ContactFileFields.HOST]
-    host = get_fqdn_by_host(host)
+    with suppress(socket.gaierror):
+        host = get_fqdn_by_host(host)
     port = int(contact[ContactFileFields.PORT])
     if ContactFileFields.PUBLISH_PORT in contact:
         pub_port = int(contact[ContactFileFields.PUBLISH_PORT])
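The effect of the patch can be sketched outside Cylc roughly as follows. This is an illustration under assumptions, not Cylc's code: `resolve_or_keep`, `_fqdn_via_dns` and the `resolver` parameter are hypothetical names introduced here, and the default resolver only mimics what the traceback shows `_get_host_info` doing (taking the first element of `socket.gethostbyname_ex`).

```python
import socket
from contextlib import suppress


def _fqdn_via_dns(host):
    # Stand-in for the lookup in the traceback: the first element of the
    # gethostbyname_ex result is the canonical host name.
    return socket.gethostbyname_ex(host)[0]


def resolve_or_keep(host, resolver=_fqdn_via_dns):
    """Return the resolved name for host, or host unchanged if lookup fails."""
    # Same pattern as the patch: a failed lookup is ignored, so the name
    # read from the contact file is passed through as-is.
    with suppress(socket.gaierror):
        host = resolver(host)
    return host
```

With this fallback the name from the contact file reaches the SSH client unresolved, which leaves it to the job host's SSH configuration (for example a ProxyJump rule) to decide how to reach the server. The `resolver` argument exists only so the behaviour can be exercised without real DNS.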
Thanks, this has worked well. I also had to modify the host self-identification so that it uses the hostname rather than the FQDN. If no issues come up in our testing, I'll make a pull request with both changes.
@ScottWales how did it work out?