ClusterShell.Propagation.RouteResolvingError: No route available to pm4-nod01
Hello,
When using clustershell with milkcheck, I have an error:
ClusterShell.Propagation.RouteResolvingError: No route available to pm4-nod01
The error appears as soon as I use a topology file.
The debug mode of milkcheck shows:
Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/MilkCheck/UI/Cli.py", line 538, in execute
self.manager.call_services(services, action, conf=self._conf)
File "/usr/lib/python3.6/site-packages/MilkCheck/ServiceManager.py", line 173, in call_services
self.run(action)
File "/usr/lib/python3.6/site-packages/MilkCheck/Engine/Service.py", line 236, in run
action_manager_self().run()
File "/usr/lib/python3.6/site-packages/MilkCheck/Engine/Action.py", line 182, in run
self._master_task.run()
File "/usr/lib/python3.6/site-packages/ClusterShell/Task.py", line 877, in run
self.resume(timeout)
File "/usr/lib/python3.6/site-packages/ClusterShell/Task.py", line 831, in resume
self._resume()
File "/usr/lib/python3.6/site-packages/ClusterShell/Task.py", line 794, in _resume
self._run(self.timeout)
File "/usr/lib/python3.6/site-packages/ClusterShell/Task.py", line 404, in _run
self._engine.run(timeout)
File "/usr/lib/python3.6/site-packages/ClusterShell/Engine/Engine.py", line 723, in run
self.runloop(timeout)
File "/usr/lib/python3.6/site-packages/ClusterShell/Engine/EPoll.py", line 170, in runloop
self.remove_stream(client, stream)
File "/usr/lib/python3.6/site-packages/ClusterShell/Engine/Engine.py", line 520, in remove_stream
self.remove(client)
File "/usr/lib/python3.6/site-packages/ClusterShell/Engine/Engine.py", line 495, in remove
self._remove(client, abort, did_timeout)
File "/usr/lib/python3.6/site-packages/ClusterShell/Engine/Engine.py", line 483, in _remove
client._close(abort=abort, timeout=did_timeout)
File "/usr/lib/python3.6/site-packages/ClusterShell/Worker/Exec.py", line 142, in _close
self.worker._check_fini()
File "/usr/lib/python3.6/site-packages/ClusterShell/Worker/Exec.py", line 384, in _check_fini
self._has_timeout)
File "/usr/lib/python3.6/site-packages/ClusterShell/Worker/Worker.py", line 55, in _eh_sigspec_invoke_compat
return method(*args)
File "/usr/lib/python3.6/site-packages/ClusterShell/Propagation.py", line 417, in ev_close
mw._relaunch(gateway)
File "/usr/lib/python3.6/site-packages/ClusterShell/Worker/Tree.py", line 404, in _relaunch
self._launch(targets)
File "/usr/lib/python3.6/site-packages/ClusterShell/Worker/Tree.py", line 265, in _launch
next_hops = self._distribute(self.task.info("fanout"), nodes.copy())
File "/usr/lib/python3.6/site-packages/ClusterShell/Worker/Tree.py", line 342, in _distribute
for gw, dstset in self.router.dispatch(dst_nodeset):
File "/usr/lib/python3.6/site-packages/ClusterShell/Propagation.py", line 106, in dispatch
yield self.next_hop(host), host
File "/usr/lib/python3.6/site-packages/ClusterShell/Propagation.py", line 141, in next_hop
str(dst))
ClusterShell.Propagation.RouteResolvingError: No route available to pm4-nod01
I cannot reproduce the error using clush only:
$ clush --remote=no -u2 -bw pm4-nod01 hostname
---------------
pm4-nod01
---------------
mngt0-2
$ clush -u2 -bw pm4-nod01 hostname
---------------
pm4-nod01
---------------
pm4-nod01
$ cat /etc/clustershell/topology.conf
[routes]
mngt0-1: mngt0-2
mngt0-2: @compute
Python version 3.6.8
As a temporary fix, I changed this:
--- /usr/lib/python3.6/site-packages/ClusterShell/Propagation.py.orig 2023-06-27 15:00:39.099237135 +0200
+++ /usr/lib/python3.6/site-packages/ClusterShell/Propagation.py 2023-06-27 15:00:47.504344461 +0200
@@ -405,7 +405,7 @@ class PropagationChannel(Channel):
self.logger.debug("ev_close rc=%s", self._rc) # may be None
# NOTE: self._rc may be None if the communication channel has aborted
- if self._rc != 0:
+ if self._rc != 0 and not self._rc == None:
self.logger.debug("error on gateway %s (setup=%s)", gateway,
self.setup)
self.task.router.mark_unreachable(gateway)
And this:
--- /bin/milkcheck.orig 2024-09-04 09:19:15.826180684 +0200
+++ /bin/milkcheck 2024-09-04 09:19:22.076099490 +0200
@@ -1,4 +1,4 @@
-#!/usr/libexec/platform-python
+#!/usr/bin/python3
#
# Copyright CEA (2011)
# Contributor: Jeremie TATIBOUET
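As a side note, the condition added in the Propagation.py change above could also be written in a slightly more idiomatic (but behaviorally identical) way:

# same behavior as "self._rc != 0 and not self._rc == None"
if self._rc is not None and self._rc != 0:
    ...  # handle the gateway error as before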
Hi, we have the same issue here since ClusterShell version 1.8.3. It seems that in gateway mode, at the end of the first communication with the gateway, the TreeWorker is brutally killed and no return code is set. This explains why @MarbolanGos's patch works.
Here is a reproducer
- topology.conf:
[routes]
vm0: vm1
vm1: vm2
- clustershell version
$ clush -Bw vm[0-2] clush --version
---------------
vm[0-2] (3)
---------------
clush 1.9.1
- reproducer
$ cat test_bad_gw_simple.py
#!/bin/python3
import logging
import sys
from ClusterShell.Task import task_self
task = task_self()
#logging.basicConfig(level=logging.DEBUG)
for i in range(0, 3):
    print(f"> Launching command on {sys.argv[1]} [{i}]")
    task.shell("/bin/uname -a", nodes=sys.argv[1])
    task.run()
task.join()
- Results
# No gateway, everything is fine
$ ./test_bad_gw_simple.py vm0
> Launching command on vm0 [0]
> Launching command on vm0 [1]
> Launching command on vm0 [2]
# The gateway is the running node, everything is fine
$ ./test_bad_gw_simple.py vm1
> Launching command on vm1 [0]
> Launching command on vm1 [1]
> Launching command on vm1 [2]
# The gateway is a distant node
$ ./test_bad_gw_simple.py vm2
> Launching command on vm2 [0]
> Launching command on vm2 [1]
Traceback (most recent call last):
File "./test_bad_gw_simple.py", line 15, in <module>
task.run()
File "/usr/lib/python3.6/site-packages/ClusterShell/Task.py", line 873, in run
self.resume(timeout)
File "/usr/lib/python3.6/site-packages/ClusterShell/Task.py", line 827, in resume
self._resume()
File "/usr/lib/python3.6/site-packages/ClusterShell/Task.py", line 790, in _resume
self._run(self.timeout)
File "/usr/lib/python3.6/site-packages/ClusterShell/Task.py", line 403, in _run
self._engine.run(timeout)
File "/usr/lib/python3.6/site-packages/ClusterShell/Engine/Engine.py", line 717, in run
self.start_ports()
File "/usr/lib/python3.6/site-packages/ClusterShell/Engine/Engine.py", line 689, in start_ports
self.register(port._start())
File "/usr/lib/python3.6/site-packages/ClusterShell/Worker/EngineClient.py", line 508, in _start
self.eh.ev_port_start(self)
File "/usr/lib/python3.6/site-packages/ClusterShell/Worker/Tree.py", line 122, in ev_port_start
self.treeworker._start()
File "/usr/lib/python3.6/site-packages/ClusterShell/Worker/Tree.py", line 220, in _start
self._launch(self.nodes)
File "/usr/lib/python3.6/site-packages/ClusterShell/Worker/Tree.py", line 265, in _launch
next_hops = self._distribute(self.task.info("fanout"), nodes.copy())
File "/usr/lib/python3.6/site-packages/ClusterShell/Worker/Tree.py", line 342, in _distribute
for gw, dstset in self.router.dispatch(dst_nodeset):
File "/usr/lib/python3.6/site-packages/ClusterShell/Propagation.py", line 106, in dispatch
yield self.next_hop(host), host
File "/usr/lib/python3.6/site-packages/ClusterShell/Propagation.py", line 141, in next_hop
str(dst))
ClusterShell.Propagation.RouteResolvingError: No route available to vm2
- Same run but with ClusterShell debug level activated: clustershell_debug_gateway.trace.txt
@MarbolanGos you can't reproduce it with clush because clush launches only one command per task through a gateway, whereas milkcheck launches multiple commands with the same task. The issue is in ClusterShell, but only when using the API.
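For illustration, here is a rough, untested sketch of the clush-like pattern, where each command gets its own task (and therefore its own gateway channel). It assumes the same vm0/vm1/vm2 topology and uses the ClusterShell.Task task_terminate() helper to drop the current task between commands:

#!/bin/python3
# Untested sketch: mimic clush by using a fresh task per command, so a
# gateway propagation channel is never reused after it has been aborted.
import sys

from ClusterShell.Task import task_self, task_terminate

for i in range(0, 3):
    task = task_self()    # new task (and new gateway channels)
    print(f"> Launching command on {sys.argv[1]} [{i}]")
    task.shell("/bin/uname -a", nodes=sys.argv[1])
    task.run()
    task_terminate()      # destroy the task bound to this thread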
Hope it will help :)
A good candidate for this bug seems to be this commit:
commit 89d6e1b166b7b6fc1bff4f3ccaa5c33d8303080b
Author: Stephane Thiell <[email protected]>
Date: Sat Sep 28 15:01:30 2019 -0700
Tree: bug fixes and improvements (#388,#419)
- propagate CLUSTERSHELL_GW_PYTHON_EXECUTABLE environment variable
to remote gateways (multi-hop)
- fix defect to properly close channel when worker has aborted
- on unexpected stderr messages from gateways, an initiator Channel
will now craft StdErrMessage objects so that subclasses like
PropagationChannel can let the user know about the error.
Closes #388.
Closes #419.
Change-Id: Ied4cb49e6bbe27bf89174bd2a1fc2c9bdc2e0557
Thanks for the great bug report!
I will have a look ASAP. I can't remember off the top of my head why I did that, but it was to fix something... 😁
Link to the commit you mentioned: https://github.com/cea-hpc/clustershell/commit/9e059e78534cffeeb7350d1be0d576465652dd5a
Hi,
It seems that this code fixes the issue:
diff --git a/lib/ClusterShell/Task.py b/lib/ClusterShell/Task.py
index f220949..cbc21c3 100644
--- a/lib/ClusterShell/Task.py
+++ b/lib/ClusterShell/Task.py
@@ -67,7 +67,7 @@ from ClusterShell.Engine.Factory import PreferredEngine
from ClusterShell.Worker.EngineClient import EnginePort, EngineClientError
from ClusterShell.Worker.Popen import WorkerPopen
from ClusterShell.Worker.Tree import TreeWorker
-from ClusterShell.Worker.Worker import FANOUT_UNLIMITED
+from ClusterShell.Worker.Worker import FANOUT_UNLIMITED, _eh_sigspec_invoke_compat
from ClusterShell.Event import EventHandler
from ClusterShell.MsgTree import MsgTree
@@ -1383,6 +1383,8 @@ class Task(object):
             if len(metaworkers) == 0:
                 logger.debug("pchannel_release: destroying channel %s",
                              chanworker.eh)
+                chanworker.eh._rc = 0
+                _eh_sigspec_invoke_compat(chanworker.eh.ev_hup, 2, self, gateway, 0)
                 chanworker.abort()
                 # delete gateway reference
                 del self.gateways[gwstr]
The idea is to set the event handler's return code to 0 before aborting the propagation channel. By doing this, we can keep the first patch, which checks that the return code is != 0.
I don't know if it's the right way to fix this, but I think it covers all cases. What do you think?
From the "bad commit" you linked at it seems that if _rc was left to None the channel didn't close normally, but that path does look like a normal close so this makes sense to me! Open a PR with that?
A couple of thoughts/questions:
- perhaps ensure _rc wasn't already set? This might overwrite a real error?
- Too many events, sorry... isn't that ev_hup Propagation.py's, in which case it just sets _rc, or is it another one? I'm trying to understand why it's needed here.
- (in "clush" mode it makes sense to close gateways when we're done with them, but for something like milkcheck, which can run multiple actions, perhaps it would make more sense to keep them around? Then again, manually managing gateway lifecycles when some are really temporary would be a pain, so it's probably just as good as is... and it's probably much bigger work, so definitely best to just fix this for now)
We have tested this on our system. Before:
[root@master01 ~]# milkcheck step0 -n compute01
[07:12:25] ERROR - Unexpected Exception : No route available to compute01
We applied the patch and now all tests pass. Thank you.
Thanks much. Let me explain how I think it is working at the moment, for reference.
A DistantWorker + PropagationChannel event handler is used for each gateway from the "initiator" (the head node in the tree). The DistantWorker is likely a WorkerSsh here. I will call them channels here. These channels are managed by a Task and are then used by TreeWorker instances to execute commands. For example, TreeWorker._execute_remote() calls this to run commands:
pchan = self.task._pchannel(gateway, self)
pchan.shell()
Anyway, when these actual commands finish, TreeWorker._check_fini() is called, which calls Task._pchannel_release() when there are no active targets left for this specific gateway from the TreeWorker's point of view. The TreeWorker calling this doesn't know whether the gateway is still used by other workers of the Task, which is why the last part is done at the Task level in Task._pchannel_release().
In Task._pchannel_release(), we update the status for the calling worker and then check whether the gateway is still used by any worker. If the gateway has no purpose anymore, Task._pchannel_release() calls chanworker.abort(), chanworker being the DistantWorker + PropagationChannel.
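Here is a runnable toy model of that release bookkeeping, purely illustrative (the class names are invented here, this is not the actual ClusterShell code):

class ToyChannelWorker:
    def abort(self):
        # in ClusterShell this path ends up in PropagationChannel.ev_close()
        # with self._rc still None, because no ev_hup was generated
        print("channel aborted (rc stays None)")

class ToyTask:
    def __init__(self):
        # gateway name -> (channel worker, set of tree workers still using it)
        self.gateways = {}

    def pchannel_release(self, gateway, metaworker):
        chanworker, metaworkers = self.gateways[gateway]
        metaworkers.discard(metaworker)    # the calling worker is done with it
        if not metaworkers:                # no other worker needs this gateway
            chanworker.abort()             # tear the channel down
            del self.gateways[gateway]

task = ToyTask()
task.gateways["vm1"] = (ToyChannelWorker(), {"worker-A", "worker-B"})
task.pchannel_release("vm1", "worker-A")   # still used by worker-B: kept
task.pchannel_release("vm1", "worker-B")   # last user gone: channel aborted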
PropagationChannel.ev_close() is called when the gateway channel is aborted (at chanworker.abort()) and also when a gateway error occurs (termination initiated by the other side). Here, on abort, because the termination is initiated on our side (not by the remote gateway process), no ev_hup is generated and rc stays at None. rc is the return code of the command (in this case, the gateway itself), but here we don't have one since we are the ones aborting it, so I believe it makes sense not to raise a fake ev_hup and to keep None in that case. This has been the current logic, but let me know if you have a better idea.
So what is the role of PropagationChannel.ev_close() here? It is mainly to handle gateway errors. If we get an actual rc > 0, that means the gateway is defective or misconfigured; in that case, we mark it as unreachable at the Task level. In addition, if the remote commands have not been launched yet, they are re-distributed to other available gateways (if any).
Now, I don't remember why I decided to treat rc=None as an error. It's probably a defect. I'm proposing a PR to fix this.
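For reference, the decision the fix aims for could be sketched like this (a simplified, hypothetical helper, not the actual ev_close() code):

def on_gateway_channel_close(rc, router, gateway):
    # rc is None when the close was initiated on our side (plain abort):
    # that is not a gateway failure, so do nothing
    if rc is None:
        return
    if rc != 0:
        # the gateway process itself failed (defective/misconfigured):
        # mark it unreachable so the router stops using it; commands that
        # were not launched yet can then be re-distributed to other gateways
        router.mark_unreachable(gateway)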
While I believe I now have a good understanding of how it works, there is one use case I'm not 100% sure about: what if the gateway is working but we suddenly receive some garbage from it that we can't parse, and we then decide to abort it? In that case, we won't see it as an error (rc will be None), and we won't be able to distinguish it from a normal gateway abort done because the gateway has no purpose anymore. If this happens, we won't mark the gateway as unreachable. It's probably not a big deal, but I wanted to mention it so we can think about it a bit more. Maybe an additional flag is needed to keep track of a normal/abnormal close.
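As a purely hypothetical illustration of that "additional flag" idea (names are invented here, this is not ClusterShell API):

class ChannelCloseState:
    """Track why a gateway channel was closed, so a local abort with no rc
    can still be classified as normal or abnormal."""

    def __init__(self):
        self.rc = None              # gateway exit code, if we ever got one
        self.clean_abort = False    # set only when we abort it on purpose

    def release(self):
        # normal teardown: the gateway simply has no purpose anymore
        self.clean_abort = True

    def abort_on_garbage(self):
        # unparsable data from the gateway: we abort it ourselves (rc stays
        # None) but this should not count as a clean close
        self.clean_abort = False

    def is_error(self):
        if self.rc is not None:
            return self.rc != 0      # real exit code: trust it
        return not self.clean_abort  # no rc: rely on the close-reason flag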