
ClusterShell.Propagation.RouteResolvingError: No route available to pm4-nod01

MarbolanGos opened this issue 1 year ago • 6 comments

Hello,

When using clustershell with milkcheck, I have an error:

ClusterShell.Propagation.RouteResolvingError: No route available to pm4-nod01

The error comes as soon as I have a topology file.

The debug mode of milkcheck shows:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/MilkCheck/UI/Cli.py", line 538, in execute
    self.manager.call_services(services, action, conf=self._conf)
  File "/usr/lib/python3.6/site-packages/MilkCheck/ServiceManager.py", line 173, in call_services
    self.run(action)
  File "/usr/lib/python3.6/site-packages/MilkCheck/Engine/Service.py", line 236, in run
    action_manager_self().run()
  File "/usr/lib/python3.6/site-packages/MilkCheck/Engine/Action.py", line 182, in run
    self._master_task.run()
  File "/usr/lib/python3.6/site-packages/ClusterShell/Task.py", line 877, in run
    self.resume(timeout)
  File "/usr/lib/python3.6/site-packages/ClusterShell/Task.py", line 831, in resume
    self._resume()
  File "/usr/lib/python3.6/site-packages/ClusterShell/Task.py", line 794, in _resume
    self._run(self.timeout)
  File "/usr/lib/python3.6/site-packages/ClusterShell/Task.py", line 404, in _run
    self._engine.run(timeout)
  File "/usr/lib/python3.6/site-packages/ClusterShell/Engine/Engine.py", line 723, in run
    self.runloop(timeout)
  File "/usr/lib/python3.6/site-packages/ClusterShell/Engine/EPoll.py", line 170, in runloop
    self.remove_stream(client, stream)
  File "/usr/lib/python3.6/site-packages/ClusterShell/Engine/Engine.py", line 520, in remove_stream
    self.remove(client)
  File "/usr/lib/python3.6/site-packages/ClusterShell/Engine/Engine.py", line 495, in remove
    self._remove(client, abort, did_timeout)
  File "/usr/lib/python3.6/site-packages/ClusterShell/Engine/Engine.py", line 483, in _remove
    client._close(abort=abort, timeout=did_timeout)
  File "/usr/lib/python3.6/site-packages/ClusterShell/Worker/Exec.py", line 142, in _close
    self.worker._check_fini()
  File "/usr/lib/python3.6/site-packages/ClusterShell/Worker/Exec.py", line 384, in _check_fini
    self._has_timeout)
  File "/usr/lib/python3.6/site-packages/ClusterShell/Worker/Worker.py", line 55, in _eh_sigspec_invoke_compat
    return method(*args)
  File "/usr/lib/python3.6/site-packages/ClusterShell/Propagation.py", line 417, in ev_close
    mw._relaunch(gateway)
  File "/usr/lib/python3.6/site-packages/ClusterShell/Worker/Tree.py", line 404, in _relaunch
    self._launch(targets)
  File "/usr/lib/python3.6/site-packages/ClusterShell/Worker/Tree.py", line 265, in _launch
    next_hops = self._distribute(self.task.info("fanout"), nodes.copy())
  File "/usr/lib/python3.6/site-packages/ClusterShell/Worker/Tree.py", line 342, in _distribute
    for gw, dstset in self.router.dispatch(dst_nodeset):
  File "/usr/lib/python3.6/site-packages/ClusterShell/Propagation.py", line 106, in dispatch
    yield self.next_hop(host), host
  File "/usr/lib/python3.6/site-packages/ClusterShell/Propagation.py", line 141, in next_hop
    str(dst))
ClusterShell.Propagation.RouteResolvingError: No route available to pm4-nod01

I cannot reproduce the error using clush only:

$ clush --remote=no -u2 -bw pm4-nod01 hostname
---------------
pm4-nod01
---------------
mngt0-2
$ clush -u2 -bw pm4-nod01 hostname
---------------
pm4-nod01
---------------
pm4-nod01
$ cat /etc/clustershell/topology.conf
[routes]
mngt0-1: mngt0-2
mngt0-2: @compute

Python version 3.6.8

As a temporary fix, I changed this:

--- /usr/lib/python3.6/site-packages/ClusterShell/Propagation.py.orig   2023-06-27 15:00:39.099237135 +0200
+++ /usr/lib/python3.6/site-packages/ClusterShell/Propagation.py        2023-06-27 15:00:47.504344461 +0200
@@ -405,7 +405,7 @@ class PropagationChannel(Channel):
         self.logger.debug("ev_close rc=%s", self._rc) # may be None

         # NOTE: self._rc may be None if the communication channel has aborted
-        if self._rc != 0:
+        if self._rc != 0 and not self._rc == None:
             self.logger.debug("error on gateway %s (setup=%s)", gateway,
                               self.setup)
             self.task.router.mark_unreachable(gateway)

And this:

--- /bin/milkcheck.orig 2024-09-04 09:19:15.826180684 +0200
+++ /bin/milkcheck      2024-09-04 09:19:22.076099490 +0200
@@ -1,4 +1,4 @@
-#!/usr/libexec/platform-python
+#!/usr/bin/python3
 #
 # Copyright CEA (2011)
 #  Contributor: Jeremie TATIBOUET
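
For reference, the condition added in the first hunk can be written more idiomatically. Here is a minimal standalone sketch of the same check (an illustration only, not the upstream ClusterShell code):

# Idiomatic form of the patched condition: treat rc=None ("no return code,
# e.g. the channel was aborted locally") as not-an-error, and only flag a
# real non-zero gateway return code.
def gateway_failed(rc):
    return rc is not None and rc != 0

assert gateway_failed(1) is True       # real gateway error
assert gateway_failed(0) is False      # clean exit
assert gateway_failed(None) is False   # locally aborted channel, no rc available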

MarbolanGos commented Sep 04 '24

Hi, we have the same issue here since ClusterShell version 1.8.3. It seems that in gateway mode, at the end of the first communication with the gateway, the TreeWorker is abruptly killed and no return code is set. This explains why @MarbolanGos's patch works.

Here is a reproducer:

  • topology.conf:
[routes]
vm0: vm1
vm1: vm2
  • clustershell version
$ clush -Bw vm[0-2] clush --version
---------------
vm[0-2] (3)
---------------
clush 1.9.1
  • reproducer
$ cat test_bad_gw_simple.py
#!/bin/python3

import logging
import sys

from ClusterShell.Task import task_self

task = task_self()

#logging.basicConfig(level=logging.DEBUG)

for i in range(0,3):
    print(f"> Launching command on {sys.argv[1]} [{i}]")
    task.shell("/bin/uname -a", nodes=sys.argv[1])
    task.run()
    task.join()
  • Results
# No gateway everything is fine
$ ./test_bad_gw_simple.py vm0
> Launching command on vm0 [0]
> Launching command on vm0 [1]
> Launching command on vm0 [2]

# The gateway is the running node everything is fine
$ ./test_bad_gw_simple.py vm1
> Launching command on vm1 [0]
> Launching command on vm1 [1]
> Launching command on vm1 [2]

# The gateway is a distant node
$ ./test_bad_gw_simple.py vm2
> Launching command on vm2 [0]
> Launching command on vm2 [1]
Traceback (most recent call last):
  File "./test_bad_gw_simple.py", line 15, in <module>
    task.run()
  File "/usr/lib/python3.6/site-packages/ClusterShell/Task.py", line 873, in run
    self.resume(timeout)
  File "/usr/lib/python3.6/site-packages/ClusterShell/Task.py", line 827, in resume
    self._resume()
  File "/usr/lib/python3.6/site-packages/ClusterShell/Task.py", line 790, in _resume
    self._run(self.timeout)
  File "/usr/lib/python3.6/site-packages/ClusterShell/Task.py", line 403, in _run
    self._engine.run(timeout)
  File "/usr/lib/python3.6/site-packages/ClusterShell/Engine/Engine.py", line 717, in run
    self.start_ports()
  File "/usr/lib/python3.6/site-packages/ClusterShell/Engine/Engine.py", line 689, in start_ports
    self.register(port._start())
  File "/usr/lib/python3.6/site-packages/ClusterShell/Worker/EngineClient.py", line 508, in _start
    self.eh.ev_port_start(self)
  File "/usr/lib/python3.6/site-packages/ClusterShell/Worker/Tree.py", line 122, in ev_port_start
    self.treeworker._start()
  File "/usr/lib/python3.6/site-packages/ClusterShell/Worker/Tree.py", line 220, in _start
    self._launch(self.nodes)
  File "/usr/lib/python3.6/site-packages/ClusterShell/Worker/Tree.py", line 265, in _launch
    next_hops = self._distribute(self.task.info("fanout"), nodes.copy())
  File "/usr/lib/python3.6/site-packages/ClusterShell/Worker/Tree.py", line 342, in _distribute
    for gw, dstset in self.router.dispatch(dst_nodeset):
  File "/usr/lib/python3.6/site-packages/ClusterShell/Propagation.py", line 106, in dispatch
    yield self.next_hop(host), host
  File "/usr/lib/python3.6/site-packages/ClusterShell/Propagation.py", line 141, in next_hop
    str(dst))
ClusterShell.Propagation.RouteResolvingError: No route available to vm2

@MarbolanGos you can't reproduce it with clush since clush launches a command only once per task through a gateway, whereas milkcheck uses the same task to launch multiple commands. The issue is in ClusterShell itself, but only when using the API.

Hope this helps :)

cedeyn commented Mar 18 '25

A good candidate for this bug seems to be this commit:

commit 89d6e1b166b7b6fc1bff4f3ccaa5c33d8303080b
Author: Stephane Thiell <[email protected]>
Date:   Sat Sep 28 15:01:30 2019 -0700

    Tree: bug fixes and improvements (#388,#419)
    
    - propagate CLUSTERSHELL_GW_PYTHON_EXECUTABLE environment variable
      to remote gateways (multi-hop)
    - fix defect to properly close channel when worker has aborted
    - on unexpected stderr messages from gateways, an initiator Channel
      will now craft StdErrMessage objects so that subclasses like
      PropagationChannel can let the user know about the error.
    
    Closes #388.
    Closes #419.
    
    Change-Id: Ied4cb49e6bbe27bf89174bd2a1fc2c9bdc2e0557

cedeyn commented Mar 18 '25

Thanks for the great bug report!

I will have a look ASAP. I can't remember off the top of my head why I did that, but it was to fix something... 😁

Link to the commit you mentioned: https://github.com/cea-hpc/clustershell/commit/9e059e78534cffeeb7350d1be0d576465652dd5a

thiell commented Mar 18 '25

Hi,

It seems that this code fixes the issue:

diff --git a/lib/ClusterShell/Task.py b/lib/ClusterShell/Task.py
index f220949..cbc21c3 100644
--- a/lib/ClusterShell/Task.py
+++ b/lib/ClusterShell/Task.py
@@ -67,7 +67,7 @@ from ClusterShell.Engine.Factory import PreferredEngine
 from ClusterShell.Worker.EngineClient import EnginePort, EngineClientError
 from ClusterShell.Worker.Popen import WorkerPopen
 from ClusterShell.Worker.Tree import TreeWorker
-from ClusterShell.Worker.Worker import FANOUT_UNLIMITED
+from ClusterShell.Worker.Worker import FANOUT_UNLIMITED, _eh_sigspec_invoke_compat
 
 from ClusterShell.Event import EventHandler
 from ClusterShell.MsgTree import MsgTree
@@ -1383,6 +1383,8 @@ class Task(object):
             if len(metaworkers) == 0:
                 logger.debug("pchannel_release: destroying channel %s",
                             chanworker.eh)
+                chanworker.eh._rc = 0
+                _eh_sigspec_invoke_compat(chanworker.eh.ev_hup, 2, self, gateway, 0)
                 chanworker.abort()
                 # delete gateway reference
                 del self.gateways[gwstr]

The idea is to set the event handler's return code to 0 before aborting the propagation channel. By doing this, we can keep the first patch that checks for a return code != 0.

I don't know if it's the right way to fix it, but I think it covers all cases. What do you think?

cedeyn commented Apr 07 '25

From the "bad commit" you linked to, it seems that if _rc was left at None the channel didn't close normally, but that path does look like a normal close, so this makes sense to me! Open a PR with that?

A couple of thoughts/questions:

  • perhaps ensure _rc wasn't set already? This might overwrite some real error? (a small sketch of such a guard follows this list)
  • Too many events, sorry... isn't that ev_hup the one from Propagation.py, in which case it just sets _rc, or is it another one? Trying to understand why it's needed here...
  • (in "clush" mode it makes sense to close gateways when we're done with them, but for something like milkcheck that can run multiple actions, perhaps it'd make more sense to keep them around? But I guess manually managing gateway lifecycles would be a pain if some are really temporary, so it's probably just as good as is... And it's probably much bigger work, so definitely best to just fix this for now)
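
A small standalone sketch of the guard from the first point, so a real non-zero return code already reported by the gateway would not be overwritten (the names are illustrative, this is not the actual Task._pchannel_release() code):

# Only inject a synthetic "clean close" return code when the gateway never
# reported one; keep any real error code that was already recorded.
class FakeChannelHandler:
    def __init__(self, rc=None):
        self._rc = rc

def release_channel(handler):
    if handler._rc is None:
        handler._rc = 0     # local abort with no rc -> treat as clean close
    return handler._rc

assert release_channel(FakeChannelHandler()) == 0      # normal local teardown
assert release_channel(FakeChannelHandler(rc=1)) == 1  # real gateway error preserved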

martinetd commented Apr 07 '25

We have tested this on our system. Before:

[root@master01 ~]# milkcheck step0 -n compute01
[07:12:25] ERROR    - Unexpected Exception : No route available to compute01

After applying the patch, all tests are fine. Thank you.

MarbolanGos commented Apr 25 '25

Thanks much. Let me explain how I think it works at the moment, for reference.

A DistantWorker + PropagationChannel event handler is used for each gateway from the "initiator" (the head node in the tree). The DistantWorker is likely a WorkerSsh here. I call them channels here. These channels are managed by a Task and are then used by TreeWorker instances to execute commands. For example, TreeWorker._execute_remote() calls this to run commands:

        pchan = self.task._pchannel(gateway, self)
        pchan.shell()

Anyway, when these actual commands finish, TreeWorker._check_fini() is called, which calls Task._pchannel_release() when there are no active targets left for this specific gateway from the TreeWorker's point of view. The TreeWorker calling this doesn't know whether the gateway is still used by other workers of the Task, which is why the last part is done at the Task level in Task._pchannel_release().

In Task._pchannel_release(), we update the status for the calling worker and then check whether the gateway is still used by any worker. If the gateway no longer has any purpose, Task._pchannel_release() calls chanworker.abort(), chanworker being the DistantWorker + PropagationChannel.

PropagationChannel.ev_close() is called when the gateway channel is aborted (at chanworker.abort()) and also when a gateway error occurs (termination initiated by the other side). Here, on abort, because the termination is initiated on our side (not by the remote gateway process), there is no ev_hup generated and rc stays at None. rc is the return code of the command (in that case, the gateway itself), but here we don’t have it as we are the one aborting it, so I believe it makes sense to not raise a fake ev_hup and keep None in that case. This has been the current logic but let me know if you have a better idea.

So what's the role of PropagationChannel.ev_close() here? Well, it is mainly to handle gateway errors. If we get an actual rc > 0, that means the gateway is defective/misconfigured; in that case, we mark it as unreachable at the Task level. In addition to that, if we have not launched the remote commands yet, they are re-distributed to other available gateways (if any).

Now, I don't remember why I decided to treat rc=None as an error. It's probably a defect. I'm proposing a PR to fix this. While I believe I now have a good understanding of how it works, there is one use case I'm not 100% sure about: what if the gateway is working but we suddenly receive some garbage from it that we can't parse, and we then decide to abort it? In that case, we won't see it as an error (rc will be None), and we won't be able to distinguish this from a normal gateway abort when it has no purpose anymore. If this happens, we won't mark the gateway as unreachable. It's probably not a big deal, but I wanted to mention it so we can think about it a bit more. Maybe an additional flag is needed to keep track of a normal/abnormal close.
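
To illustrate that "additional flag" idea, here is a minimal standalone sketch (the local_abort flag and the callback wiring are assumptions for illustration, not the real PropagationChannel code):

# Sketch: remember whether *we* initiated the channel teardown, so that
# rc=None on close can be told apart from an unexpected termination.
class ChannelCloseTracker:
    def __init__(self, mark_unreachable):
        self.rc = None                  # gateway return code, may stay None
        self.local_abort = False        # hypothetical flag set before a deliberate abort
        self.mark_unreachable = mark_unreachable

    def begin_local_abort(self):
        # called right before the initiator aborts a gateway it no longer needs
        self.local_abort = True

    def on_close(self, gateway):
        # error if the gateway reported rc > 0, or if it vanished (rc None)
        # without us having started the teardown ourselves
        if (self.rc not in (0, None)) or (self.rc is None and not self.local_abort):
            self.mark_unreachable(gateway)

# usage: a normal release flags the local abort first, so rc=None is benign
tracker = ChannelCloseTracker(mark_unreachable=lambda gw: print("unreachable:", gw))
tracker.begin_local_abort()
tracker.on_close("vm1")   # nothing printed: this was a deliberate, clean teardown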

thiell commented Aug 04 '25