clustershell
Poor performance in tree mode in an environment with over 1000 nodes
We have an environment with 1000+ nodes whose hostnames cannot be folded. Running the "whoami" command with tree mode enabled (using two gateways) takes 17 minutes, while with tree mode disabled it takes only one minute. Tree mode is excessively slow here. How can we address this issue? In another environment with 1000+ nodes where hostnames can be folded, running "whoami" with tree mode enabled takes 40 seconds. Our investigation shows that the main cause of the delay is clush releasing nodes slowly: it takes 0.7 seconds per node on average, and the nodes are released sequentially.
The code below executes slowly:
def _on_remote_node_close(self, node, rc, gateway):
    ...
    self.gwtargets[str(gateway)].remove(node)
    self._close_count += 1
    self._check_fini(gateway)
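
To narrow down which statement in that handler dominates, each step could be timed separately. The following is only a minimal sketch: `FakeWorker`, `_remove_target`, the `timed` decorator, and the made-up hostnames are hypothetical stand-ins, and a plain Python `set` replaces whatever container `gwtargets` really holds, so the numbers it prints say nothing about ClusterShell itself. The same decorator could be applied to the real methods to collect comparable figures.

```python
import functools
import time
from collections import defaultdict

# Accumulated wall-clock time and call counts, keyed by function name.
_totals = defaultdict(float)
_calls = defaultdict(int)


def timed(func):
    """Wrap func so that every call adds its elapsed time to _totals."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            _totals[func.__name__] += time.perf_counter() - start
            _calls[func.__name__] += 1
    return wrapper


def report():
    """Return {name: (calls, total_seconds, mean_seconds)}."""
    return {
        name: (_calls[name], _totals[name], _totals[name] / _calls[name])
        for name in _totals
    }


class FakeWorker:
    """Hypothetical stand-in for the worker that owns the close handler."""

    def __init__(self, nodes, gateway):
        # Assumption: a plain set replaces the real target container.
        self.gwtargets = {str(gateway): set(nodes)}
        self._close_count = 0

    @timed
    def _remove_target(self, node, gateway):
        self.gwtargets[str(gateway)].remove(node)

    @timed
    def _check_fini(self, gateway):
        # Placeholder check: True once no targets remain for this gateway.
        return not self.gwtargets[str(gateway)]

    def _on_remote_node_close(self, node, rc, gateway):
        self._remove_target(node, gateway)
        self._close_count += 1
        self._check_fini(gateway)


# Made-up, non-foldable-looking hostnames; 1177 matches the reported node count.
nodes = [f"host-{i:04d}-x{(i * 7919) % 1000}" for i in range(1177)]
worker = FakeWorker(nodes, "gw1")
for node in nodes:
    worker._on_remote_node_close(node, 0, "gw1")

stats = report()
for name, (calls, total, mean) in stats.items():
    print(f"{name}: {calls} calls, {total:.6f}s total, {mean:.9f}s mean")
```

If one step accounts for nearly all of the per-node time on the real code path, that step is the place to look for the sequential bottleneck.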
| Number of nodes | Command | Time with tree mode | Time with tree mode disabled |
| --- | --- | --- | --- |
| 1177 | whoami | 17min9sec | 60sec |
| 1 | whoami | 5sec | 0.7sec |
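
As a rough sanity check on these figures, the reported 0.7 seconds per node can be multiplied out over the 1177 nodes. This is only back-of-the-envelope arithmetic, assuming the average applies uniformly and that releases never overlap; the variable names are ours.

```python
# Figures taken from the report above.
nodes = 1177
release_per_node_s = 0.7        # reported average release time per node
tree_total_s = 17 * 60 + 9      # 17min9sec with tree mode enabled
flat_total_s = 60               # 60sec with tree mode disabled

# Total time spent releasing nodes if releases are strictly sequential.
release_total_s = nodes * release_per_node_s

# Fraction of the tree mode run explained by sequential release alone.
share = release_total_s / tree_total_s

# Overall slowdown of tree mode relative to the run without it.
slowdown = tree_total_s / flat_total_s

print(f"sequential release: {release_total_s:.1f}s ({release_total_s / 60:.1f} min)")
print(f"share of tree mode run: {share:.0%}")
print(f"slowdown vs. no tree mode: {slowdown:.1f}x")
```

Under those assumptions, sequential release comes to about 824 seconds, roughly 80% of the 1029-second tree mode run, which is consistent with node release being the main cause of the delay.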