
Bad performance in tree mode with over 1000 nodes

Open luxiaoyong opened this issue 1 year ago • 9 comments

We have an environment with 1000+ nodes whose hostnames cannot be folded. Running the "whoami" command with tree mode enabled (using two gateways) takes 17 minutes, while with tree mode disabled it takes only one minute. Tree mode is excessively slow here. How can we address this issue? In another 1000+ node environment where hostnames can be folded, running "whoami" with tree mode enabled takes 40 seconds. Our investigation shows that the main cause of the delay is that clush releases nodes slowly, at an average of 0.7 seconds per node, and that this happens sequentially.

The code below executes slowly:

def _on_remote_node_close(self, node, rc, gateway):
        ...
        self.gwtargets[str(gateway)].remove(node)
        self._close_count += 1
        self._check_fini(gateway)
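One way to confirm where the time goes is to wrap the handler and accumulate wall-clock time per call. The sketch below is only an illustration: `StandInTracker`, `timed`, and `call_stats` are hypothetical names, the stand-in uses a plain Python `set` instead of the real ClusterShell data structure, and `_check_fini` is left out. It shows the measuring technique, not the actual ClusterShell behavior.

```python
import functools
import time
from collections import defaultdict

# name -> [number of calls, total seconds spent]
call_stats = defaultdict(lambda: [0, 0.0])


def timed(func):
    """Accumulate call count and wall-clock time for func in call_stats."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            entry = call_stats[func.__name__]
            entry[0] += 1
            entry[1] += time.perf_counter() - start
    return wrapper


class StandInTracker:
    """Hypothetical stand-in for illustration, NOT the ClusterShell class."""

    def __init__(self, gateway, nodes):
        # Assumption: a plain set replaces the real per-gateway target container.
        self.gwtargets = {gateway: set(nodes)}
        self._close_count = 0

    @timed
    def _on_remote_node_close(self, node, rc, gateway):
        self.gwtargets[str(gateway)].remove(node)
        self._close_count += 1


names = ["node%d" % i for i in range(1000)]
tracker = StandInTracker("gw1", names)
for name in names:
    tracker._on_remote_node_close(name, 0, "gw1")

calls, total = call_stats["_on_remote_node_close"]
print("%d calls, %.6f s total, %.9f s per call" % (calls, total, total / calls))
```

Applying the same kind of wrapper to the real method would show whether the per-node cost sits in the `remove()` call or in `_check_fini()`.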

| Number of nodes | Command | Time with tree mode | Time without tree mode |
| --- | --- | --- | --- |
| 1177 | whoami | 17min9sec | 60sec |
| 1 | whoami | 5sec | 0.7sec |
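As a rough sanity check, the figures reported above can be combined: if nodes are released one after another at about 0.7 seconds each, 1177 nodes need roughly 824 seconds, which is about 80% of the 17min9sec total. The sketch below only redoes that arithmetic with the numbers from this report; the variable and function names are illustrative.

```python
# Back-of-the-envelope check using only figures quoted in this report.
NODES = 1177                      # node count from the measurements
PER_NODE_RELEASE_S = 0.7          # reported average release time per node
TREE_MODE_TOTAL_S = 17 * 60 + 9   # 17min9sec with tree mode enabled


def sequential_release_time(nodes, per_node_s):
    """Total time if every node is released one after another."""
    return nodes * per_node_s


release_s = sequential_release_time(NODES, PER_NODE_RELEASE_S)
share = release_s / TREE_MODE_TOTAL_S

print("estimated sequential release time: %.1f s" % release_s)
print("share of the tree mode run: %.0f%%" % (share * 100))
```

This is consistent with sequential node release accounting for most of the tree mode run time.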

luxiaoyong, Apr 30 '24 06:04