
Checking DeepseaMinions has 45 second lag if _any_ nodes are down

Open: tserong opened this issue on Oct 03, 2018

Description of Issue/Question

Discovered while working on #1403. DeepseaMinions runs a pillar refresh on all nodes, then `pillar.get deepsea_minions` on all nodes, then `pillar.get id` on all nodes. These operations run one after another and usually complete within 5 seconds in my testing. However, if one or more nodes are down, each operation waits out its 15 second timeout, so the whole sequence takes close to 50 seconds to complete.
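
To illustrate, here is a minimal sketch (not the actual DeepSea code) of the call pattern described above, using Salt's LocalClient. Each `cmd()` call targets all minions and blocks until every target responds or the timeout expires, so with even one minion down the three calls serialize into roughly 3 x 15 seconds:

```python
import salt.client

local = salt.client.LocalClient()
TIMEOUT = 15  # seconds; the timeout observed in testing

# 1. refresh the pillar on every minion
local.cmd('*', 'saltutil.refresh_pillar', timeout=TIMEOUT)

# 2. read deepsea_minions from every minion's pillar
local.cmd('*', 'pillar.get', ['deepsea_minions'], timeout=TIMEOUT)

# 3. read the minion id from every minion's pillar
local.cmd('*', 'pillar.get', ['id'], timeout=TIMEOUT)
```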

Why the pillar refresh? I assume it's to ensure the value in /srv/pillar/ceph/deepsea_minions.sls is up to date. That seems fine, but why refresh the pillar on all nodes, and why run `pillar.get deepsea_minions` on all nodes? Why not just run the pillar refresh and `pillar.get deepsea_minions` on the master minion? That minion is always up by definition, since runners are invoked on the master. We'd then be down to a 15 second delay when nodes are down, caused by the `pillar.get id` call timing out. That could be improved further by not setting self.matches in __init__ and instead having callers invoke the _matches() function directly, so the query is only performed when its result is actually needed (I'm pretty sure that's only the validate runner, nowhere else).
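
A rough sketch of what that change could look like; the class and method names here are illustrative, not DeepSea's actual implementation, and it assumes the master minion id is already known to the caller and that `deepsea_minions` holds a compound-style target expression:

```python
import salt.client


class DeepseaMinions(object):
    """Illustrative only: resolve deepsea_minions without fanning out to all nodes."""

    def __init__(self, master_minion):
        # master_minion is the minion id of the Salt master; how DeepSea
        # resolves it is outside the scope of this sketch.
        self.local = salt.client.LocalClient()
        self.master_minion = master_minion

        # Refresh and read the pillar on the master minion only; it is
        # always reachable because runners execute on the master.
        self.local.cmd(master_minion, 'saltutil.refresh_pillar', timeout=15)
        result = self.local.cmd(master_minion, 'pillar.get',
                                ['deepsea_minions'], timeout=15)
        self.deepsea_minions = result.get(master_minion, '')

    def matches(self):
        # Deferred: only callers that need the matched minion list (e.g. the
        # validate runner) pay for this query, and only then can an
        # unreachable minion cause a 15 second timeout.
        result = self.local.cmd(self.deepsea_minions, 'pillar.get', ['id'],
                                tgt_type='compound', timeout=15)
        return sorted(result.keys())
```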

Steps to Reproduce Issue

  • Stop the salt minion process on one or more nodes
  • Run `time salt-run deepsea_minions.show` on the master

Versions Report

salt 2018.3.0
deepsea master branch (the same will be true on the SES5 branch)
