
Detect network flakiness before applying the Salt Formula

Open • smithfarm opened this issue 3 years ago • 1 comment

ceph-salt apply is known to fail in odd ways when running in an environment with poor network connectivity. These failures can be especially vexing if the network connections are flaky - i.e., they succeed on some attempts and fail on others. In such cases, a user might reasonably think that the failure is due to a bug in ceph-salt.

For example:

  • when an external time server is configured, and connectivity with that external time server is flaky, ceph-salt apply can fail.
  • when the container image path points to a remote registry, and connectivity with that registry is flaky, ceph-salt apply can fail.
  • when ceph-salt attempts to use zypper to install packages on nodes, and connectivity with remote zypper repos is flaky, ceph-salt apply can fail.

It would be nice if we could detect network flakiness before starting to apply the Salt Formula. The purpose of this ticket is to collect ideas for how to do that.

smithfarm, Sep 12 '20 07:09

Idea: ping a remote server for a short time (e.g., 30 seconds) and measure packet loss.
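A minimal sketch of the ping idea, assuming a Linux `ping` (iputils) whose summary line contains the text "N% packet loss". The function names `parse_packet_loss` and `measure_packet_loss` are hypothetical and not part of ceph-salt; sending 30 packets at the default one-second interval approximates the 30-second window mentioned above.

```python
import re
import subprocess


def parse_packet_loss(ping_output):
    """Extract the packet-loss percentage from ping's summary line.

    Returns a float, or None if no summary line is present
    (for example, when the hostname could not be resolved).
    """
    match = re.search(r"(\d+(?:\.\d+)?)% packet loss", ping_output)
    return float(match.group(1)) if match else None


def measure_packet_loss(host, count=30):
    """Ping `host` `count` times and return the packet-loss percentage.

    Assumption: `ping -c COUNT HOST` is available, as on typical Linux systems.
    """
    result = subprocess.run(
        ["ping", "-c", str(count), host],
        capture_output=True,
        text=True,
        check=False,  # ping exits non-zero on packet loss; we still want the output
    )
    return parse_packet_loss(result.stdout)
```

A caller could then compare the result against a threshold of its choosing and warn the user before applying the formula. Note that some networks drop ICMP entirely, so a high loss figure from ping does not always mean that NTP, registry, or repository traffic would fail.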

CAVEAT: it is possible to configure ceph-salt in such a way that it does not initiate any communication with remote servers:

  • time server is local
  • container image path points to local registry
  • zypper repos are local

In such a case, it would be wrong to try to ping a remote server. So this test should first check how the environment is configured and only ping remote servers if "communication with remote servers" is detected in the configuration.
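One possible way to sketch that configuration check is shown here. Everything in it is an assumption made for illustration: the function names are hypothetical, the three inputs are passed as plain values rather than read from ceph-salt's real configuration, and "local" is approximated as loopback or private IP addresses plus a site-specific list of hostname suffixes.

```python
import ipaddress
from urllib.parse import urlparse


def registry_host(image_path):
    """Return the registry host of a container image path, or None.

    Follows the common convention that the first path component is a
    registry only if it looks like a host ("." or ":" in it, or "localhost").
    """
    first, sep, _ = image_path.partition("/")
    if sep and ("." in first or ":" in first or first == "localhost"):
        return first.split(":")[0]  # drop an optional port
    return None


def is_local(host, local_suffixes):
    """Heuristic: loopback/private IPs and hosts with a local suffix are local."""
    try:
        addr = ipaddress.ip_address(host)
        return addr.is_private or addr.is_loopback
    except ValueError:
        # Not an IP literal, so fall back to a name-based check.
        return host == "localhost" or host.endswith(tuple(local_suffixes))


def hosts_to_probe(time_server, image_path, repo_urls,
                   local_suffixes=(".local", ".lan")):
    """Return the configured hosts that look remote, in order, without duplicates."""
    candidates = [time_server, registry_host(image_path)]
    candidates += [urlparse(url).hostname for url in repo_urls]
    remote = []
    for host in candidates:
        if host and not is_local(host, local_suffixes) and host not in remote:
            remote.append(host)
    return remote
```

If `hosts_to_probe` returns an empty list, the connectivity test can be skipped entirely; otherwise only the returned hosts need to be pinged. The suffix list in particular would have to be adapted per site, since a hostname alone does not reveal whether a server is on the local network.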

smithfarm, Sep 12 '20 14:09