ceph-salt
Detect network flakiness before applying the Salt Formula
`ceph-salt apply` is known to fail in odd ways when run in an environment with poor network connectivity. These failures can be especially vexing if the network connections are flaky, i.e., they succeed on some attempts and fail on others. In such cases, a user might reasonably think that the failure is due to a bug in ceph-salt.
For example:
- when an external time server is configured, and connectivity with that external time server is flaky, `ceph-salt apply` can fail.
- when the container image path points to a remote registry, and connectivity with that registry is flaky, `ceph-salt apply` can fail.
- when ceph-salt attempts to use `zypper` to install packages on nodes, and connectivity with remote zypper repos is flaky, `ceph-salt apply` can fail.
It would be nice if we could detect network flakiness before starting to apply the Salt Formula. The purpose of this ticket is to collect ideas for how to do that.
Idea: ping a remote server for a short time (e.g., 30 seconds) and measure packet loss.
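As a rough illustration of the ping idea, here is a minimal Python sketch. It assumes a Linux `ping` (iputils) whose summary line contains `N% packet loss`; the function names `parse_packet_loss` and `measure_packet_loss` are made up for this example and are not part of ceph-salt.

```python
import os
import re
import subprocess

# Matches the loss figure in the summary line printed by iputils ping, e.g.
# "30 packets transmitted, 29 received, 3.33333% packet loss, time 29042ms"
_LOSS_RE = re.compile(r"(\d+(?:\.\d+)?)% packet loss")


def parse_packet_loss(ping_output):
    """Return the packet-loss percentage found in ping output, or None."""
    match = _LOSS_RE.search(ping_output)
    return float(match.group(1)) if match else None


def measure_packet_loss(host, seconds=30):
    """Send one ping per second for `seconds` seconds (hypothetical helper).

    Returns the loss percentage, or None if ping is unavailable, times out,
    or prints no recognizable summary line.
    """
    try:
        result = subprocess.run(
            ["ping", "-c", str(seconds), "-i", "1", host],
            capture_output=True,
            text=True,
            check=False,  # ping exits non-zero on loss; we still want stdout
            timeout=seconds + 15,
            env=dict(os.environ, LC_ALL="C"),  # force untranslated output
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return None
    return parse_packet_loss(result.stdout)
```

Any non-zero result could then be reported as a warning before the formula is applied. What loss percentage should count as "flaky" is left open here, as is the fact that some servers drop ICMP traffic entirely.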
CAVEAT: it is possible to configure ceph-salt in such a way that it does not initiate any communication with remote servers:
- time server is local
- container image path points to local registry
- zypper repos are local
In such a case, it would be wrong to try to ping a remote server. So this test should first check how the environment is configured and only ping remote servers if "communication with remote servers" is detected in the configuration.
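One possible shape for that configuration check, as a sketch only: collect the hosts behind the time server, the container image path, and the zypper repo URLs, and drop any that are known to be local. The function name `hosts_to_probe`, its parameters, and the idea of passing in an explicit `local_hosts` set are all assumptions made for illustration; how ceph-salt would actually decide what counts as "local" is not settled in this ticket. The sketch also assumes the image path starts with a registry host.

```python
from urllib.parse import urlparse


def hosts_to_probe(time_server, image_path, repo_urls, local_hosts):
    """Return the distinct non-local hosts the configuration would contact.

    time_server: hostname of the configured time server, or None
    image_path:  container image path such as "registry.example.com:5000/ns/img"
    repo_urls:   iterable of zypper repository URLs
    local_hosts: hostnames considered local (hypothetical input)
    """
    candidates = []
    if time_server:
        candidates.append(time_server)
    if image_path:
        # Prefixing "//" lets urlparse treat the leading component as host[:port].
        candidates.append(urlparse("//" + image_path).hostname)
    for url in repo_urls:
        # Repos without a network host (e.g. "dir:///mnt/repo") yield None.
        candidates.append(urlparse(url).hostname)

    local = {host.lower() for host in local_hosts}
    remote = []
    for host in candidates:
        if host and host.lower() not in local and host.lower() not in remote:
            remote.append(host.lower())
    return remote
```

An empty result would mean that no remote communication was detected, so the connectivity test should be skipped entirely.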