
ping latency checker reports failure before collecting enough data points on a new node

Open. helgi opened this issue 3 years ago • 1 comment

Description

What happened:

When spinning up new nodes, the ping latency check triggers from time to time (depending on the AWS region), and the failure sticks around for a while because of how the sampling is done and because the measurement threshold is low:

[!] ping between 10_221_0_8.opscenter and 10_221_2_227.opscenter is higher than the allowed threshold of 15ms (ping latency at 46.497791ms)

For a while, gravity status -s 10 kept showing me 46.497791ms, as if no new data was being pulled, when in fact 46.497791ms was simply the worst sample still in the set.

What you expected to happen:

New nodes should not be subject to the ping check right away; the check should collect enough data before triggering.

From @r0mant: we could make the check not kick in until it has enough data points, but I’m now also thinking we may need to tweak the algorithm parameters a little, because a 95th percentile over 20 samples basically means it’ll trigger if just 2 out of 20 measurements are higher than 15ms, and it will stay stuck in this state until one of them falls out of the sliding window.
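To make the proposal concrete, here is a minimal Python sketch of a sliding-window check that stays quiet until it has enough samples. It is not Gravity's actual implementation: the class name LatencyWindow, its parameters, and the nearest-rank percentile method are all assumptions chosen for illustration. Only the window of 20, the 95th percentile, and the 15ms threshold come from the discussion above.

```python
import math
from collections import deque


class LatencyWindow:
    """Hypothetical sliding window of ping latencies (ms) for one node pair."""

    def __init__(self, size=20, min_samples=20, percentile=95, threshold_ms=15.0):
        # deque with maxlen drops the oldest sample once the window is full.
        self.samples = deque(maxlen=size)
        self.min_samples = min_samples
        self.percentile = percentile
        self.threshold_ms = threshold_ms

    def add(self, latency_ms):
        self.samples.append(latency_ms)

    def percentile_value(self):
        """Nearest-rank percentile of the current window, or None if empty."""
        if not self.samples:
            return None
        ordered = sorted(self.samples)
        rank = max(1, math.ceil(self.percentile * len(ordered) / 100))
        return ordered[rank - 1]

    def failing(self):
        # Assumed gate: report nothing until enough data points are collected.
        if len(self.samples) < self.min_samples:
            return False
        return self.percentile_value() > self.threshold_ms
```

With these assumed defaults, a single slow ping on a fresh node does not fail the check, and once the window is full one slow sample out of 20 is tolerated while two are not.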

Environment

  • Gravity version [e.g. 7.0.11]: 7.0.15
  • OS [e.g. Redhat 7.4]: CentOS 8.2
  • Platform [e.g. Vmware, AWS]: AWS

helgi, Aug 27 '20 03:08

While Roman's comment makes perfect sense to me, there's a key point that we should all agree on: which percentile makes sense for this particular check.

I'll set it to the 75th percentile for now, but it can definitely be adjusted as needed before the PR is merged.

Maybe it should be customizable too?
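As a rough aid for picking the percentile, the small Python helper below computes how many samples in a full window must exceed the threshold before the check trips. It assumes a nearest-rank percentile, which may differ from what the real check does, and the function name slow_samples_to_trip is hypothetical.

```python
import math


def slow_samples_to_trip(window_size, percentile):
    """Number of over-threshold samples needed to trip a nearest-rank check."""
    # Nearest rank: the percentile is the rank-th smallest sample (1-based).
    rank = max(1, math.ceil(percentile * window_size / 100))
    # That sample is over the threshold only when every sample from
    # position rank up to window_size is over it as well.
    return window_size - rank + 1


for p in (95, 75, 50):
    print(p, slow_samples_to_trip(20, p))
```

Under that assumption, a window of 20 trips on 2 slow samples at the 95th percentile and on 6 at the 75th.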

eldios, Dec 28 '20 22:12