incubator-pegasus icon indicating copy to clipboard operation
incubator-pegasus copied to clipboard

Add a gauge for the duration since the meta server has received the last beacon

Open empiredan opened this issue 2 years ago • 0 comments

Background

Recently a cluster on production environment was found that primary meta server had frequently disconnected the replica servers, for the reason that the duration since the last beacon from each replica server had been received by the primary meta server was often greater than the grace period (70+ seconds vs. 22 seconds).

The network latency is typically several hundreds of microseconds, which means something must have been wrong for this cluster. After trouble shooting for the root cause, it was found that there are 2 different NTP servers A and B in the configuration. A is slower than B by more than one minute. For example, meta server received a beacon at 12:05:25 from A; then the clocks jumped suddenly to 12:06:35; the meta server found that it has passed far more than the grace period, then disconnected the corresponding replica server.

Implementation

The duration since the meta server has received the last beacon for each replica server can be added as a gauge, to find the exception in the system faster.

empiredan avatar Aug 04 '22 07:08 empiredan