incubator-pegasus
incubator-pegasus copied to clipboard
Add a gauge for the duration since the meta server has received the last beacon
Background
Recently a cluster on production environment was found that primary meta server had frequently disconnected the replica servers, for the reason that the duration since the last beacon from each replica server had been received by the primary meta server was often greater than the grace period (70+ seconds vs. 22 seconds).
The network latency is typically several hundreds of microseconds, which means something must have been wrong for this cluster. After trouble shooting for the root cause, it was found that there are 2 different NTP servers A and B in the configuration. A is slower than B by more than one minute. For example, meta server received a beacon at 12:05:25 from A; then the clocks jumped suddenly to 12:06:35; the meta server found that it has passed far more than the grace period, then disconnected the corresponding replica server.
Implementation
The duration since the meta server has received the last beacon for each replica server can be added as a gauge, to find the exception in the system faster.