chrony_tracking_root_dispersion_seconds probably doesn't belong in recommended error metrics (from the README)
chrony_tracking_root_dispersion_seconds tracks the root dispersion, which on a default Chrony instance can have wildly inaccurate values. The root dispersion includes an assumption of at least 1 PPM in local clock drift (see the Chrony FAQ). Since Chrony defaults to polling NTP servers every 2^10 seconds (1024s, or ~17 minutes), that 1 PPM drift appears as a sawtooth wave in chrony_tracking_root_dispersion_seconds ranging from ~0 (at poll time) to 1.024ms (just before the next poll).
That's a lot of error if you have a local NTP source that you're syncing to.
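To make the arithmetic behind that sawtooth concrete, here is a minimal sketch. It assumes an idealized model in which dispersion grows linearly at exactly 1 PPM between polls and drops back to ~0 at each poll; real Chrony adds other terms, and the function and constant names are made up for illustration.

```python
# Idealized model (assumption): dispersion grows linearly at the assumed
# drift rate between polls and resets to ~0 at each poll.
DRIFT_PPM = 1.0            # minimum local clock drift assumed by Chrony, per the FAQ
POLL_INTERVAL_S = 2 ** 10  # default maximum poll interval: 1024 s


def dispersion_growth_seconds(seconds_since_poll, drift_ppm=DRIFT_PPM):
    """Dispersion accumulated since the last poll, in seconds."""
    return seconds_since_poll * drift_ppm * 1e-6


# Peak of the sawtooth, just before the next poll.
peak = dispersion_growth_seconds(POLL_INTERVAL_S)
print(f"peak dispersion growth: {peak * 1e3:.3f} ms")
```

Under that model, halving the poll interval halves the peak, which is why the sawtooth amplitude follows the polling config rather than how well the clock is actually tracking.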
I've been trying to improve local NTP accuracy, and root dispersion really threw me for a loop. Here's how the 3 factors listed in the README performed with a default config over 12h:
Using the formula from the README gave a P99 error of about 1ms for this system.
Looking at just the last-offset time gave a totally different view:
(The server from the first graph is purple in this one.) It shows a delta of less than 30 microseconds every time Chrony adjusts its clock, versus the ~1ms error implied by the combined metric.
Chrony's own accuracy examples use a variant of last-update for their metrics, while also comparing against a second clock. Where does the recommendation for root-delay + root-dispersion + last-offset actually come from? I can't find it by searching the Chrony man pages at the link provided.
Background: I'm trying to get my clocks accurate enough that time skew isn't a problem for distributed traces. 1ms of error is too much; 10 microseconds is perfectly fine. I spent quite a bit of time chasing errors and really just ended up learning a whole lot about root dispersion metrics. Which is fine, but I'd like to save others from making the same trip.
PRs welcome.
FWIW, the formula comes from the description of the chronyc tracking command output on the chronyc man page:
An absolute bound on the computer’s clock accuracy (assuming the stratum-1 computer is correct) is given by:
clock_error <= |system_time_offset| + root_dispersion + (0.5 * root_delay)
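As a rough illustration of how the three terms combine, here is a small sketch of that bound. The function name and the sample values are hypothetical, chosen only to resemble the magnitudes discussed in this issue; they are not measurements.

```python
def clock_error_bound(system_time_offset, root_dispersion, root_delay):
    """Upper bound on clock error in seconds, per the chronyc man page formula."""
    return abs(system_time_offset) + root_dispersion + 0.5 * root_delay


# Hypothetical values in seconds (assumptions, not measured data):
offset = -30e-6        # 30 microsecond offset; the sign is dropped by abs()
dispersion = 1.024e-3  # near the top of the 1 PPM sawtooth at a 1024 s poll
delay = 200e-6         # round-trip delay to the reference

bound = clock_error_bound(offset, dispersion, delay)
print(f"bound: {bound * 1e3:.3f} ms")
```

With numbers like these the dispersion term dominates the sum, which matches what the graphs in the issue show: the bound is a worst case, not an estimate of the typical error.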
https://github.com/SuperQ/chrony_exporter/pull/118