chrony_tracking_root_dispersion_seconds probably doesn't belong in recommended error metrics (from the README)

Open scottlaird opened this issue 9 months ago • 2 comments

chrony_tracking_root_dispersion_seconds tracks the root dispersion, which on a default Chrony instance can have wildly inaccurate values. The root dispersion includes an assumption of at least 1 PPM in local clock drift (see the Chrony FAQ). Since Chrony defaults to polling NTP servers every 2^10 seconds (1024s, or ~17 minutes), that 1 PPM drift appears as a sawtooth wave in chrony_tracking_root_dispersion_seconds ranging from ~0 (at poll time) to 1.024ms (just before the next poll).
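To make the arithmetic behind that sawtooth concrete, here is a minimal sketch. It assumes the dispersion simply grows linearly at the assumed drift rate between polls and resets to roughly zero at each poll, which is a simplification of what Chrony actually does. The constant and function names are made up for illustration and are not part of Chrony or chrony_exporter.

```python
# Hypothetical model: dispersion added since the last poll grows linearly
# at the assumed drift rate. This is a simplification, not Chrony's code.

ASSUMED_DRIFT_PPM = 1.0   # the 1 PPM figure mentioned above
POLL_EXPONENT = 10        # default maximum poll interval is 2^10 seconds


def dispersion_growth_seconds(seconds_since_poll, drift_ppm=ASSUMED_DRIFT_PPM):
    """Dispersion accumulated since the last poll, in seconds."""
    return drift_ppm * 1e-6 * seconds_since_poll


poll_interval = 2 ** POLL_EXPONENT               # seconds between polls
peak = dispersion_growth_seconds(poll_interval)  # value just before the next poll

print(f"poll interval: {poll_interval} s (~{poll_interval / 60:.0f} min)")
print(f"peak added dispersion: {peak * 1e3:.3f} ms")
```

Under this model the peak depends only on the assumed drift rate and the poll interval, so a shorter poll interval shrinks the sawtooth proportionally.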

That's a lot of error if you have a local NTP source that you're syncing to.

I've been trying to improve local NTP accuracy, and root dispersion really threw me for a loop. Here's how the 3 factors listed in the README performed with a default config over 12h:

Image

Using the formula from the README gave a P99 error of about 1 ms for this system.

Looking at just the last-offset time gave a totally different view:

Image

(The server from the first graph is purple in this one.) It shows a delta of less than 30 microseconds every time Chrony adjusts its clock, versus the 1 ms error implied by the larger metric.

Chrony's own accuracy examples use a variant of last-update for their metrics, while also comparing against a second clock. Where does the recommendation for root-delay + root-dispersion + last-offset actually come from? I can't find it by searching the Chrony manpages at the link provided.

Background: I'm trying to get my clocks accurate enough for time skew not to be a problem with distributed traces. 1 ms of error is too much. 10 microseconds is perfectly fine. I spent quite a bit of time chasing errors and really just ended up learning a whole lot about root dispersion metrics. Which is fine, but I'd like to save others from making the same trip.

scottlaird, Apr 10 '25 23:04

PRs welcome.

SuperQ, Apr 11 '25 05:04

FWIW, the formula comes from the description of the chronyc tracking command output on the chronyc man page:

An absolute bound on the computer’s clock accuracy (assuming the stratum-1 computer is correct) is given by: clock_error <= |system_time_offset| + root_dispersion + (0.5 * root_delay)
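As a rough illustration of how the terms in that bound weigh against each other, here is a small sketch. The function name and the sample values are hypothetical, chosen only to resemble the magnitudes discussed in this issue. They are not measurements from any real system.

```python
def clock_error_bound(system_time_offset, root_dispersion, root_delay):
    """Evaluate |system_time_offset| + root_dispersion + 0.5 * root_delay.

    All arguments and the result are in seconds.
    """
    return abs(system_time_offset) + root_dispersion + 0.5 * root_delay


# Hypothetical sample values (seconds), not real measurements:
offset = -20e-6       # 20 microseconds behind
dispersion = 1.0e-3   # 1 ms, near the top of the sawtooth
delay = 200e-6        # 200 microsecond round trip to the reference

bound = clock_error_bound(offset, dispersion, delay)
print(f"bound: {bound * 1e3:.3f} ms")
print(f"share from root dispersion: {dispersion / bound:.0%}")
```

With inputs like these, root dispersion accounts for most of the bound, which matches the gap between the two graphs above.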

https://github.com/SuperQ/chrony_exporter/pull/118

raphaelthomas, Apr 30 '25 20:04