Socket: i/o timeout and bind: already in use
I switched to running the exporter in the same container as the chronyd server. I'm scraping it every 60s.
What I am seeing:
time=2024-11-04T18:43:46.590Z level=DEBUG source=collector.go:149 msg="Scrape completed" scrape_id=40 seconds=0.004325332
time=2024-11-04T18:44:21.389Z level=DEBUG source=collector.go:146 msg="Scrape starting" scrape_id=41
time=2024-11-04T18:44:21.389Z level=DEBUG source=collector.go:146 msg="Scrape starting" scrape_id=42
time=2024-11-04T18:44:21.389Z level=DEBUG source=collector.go:155 msg="Couldn't connect to chrony" scrape_id=42 address=unix:///run/chrony/chronyd.sock err="dial unixgram /run/chrony/chrony_exporter.7.sock->/run/chrony/chronyd.sock: bind: address already in use"
time=2024-11-04T18:44:21.389Z level=DEBUG source=collector.go:149 msg="Scrape completed" scrape_id=42 seconds=9.6575e-05
time=2024-11-04T18:44:21.389Z level=DEBUG source=tracking.go:176 msg="Got 'tracking' response" scrape_id=41 tracking_packet=SUCCESS
time=2024-11-04T18:44:21.390Z level=DEBUG source=collector.go:191 msg="DNS lookup took" scrape_id=41 seconds=0.000999318
time=2024-11-04T18:44:21.390Z level=DEBUG source=tracking.go:186 msg="Tracking Last Offset" scrape_id=41 offset=0.00017267093062400818
time=2024-11-04T18:44:21.390Z level=DEBUG source=tracking.go:189 msg="Tracking Ref Time" scrape_id=41 ref_time=2024-11-04T18:43:30.632Z
time=2024-11-04T18:44:21.390Z level=DEBUG source=tracking.go:192 msg="Tracking System Time" scrape_id=41 system_time=-0.0014299183385446668
time=2024-11-04T18:44:21.390Z level=DEBUG source=tracking.go:199 msg="Tracking is remote" scrape_id=41 bool_value=1
time=2024-11-04T18:44:21.390Z level=DEBUG source=tracking.go:202 msg="Tracking RMS Offset" scrape_id=41 rms_offset=0.0017726216465234756
time=2024-11-04T18:44:21.390Z level=DEBUG source=tracking.go:205 msg="Tracking Root delay" scrape_id=41 root_delay=0.0168450977653265
time=2024-11-04T18:44:21.390Z level=DEBUG source=tracking.go:208 msg="Tracking Root dispersion" scrape_id=41 root_dispersion=0.0026964773423969746
time=2024-11-04T18:44:21.390Z level=DEBUG source=tracking.go:211 msg="Tracking Frequency" scrape_id=41 frequency=-0.6470763087272644
time=2024-11-04T18:44:21.390Z level=DEBUG source=tracking.go:214 msg="Tracking Residual Frequency" scrape_id=41 residual_frequency=0.017963755875825882
time=2024-11-04T18:44:21.391Z level=DEBUG source=tracking.go:217 msg="Tracking Skew" scrape_id=41 skew=1.2864940166473389
time=2024-11-04T18:44:21.391Z level=DEBUG source=tracking.go:220 msg="Tracking Last Update Interval" scrape_id=41 update_interval=65.0595703125
time=2024-11-04T18:44:21.391Z level=DEBUG source=tracking.go:223 msg="Tracking Stratum" scrape_id=41 stratum=2
time=2024-11-04T18:44:26.393Z level=DEBUG source=collector.go:182 msg="Couldn't get serverstats" scrape_id=41 err="read unixgram /run/chrony/chrony_exporter.7.sock->/run/chrony/chronyd.sock: i/o timeout"
time=2024-11-04T18:44:26.394Z level=DEBUG source=collector.go:149 msg="Scrape completed" scrape_id=41 seconds=5.005050362
time=2024-11-04T18:44:46.586Z level=DEBUG source=collector.go:146 msg="Scrape starting" scrape_id=43
It works sometimes though. Any idea what could be the issue?
Important note:
/usr/bin/chrony_exporter --chrony.address="unix:///run/chrony/chronyd.sock" --log.level=debug --collector.serverstats --collector.chmod-socket
I also set the timeout to 30s and am still getting the same errors.
I figured out what was going on with the bind: already in use issue, so no need to look further into that one. A caching layer would still be helpful here, though: we have 2 scrapers running and the 2nd request fails with this error.
But the i/o timeout is happening randomly (and more persistently). Some calls complete within a second, others get tracking and then time out fetching serverstats, and some simply fail to get both within the timeout. I have the timeout set to 30s (previously 5s), and it only shows this error once that full duration has elapsed.
Maybe this is an issue with chronyd resources?
I guess the way the unix socket works, you can only make one connection at a time.
We could change the unix socket code to make one shared connection.
Is it possible that the two scrapers run in the same process (i.e. have the same PID) and are trying to bind to the same local socket? chronyd can handle any number of concurrent clients. The protocol is actually stateless on the server side.
An unrelated question. If there were Go bindings for libchrony, would you be interested in using it in chrony_exporter? That would allow you to easily add support for all the remaining monitoring reports.
> An unrelated question. If there were Go bindings for libchrony, would you be interested in using it in chrony_exporter?
No, I would not. I much prefer pure Go implementations. C libraries have a number of problems.
- The C library functions run outside the Go runtime's observability and management.
- It's much more difficult to ship static binaries that work on any system.
> Is it possible that the two scrapers run in the same process (i.e. have the same PID) and are trying to bind to the same local socket?
Yes, that's likely what is happening here. The way the exporter is written today, each scrape opens a new unix socket connection. If two Prometheus instances scrape at the same time, it will try to open two sockets simultaneously.
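For illustration, here is a minimal sketch of that pattern (not the exporter's actual code; the paths are just the ones from your log). The local datagram socket has to be bound to its own path to receive replies, and that path is derived from the PID, so two concurrent scrapes in one process bind the same path and the second dial fails:

```go
package main

import (
	"fmt"
	"net"
	"os"
)

// dialChrony mimics the per-scrape connection pattern: bind a local
// unix datagram socket named after the PID, then connect to chronyd.
func dialChrony() (*net.UnixConn, error) {
	local := &net.UnixAddr{
		Name: fmt.Sprintf("/run/chrony/chrony_exporter.%d.sock", os.Getpid()),
		Net:  "unixgram",
	}
	remote := &net.UnixAddr{Name: "/run/chrony/chronyd.sock", Net: "unixgram"}
	return net.DialUnix("unixgram", local, remote)
}

func main() {
	c1, err := dialChrony()
	if err != nil {
		fmt.Fprintln(os.Stderr, "first dial:", err)
		return
	}
	defer c1.Close()

	// Simulates the second concurrent scrape: same PID, same local path.
	if _, err := dialChrony(); err != nil {
		fmt.Fprintln(os.Stderr, "second dial:", err) // bind: address already in use
	}
}
```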
This could be fixed. The exporter would have to open a single connection and set up a mutex lock so that each scrape happens serially.
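Roughly something like this (just a sketch with hypothetical names, not wired into the collector):

```go
package main

import (
	"net"
	"sync"
	"time"
)

// chronyClient is a hypothetical shared-connection wrapper: one unixgram
// connection for the whole process, with a mutex so requests from
// concurrent scrapes are serialized instead of each binding a new socket.
type chronyClient struct {
	mu   sync.Mutex
	conn *net.UnixConn
}

func newChronyClient(localPath, chronydPath string) (*chronyClient, error) {
	conn, err := net.DialUnix("unixgram",
		&net.UnixAddr{Name: localPath, Net: "unixgram"},
		&net.UnixAddr{Name: chronydPath, Net: "unixgram"})
	if err != nil {
		return nil, err
	}
	return &chronyClient{conn: conn}, nil
}

// request sends one command packet and waits for the reply while holding
// the lock, so overlapping scrapes proceed one at a time.
func (c *chronyClient) request(packet []byte, timeout time.Duration) ([]byte, error) {
	c.mu.Lock()
	defer c.mu.Unlock()

	if err := c.conn.SetDeadline(time.Now().Add(timeout)); err != nil {
		return nil, err
	}
	if _, err := c.conn.Write(packet); err != nil {
		return nil, err
	}
	buf := make([]byte, 4096)
	n, err := c.conn.Read(buf)
	if err != nil {
		return nil, err
	}
	return buf[:n], nil
}

func main() {} // placeholder so the sketch compiles on its own
```

The trade-off is that a slow request would block the other scraper for up to the timeout.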
You could also replace the PID number with a random or increasing number and, before binding, check whether an existing socket file is still in use. At least that's how it's done in libchrony.
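A rough Go sketch of that idea (my own illustration, not libchrony's actual code; the names are made up):

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"os"
	"path/filepath"
	"syscall"
)

// pickClientSocketPath tries numbered candidate paths. If a candidate file
// already exists, it probes the socket: ECONNREFUSED means nobody is bound
// to it anymore, so the stale file can be removed and the path reused.
func pickClientSocketPath(dir string) (string, error) {
	for i := 1; i <= 16; i++ {
		path := filepath.Join(dir, fmt.Sprintf("chrony_exporter.%d.%d.sock", os.Getpid(), i))
		if _, err := os.Stat(path); os.IsNotExist(err) {
			return path, nil
		}
		probe, err := net.DialUnix("unixgram", nil, &net.UnixAddr{Name: path, Net: "unixgram"})
		if err != nil {
			if errors.Is(err, syscall.ECONNREFUSED) && os.Remove(path) == nil {
				return path, nil
			}
			continue
		}
		probe.Close() // still in use by someone else, try the next suffix
	}
	return "", errors.New("no free client socket path")
}

func main() {
	path, err := pickClientSocketPath("/run/chrony")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Println("would bind", path)
}
```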
Hmm, oh, I see. The actual problem is that the client socket filename is colliding. That should be fixable by using a more mktemp-style generator.
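For example, something like this hypothetical helper that derives the name from random bytes instead of the PID:

```go
package main

import (
	"crypto/rand"
	"fmt"
	"path/filepath"
)

// randomClientSocketPath builds a mktemp-style unique name from random
// bytes, so concurrent scrapes in the same process can't collide on the PID.
func randomClientSocketPath(dir string) (string, error) {
	var b [8]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	return filepath.Join(dir, fmt.Sprintf("chrony_exporter.%x.sock", b[:])), nil
}

func main() {
	path, _ := randomClientSocketPath("/run/chrony")
	fmt.Println(path)
}
```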
I have a couple ideas on how to fix this.