openwrt icon indicating copy to clipboard operation
openwrt copied to clipboard

Time is not synced on (re)boot automatically after a sysupgrade to 22.03.0

Open birdie-github opened this issue 3 years ago • 22 comments

I don't want to file a bug report under packages because I've not been able to find the package for ntpd. I'm running OpenWrt 22.03.0 r19685-512e76967f . My device is TP-Link TL-WDR3600 v1. After a sysupgrade the date was set to September 3rd, 2022. Today it's the 29th of September.

Of course opkg update failed with wget returned 5.

I couldn't find a command to sync the time in a SSH session, so I simply ran /etc/init.d/sysntpd restart which fixed the issue. I waited at the very least 10 minutes for the time to sync automatically. Didn't happen.

Please make sure after a sysupgrade time is indeed getting synced.

There are zero ntp related messages in the System Log.

I've never touched any NTP related settings, everything is AFAIK by default. My Internet connection is working perfectly.

ntp

birdie-github avatar Sep 29 '22 09:09 birdie-github

I've had the same issue for a few months now. This seems to have been a post-21 regression of some sort. It's definitely a dependency management issue, where some network access dependency is missing from /etc/init.d/sysntpd; I adjusted /etc/init.d/sysntpd:

 start_service() {                                   
         . /lib/functions/network.sh                 
+        sleeptime=20                                 
+        sleep $sleeptime                                   
+        logger "slept $sleeptime"                         
         validate_ntp_section ntp start_ntpd_instance      
 }

... and rebooted, getting:

[mkennedy@imp ~]$ ssh [email protected] "logread | grep slept; date;"; date
Tue Nov 22 17:44:27 UTC 2022
Tue Nov 22 12:46:35 PM EST 2022
[mkennedy@imp ~]$ ssh [email protected] "logread | grep slept; date;"; date
Tue Nov 22 17:44:27 2022 user.notice root: slept 20
Tue Nov 22 17:44:28 UTC 2022
Tue Nov 22 12:46:36 PM EST 2022
[mkennedy@imp ~]$ ssh [email protected] "logread | grep slept; date;"; date
Tue Nov 22 17:44:27 2022 user.notice root: slept 20
Tue Nov 22 17:44:28 UTC 2022
Tue Nov 22 12:46:36 PM EST 2022
[mkennedy@imp ~]$ ssh [email protected] "logread | grep slept; date;"; date
Tue Nov 22 17:44:27 2022 user.notice root: slept 20
Tue Nov 22 17:46:38 UTC 2022 ## <-- sysntpd successful
Tue Nov 22 12:46:38 PM EST 2022

... but when I set sleeptime=1:

[mkennedy@imp ~]$ ssh [email protected] "logread | grep slept; date;"; date
Tue Nov 22 17:48:41 2022 user.notice root: slept 1
Tue Nov 22 17:48:48 2022 user.notice root: slept 1
Tue Nov 22 17:48:50 UTC 2022
Tue Nov 22 12:49:31 PM EST 2022
[mkennedy@imp ~]$ ssh [email protected] "logread | grep slept; date;"; date
Tue Nov 22 17:48:41 2022 user.notice root: slept 1
Tue Nov 22 17:48:48 2022 user.notice root: slept 1
Tue Nov 22 17:48:56 UTC 2022
Tue Nov 22 12:49:37 PM EST 2022
[mkennedy@imp ~]$ ssh [email protected] "logread | grep slept; date;"; date
Tue Nov 22 17:48:41 2022 user.notice root: slept 1
Tue Nov 22 17:48:48 2022 user.notice root: slept 1
Tue Nov 22 17:49:02 UTC 2022 ## <-- sysntpd never succeeds in updating the time
Tue Nov 22 12:49:43 PM EST 2022

... sysntpd consistently does not successfully update.

Hurricos avatar Nov 22 '22 17:11 Hurricos

Two problems:

  • systntpd does not wait for network access to attempt to get the current time
  • sysntpd's procd triggers are sufficient to invoke start_service, but this does not actually cause the service to re-attempt an NTP query (either by restarting the daemon or sending it a signal, etc.).

Hurricos avatar Nov 22 '22 18:11 Hurricos

There's a third problem here: sysntpd's procd triggers are not always sufficient to invoke start_service in the case of a network state change if use_dhcp is off -- that is, we only have a network-state-trigger to reload if use_dhcp is on.

Unrelatedly, there's a hotplug race-condition in the network -> sysntpd /etc/init.d/network reload (at least when the DHCP server provides an NTP server for us).

Edit: To note, this third problem is most easily packaged as, "network updates don't always trigger a change in the procd line and thus restart the daemon upon reload". It is not a good expectation, however, to have the "reload" trigger a new attempt at connecting if the problem is not with a change of configuration at all. So, 'problems' 2 and 3 really don't matter here; detection of changed procd command line works fine and has its own specific purpose.

Hurricos avatar Nov 22 '22 20:11 Hurricos

A fourth problem might be that although busybox ntpd does not produce errors if it fails to act as an NTP client. Looking into what exactly fails in ntpd now.

Of the two sets (1 vs 4) of valid problems, only one need be solved.

The OpenWrt config with which ntpd doesn't succeed on boot involves an external DNS resolver, so I'm going to guess the issue is some sort of bug with refreshing DNS resolution after a failure that busybox hasn't yet addressed, which would be easiest solved by just waiting until DNS is back, but probably best solved by a patch to busybox.

Hurricos avatar Nov 22 '22 21:11 Hurricos

OK, it's a ujail porting issue combined. If you use an external resolver, you get:

clock_gettime64(CLOCK_REALTIME, {tv_sec=1669151348, tv_nsec=377491500}) = 0
open("/etc/hosts", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/etc/resolv.conf", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = 3
fcntl64(3, F_SETFD, FD_CLOEXEC)         = 0
read(3, "", 248)                        = 0
close(3)                                = 0
clock_gettime64(CLOCK_REALTIME, {tv_sec=1669151348, tv_nsec=391753460}) = 0
clock_gettime64(CLOCK_REALTIME, {tv_sec=1669151348, tv_nsec=392073940}) = 0
socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 3
bind(3, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
clock_gettime64(CLOCK_REALTIME, {tv_sec=1669151348, tv_nsec=403804320}) = 0
sendto(3, "\306M\1\0\0\1\0\0\0\0\0\0\0010\7openwrt\4pool\3ntp\3"..., 40, MSG_NOSIGNAL, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.0.1")}, 16) = 40
sendto(3, "\2522\1\0\0\1\0\0\0\0\0\0\0010\7openwrt\4pool\3ntp\3"..., 40, MSG_NOSIGNAL, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.0.1")}, 16) = 40
poll([{fd=3, events=POLLIN}], 1, 2500)  = 0 (Timeout)
clock_gettime64(CLOCK_REALTIME, {tv_sec=1669151350, tv_nsec=917659980}) = 0
sendto(3, "\306M\1\0\0\1\0\0\0\0\0\0\0010\7openwrt\4pool\3ntp\3"..., 40, MSG_NOSIGNAL, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.0.1")}, 16) = 40
sendto(3, "\2522\1\0\0\1\0\0\0\0\0\0\0010\7openwrt\4pool\3ntp\3"..., 40, MSG_NOSIGNAL, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.0.1")}, 16) = 40
poll([{fd=3, events=POLLIN}], 1, 2500)  = 0 (Timeout)
clock_gettime64(CLOCK_REALTIME, {tv_sec=1669151353, tv_nsec=422786220}) = 0
close(3)                                = 0
write(2, "ntpd: bad address '0.openwrt.poo"..., 43) = 43
clock_gettime64(CLOCK_REALTIME, {tv_sec=1669151353, tv_nsec=423631820}) = 0
poll([], 0, 143000n)                     = 0 (Timeout)
clock_gettime64(CLOCK_REALTIME, {tv_sec=1669151496, tv_nsec=524269380}) = 0
fork()                                  = 2
clock_gettime64(CLOCK_REALTIME, {tv_sec=1669151496, tv_nsec=526236600}) = 0
poll([], 0, 109000)                     = 0 (Timeout)

What I don't understand is this bit:

open("/etc/resolv.conf", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = 3
fcntl64(3, F_SETFD, FD_CLOEXEC)         = 0
read(3, "", 248)                        = 0
close(3)                                = 0

Why do I get a successful open, but nevertheless read 0 after attempting to read 248B?

Hurricos avatar Nov 22 '22 21:11 Hurricos

OK. The problem is that for systems which do not use dnsmasq for DNS caching, sysntpd starts (and its ujail is set up) before /tmp/resolv.conf.d/resolv.conf.auto (the ultimate target of /etc/resolv.conf) is finalized.

The first run of sysntpd after booting:

root@ap3825i0:/proc/2931# grep resolv mountinfo
43 37 0:16 /resolv.conf.d/resolv.conf.auto//deleted /etc/resolv.conf ro,noatime - tmpfs tmpfs rw

After a restart of sysntpd:

root@ap3825i0:/proc/3540# grep resolv mountinfo
43 37 0:16 /resolv.conf.d/resolv.conf.auto /etc/resolv.conf ro,noatime - tmpfs tmpfs rw

This affects all OpenWrt versions post-ujail-transition, but it's not necessarily showing up on all of them, usually because dnsmasq is running and ntpd.c will default to querying 127.0.0.1:53 should it fail elsewhere.

Hurricos avatar Nov 22 '22 22:11 Hurricos

... and that is this: https://github.com/openwrt/openwrt/issues/10389

@birdie-github I'd appreciate if you can confirm that on first boot, if you find the process ID $PID of your ntpd with e.g. ps www | grep ntpd | grep -v ujail, that grep resolv /proc/$PID/mountinfo gives you roughly 43 37 0:16 /resolv.conf.d/resolv.conf.auto//deleted /etc/resolv.conf ro,noatime - tmpfs tmpfs rw and not 43 37 0:16 /resolv.conf.d/resolv.conf.auto /etc/resolv.conf ro,noatime - tmpfs tmpfs rw; that way, we can move on to merging this with #10389.

Hurricos avatar Nov 22 '22 22:11 Hurricos

The fix for now for this would be to provide DHCP option 42 on your network. When you have a network reload of any kind, the loss and regain of the additional IP address in option 42 will cause two reloads of ntpd in tandem with the recreation of /tmp/resolv.conf.d/resolv.conf.auto, thus avoiding the issue of the file being deleted.

Hurricos avatar Nov 23 '22 00:11 Hurricos

Correction: two restarts. Reload only causes a restart when the resulting configuration is different.

Hurricos avatar Nov 23 '22 00:11 Hurricos

Workaround

Start ntpd directly, without jail

# mv /etc/capabilities/ntpd.json /etc/capabilities/ntpd.json-disabled
# service sysntpd restart

Details: When will sysntpd sync time?

odon-noda avatar Jan 06 '23 19:01 odon-noda

Image

My router is also incapable of using NTP and loses the time on every reboot.

No issues are reported in the Web UI at all despite the option clearly not working.

EDIT: Looks like next release will have a fix - https://github.com/FreifunkFranken/firmware/commit/f48e53cd955de7d5f8d0412e6c93e750ef19928e

C0rn3j avatar Jan 19 '25 16:01 C0rn3j

I came across this issue in 24.10.0 while upgrading TP-Link Deco M4R v1. I can confirm that the problem gets resolved after running /etc/init.d/sysntpd restart and waiting a few seconds.

lhotari avatar Feb 13 '25 10:02 lhotari

@lhotari I presume the issue is gone after you upgrade to 24.10?

@fblaese and I suppose this issue can be closed now that 24.10 is out?

C0rn3j avatar Feb 13 '25 10:02 C0rn3j

@lhotari I presume the issue is gone after you upgrade to 24.10?

@C0rn3j I actually meant that I upgraded to 24.10.0 on TP-Link Deco M4R v1 and I hit the issue. There's a capture of a session where this problem can be seen: https://gist.github.com/lhotari/a196857a664e0f267e1dabc7018559fc . After upgrading the clock didn't sync without manually doing it.

lhotari avatar Feb 13 '25 10:02 lhotari

@lhotari the gist doesn't show any significant time change:

...
root@mydeco:~# date
Tue Feb  4 01:26:39 EET 2025
root@mydeco:~# /etc/init.d/sysntpd restart
root@mydeco:~# date
Tue Feb  4 01:26:45 EET 2025
root@mydeco:~# opkg update
Downloading https://downloads.openwrt.org/releases/24.10.0/targets/ath79/generic/packages/Packages.gz
Updated list of available packages in /var/opkg-lists/openwrt_core
...

I also went back to my own 24.10.0 device and couldn't reproduce this this (although, in that case it's been up for days so the log ring buffer has lost the initial entries so couldn't check for any evidence from its early uptime).

What time (roughly) was the gist text taken? Is it possible to reproduce with the addition of a date command after the successful opkg commands?

dimitris-personal avatar Feb 13 '25 19:02 dimitris-personal

@lhotari the gist doesn't show any significant time change:

@dimitris-personal That's true. I'm assuming that the time got updated just before running opkg update. I didn't record it, but the date was today's date after I checked it.

I also went back to my own 24.10.0 device and couldn't reproduce this this (although, in that case it's been up for days so the log ring buffer has lost the initial entries so couldn't check for any evidence from its early uptime).

I was able to reproduce with TP-Link Deco M4R v1. When I upgraded 24.10.0 for a Cudy WR3000, there wasn't such issue.

What time (roughly) was the gist text taken? Is it possible to reproduce with the addition of a date command after the successful opkg commands?

The gist text is copy-pasted from a terminal a few minutes before creating the gist. As mentioned above, the date was correct after I checked. I didn't check it immediately since I first assumed that magically restarting sysntpd did something else like syncing a hardware clock etc. It seems that Deco M4R v1 doesn't have a hardware clock at all. The only reasonable explanation seems to be that sysntpd made the update just before I ran opkg update.

I can try to reproduce again on Deco M4R v1. The same issue happened twice that even if the device was running for a long time, the time wasn't correct and opkg update would fail. It seems that after rebooting it, it also takes some time until the time gets updated. Perhaps it's not directly related to the upgrade but can sporadically happen after any reboot?

lhotari avatar Feb 13 '25 19:02 lhotari

I tried reproducing again on TP-Link Deco M4R v1. The problem didn't reproduce this time. It seems that the behavior is flaky.

lhotari avatar Feb 14 '25 07:02 lhotari

Hi, just wanted to report that this is still happening on GL.iNet Flint 2 (GL-MT6000) with OpenWrt 24.10.0 r28427-6df0e3d02a / LuCI openwrt-24.10 branch 25.049.66344~2b8e93c.

webknjaz avatar Feb 27 '25 15:02 webknjaz

Experiencing this on 24.10 - in my case the wan is a bond connection, and there seems to be a race condition or insufficient event triggering for sysntpd to be started after the bond has slave connections. Working around via rc.local with:

...
sleep 180
/etc/init.d/sysntpd restart || /bin/true
...

quinndiggity avatar Apr 11 '25 18:04 quinndiggity

What is odd is that sysntpd never recovers after connectivity is established or lost + resumed - the service continues to run, while not performing its purpose

quinndiggity avatar Apr 11 '25 19:04 quinndiggity

I think, though maybe wrong, that NTP failing to sync after reboot is what's causing my switch to be non-accessible after reboot. It always restarts thinking it's about (but not exactly) a year ago. I only know this because one of the first things I reconfigured after factory resetting it for the umpteenth time, was remote logging. I know SSH will block if the times don't match, but I'm surprised LuCI also blocks (but maybe a year is enough for certs to be garbage). It didn't help using local or remote NTP. I'm on 23.05 on all my 4 network devices but only the DGS-1210-16 switch fails to sync at all (since a week ago).

shikasta-net avatar May 12 '25 11:05 shikasta-net

For anyone using syslog-ng and pbr you can do follow

/etc/syslog-ng.d/sysntpd.conf

filter f_sysntpd {
  match("(.*)Setting interface trigger for wan(.*)", value("MESSAGE"));
};

destination d_sysntpd {
  program("/etc/commands/sysntpd_restart.sh" template("<${PRI}>${DATE} ${HOST} ${MESSAGE}\n"));
};

log {
  source(src);
  filter(f_sysntpd);
  destination(d_sysntpd);
};

/etc/commands/sysntpd_restart.sh

#!/bin/sh

while read line ; do
  sleep 5
  logger -t sysntpd "Restarting sysntpd for sync time"
  /etc/init.d/sysntpd restart
done

quyenvsp avatar Jun 28 '25 07:06 quyenvsp