
[RFE] Support chrony or NTPD as default instead of SNTP for AWS AMIs

Open shankar-vng opened this issue 1 year ago • 8 comments

Current situation

The Flatcar AMIs released for AWS use SNTP by default for time synchronization instead of chrony or NTP, which resolve time to millisecond accuracy.

We checked 2 instances and noticed an offset of up to about 250 ms. We did not notice any SNTP config in use, at least based on the OS config. The problem we noticed was that, with SNTP, the interface in the path has a resolution of seconds but not milliseconds.

Flatcar uses systemd-timesyncd, and I'm unable to find any flag or config that reduces the offset between nodes to millisecond accuracy. We could not find any way to set the time precision with SNTP, but in any case the OS should resolve time to millisecond accuracy by default.

$ timedatectl show-timesync --all
LinkNTPServers=
SystemNTPServers=
RuntimeNTPServers=
FallbackNTPServers=0.flatcar.pool.ntp.org 1.flatcar.pool.ntp.org 2.flatcar.pool.ntp.org 3.flatcar.pool.ntp.org
ServerName=0.flatcar.pool.ntp.org
ServerAddress=167.172.70.21
RootDistanceMaxUSec=5s
PollIntervalMinUSec=32s
PollIntervalMaxUSec=34min 8s
PollIntervalUSec=4min 16s
NTPMessage={ Leap=0, Version=4, Mode=4, Stratum=2, Precision=-23, RootDelay=1.296ms, RootDispersion=47.546ms, Reference=6D31CFAE, OriginateTimestamp=Thu 2024-02-01 11:23:41 UTC, ReceiveTimestamp=Thu 2024-02-01 11:23:41 UTC, TransmitTimestamp=Thu 2024-02-01 11:23:41 UTC, DestinationTimestamp=Thu 2024-02-01 11:23:41 UTC, Ignored=no, PacketCount=3, Jitter=20.034ms }
Frequency=-12022283

Impact

The time offset between machines varies, up to about 250 ms.

Ideal future situation

Support chrony or enable NTPD by default in the AWS AMI to resolve the accuracy issue.

Additional information

Additional GitHub issues reported and references

shankar-vng avatar Feb 05 '24 15:02 shankar-vng
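
On the question of a flag or config: systemd-timesyncd reads `NTP=` from the `[Time]` section of `timesyncd.conf` and its drop-in directory. The following is a minimal sketch, not something taken from this thread, of a drop-in that points it at the Amazon Time Sync Service link-local address (169.254.169.123). The function name and the file name `10-aws-time-sync.conf` are made up for illustration, and the service still speaks SNTP afterwards, so this only changes which server is queried.

```shell
#!/bin/sh
# Sketch: write a systemd-timesyncd drop-in that selects one NTP server.
#   $1: drop-in directory (on a real host: /etc/systemd/timesyncd.conf.d)
#   $2: server address (assumed default: 169.254.169.123,
#       the Amazon Time Sync Service link-local IPv4 address)
# The function and file names are hypothetical.
write_timesyncd_dropin() {
    dir=$1
    server=${2:-169.254.169.123}
    mkdir -p "$dir" || return 1
    # [Time] / NTP= are the documented timesyncd.conf section and key.
    printf '[Time]\nNTP=%s\n' "$server" > "$dir/10-aws-time-sync.conf"
}
```

On an instance this would be called as root with `/etc/systemd/timesyncd.conf.d` as the directory, followed by `systemctl restart systemd-timesyncd`. `timedatectl show-timesync --all` should then list the address under `SystemNTPServers=`.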

hi @shankar-vng - this seems weird.

how did you determine that the instance clocks are off by 250ms?

have you checked if the situation is better when using ntpd? if so, please consider opening an issue with https://github.com/systemd/systemd because that may be an upstream issue.

jepio avatar Feb 05 '24 16:02 jepio

can you paste timedatectl timesync-status from both instances?

jepio avatar Feb 05 '24 16:02 jepio

We had a similar topic with Azure where we documented how to use chrony through docker: https://www.flatcar.org/docs/latest/installing/cloud/azure/#use-the-azure-hyper-v-host-for-time-synchronisation-instead-of-ntp

pothos avatar Feb 06 '24 16:02 pothos

@jepio Thanks for your response. Replies inline.

  • how did you determine that the instance clocks are off by 250ms?

Logs from our containers running on different machines had timestamp differences of up to about 200 ms (not always 200 ms). The offset varies based on resolution and DNS. Here is the requested status.

Machine 1

$ timedatectl timesync-status
       Server: 167.71.195.165 (0.flatcar.pool.ntp.org)
Poll interval: 34min 8s (min: 32s; max 34min 8s)
         Leap: normal
      Version: 4
      Stratum: 3
    Reference: 907EF2B0
    Precision: 1us (-24)
Root distance: 26.923ms (max: 5s)
       Offset: -732us
        Delay: 1.468ms
       Jitter: 1.238ms
 Packet count: 243
    Frequency: +6.202ppm
______________________________________
Machine 2
$ timedatectl timesync-status
       Server: 47.241.41.246 (0.flatcar.pool.ntp.org)
Poll interval: 34min 8s (min: 32s; max 34min 8s)
         Leap: normal
      Version: 4
      Stratum: 2
    Reference: 64643D58
    Precision: 1us (-24)
Root distance: 63.178ms (max: 5s)
       Offset: +1.901ms
        Delay: 2.661ms
       Jitter: 1.852ms
 Packet count: 243
    Frequency: +7.312ppm
_______________________________________
Machine 3
$ timedatectl timesync-status
       Server: 172.104.44.120 (0.flatcar.pool.ntp.org)
Poll interval: 34min 8s (min: 32s; max 34min 8s)
         Leap: normal
      Version: 4
      Stratum: 2
    Reference: 768F1153
    Precision: 1us (-25)
Root distance: 38.428ms (max: 5s)
       Offset: +528us
        Delay: 1.391ms
       Jitter: 609us
 Packet count: 243
    Frequency: +7.618ppm
_______________________________________
Machine 4
$ timedatectl timesync-status
       Server: 106.10.186.200 (0.flatcar.pool.ntp.org)
Poll interval: 34min 8s (min: 32s; max 34min 8s)
         Leap: normal
      Version: 4
      Stratum: 2
    Reference: 6A0A9885
    Precision: 1us (-25)
Root distance: 221us (max: 5s)
       Offset: -627us
        Delay: 1.884ms
       Jitter: 1.540ms
 Packet count: 243
    Frequency: +20.541ppm

I understand that this is a systemd issue, but when it comes to AMIs for the cloud, it would be wise to use some of the cloud provider's defaults used in their AMIs.

shankar-vng avatar Feb 07 '24 15:02 shankar-vng
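
The `Offset:` values in status output like the above can be compared across machines with a short script. This is only a sketch: the helper names `offset_us` and `spread_us` are made up, and it assumes the offset is printed with a `us`, `ms`, or `s` suffix as in the output pasted in this thread.

```shell
#!/bin/sh
# Sketch: read `timedatectl timesync-status` output on stdin and print
# the Offset value as signed microseconds. Helper names are hypothetical.
offset_us() {
    awk '$1 == "Offset:" {
        v = $2
        if (v ~ /us$/)      { sub(/us$/, "", v); f = 1 }
        else if (v ~ /ms$/) { sub(/ms$/, "", v); f = 1000 }
        else if (v ~ /s$/)  { sub(/s$/, "", v);  f = 1000000 }
        else next                      # unknown unit: skip the line
        printf "%.0f\n", v * f         # round to whole microseconds
    }'
}

# Read one offset per line (microseconds) and print max minus min.
spread_us() {
    awk '{ v = $1 + 0 }
         NR == 1 { min = max = v }
         { if (v < min) min = v; if (v > max) max = v }
         END { if (NR) printf "%.0f\n", max - min }'
}
```

For example, `timedatectl timesync-status | offset_us` on each node, with the results piped into `spread_us`. Each offset is measured against whichever pool server that node happens to use, so the spread is a rough indication rather than a direct node-to-node measurement.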

I see the issue now: systemd-timesyncd only syncs with a single ntp server, and it implements SNTP not NTP. From man systemd-timesyncd:

       The systemd-timesyncd service implements SNTP only. This
       minimalistic service will step the system clock for large offsets
       or slowly adjust it for smaller deltas. Complex use cases that
       require full NTP support (and where SNTP is not sufficient) are
       not covered by systemd-timesyncd.

@pothos how about we rethink the default configuration to use? We might even want to add chrony to azure OEM for ptp and switch on AWS sync to the local NTP/PTP source https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/set-time.html#ptp-hardware-clock-requirements.

jepio avatar Feb 07 '24 15:02 jepio
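
As a rough illustration of the "local NTP/PTP source" idea on AWS, here is a sketch that prints a chrony configuration. It is an assumption-laden example, not the configuration Flatcar ships: the `server` and `refclock` lines follow the AWS documentation linked above, the remaining directives are common chrony defaults, and the function name `chrony_conf_aws` and the `/dev/ptp0` default are made up for illustration.

```shell
#!/bin/sh
# Sketch: print a chrony.conf for an EC2 instance.
#   $1: PTP hardware clock device (assumed default: /dev/ptp0)
# The function name is hypothetical.
chrony_conf_aws() {
    ptp_dev=${1:-/dev/ptp0}
    if [ -e "$ptp_dev" ]; then
        # Only emitted when the instance exposes a PTP hardware clock.
        printf 'refclock PHC %s poll 0 delay 0.000010 prefer\n' "$ptp_dev"
    fi
    # Amazon Time Sync Service, link-local IPv4 address.
    printf 'server 169.254.169.123 prefer iburst minpoll 4 maxpoll 4\n'
    printf 'driftfile /var/lib/chrony/drift\n'
    printf 'makestep 1.0 3\n'   # step the clock only during the first 3 updates
    printf 'rtcsync\n'
}
```

The output could be written to a file and mounted into a chrony container, along the lines of the Azure documentation mentioned earlier in the thread.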

@jepio please let me know if I can help in any way in pushing the changes to the AWS AMIs before the end of Q1. Kindly point me to the relevant documentation 🙏

shankar-vng avatar Feb 12 '24 07:02 shankar-vng

It's a matter of figuring out how to implement the change in the AWS OEM sysext without disrupting other platforms. To start you would need to build your own images for testing: https://www.flatcar.org/docs/latest/reference/developer-guides/sdk-modifying-flatcar/.

I can't promise that anyone will have time to look at this in Q1, we're all busy with other issues.

jepio avatar Feb 14 '24 07:02 jepio

We merged https://github.com/flatcar/scripts/pull/1792 which implements this change for GCP/AWS/Azure. This will be released in the alpha channel in April.

jepio avatar Mar 28 '24 13:03 jepio

@shankar-vng this reached stable just now (3975.2.0).

jepio avatar Aug 08 '24 10:08 jepio