
Lease getting expired without retrying at T2

Open khhizr007 opened this issue 1 year ago • 3 comments

systemd version the issue has been seen with

254

Used distribution

Ubuntu 22.04

Linux kernel version used

6.5.0-1022-aws

CPU architectures issue was seen on

x86_64

Component

systemd-networkd

Expected behaviour you didn't see

The lease renewal should take place at T2 rebinding time so that the address lease is extended.

Unexpected behaviour you saw

Over the past couple of weeks we have noticed that our application running on AWS EC2 goes down abruptly, and we can no longer even SSH into the instance.

On investigating our system logs, we found that the issue stems from the expiration of the lease for the IP address that our instance obtains from the DHCP server. This can be seen from the error message below:

systemd-networkd[328]: ens5: Could not set DHCPv4 address: Connection timed out
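For anyone trying to confirm the same symptom, here is a minimal sketch that picks this message out of saved log text. The regular expression is an assumption based only on the single line quoted above, and `find_dhcp_failures` is a hypothetical helper name, not part of systemd.

```python
import re

# Pattern modelled on the one log line quoted in this report; other
# systemd versions may format the message differently (assumption).
DHCP_FAILURE = re.compile(
    r"systemd-networkd\[(?P<pid>\d+)\]: (?P<iface>\S+): "
    r"Could not set DHCPv4 address: (?P<reason>.+)$"
)


def find_dhcp_failures(lines):
    """Return (interface, reason) pairs for every matching log line."""
    failures = []
    for line in lines:
        match = DHCP_FAILURE.search(line.rstrip("\n"))
        if match:
            failures.append((match.group("iface"), match.group("reason")))
    return failures


sample = [
    "systemd-networkd[328]: ens5: Could not set DHCPv4 address: Connection timed out",
    "some unrelated log line",
]
failures = find_dhcp_failures(sample)
```

The input can be any iterable of lines, for example an open file holding an exported journal.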

Also after a while we can see this error about the deletion of the lease in the logs:

Deleting interface #3 ens5, 172.o0.1.xx6#123, interface stats: received=4556, sent=6992, dropped=0, active_time=1045188 secs

This error occurs when a couple of cron jobs are running in the background and the system is under stress. We see a high level of read operations on our disk at the time of the error. Although this is not the root cause of the problems in our system but rather a symptom of something going awry with our application, I still think it needs to be addressed.

So, getting back to my point: even if the lease could not be renewed at T1, there should be a rebinding attempt at T2. But no such attempt is seen; our application becomes unreachable and we have to reboot the instance manually.

We have already tried switching Ubuntu versions, but that did not solve the problem. There are also past issues here that raised a similar concern about lease expiration.
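To make the T1/T2 expectation concrete: RFC 2131 defaults T1 to 0.5 and T2 to 0.875 of the lease duration unless the server overrides them. The sketch below only illustrates those defaults. It is not how systemd-networkd computes its timers, it ignores the random fuzz the RFC recommends, and the 3600-second lease is a hypothetical value, since the actual lease time is not stated in this report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LeaseTimers:
    t1: float      # start unicast renewal (RENEWING)
    t2: float      # start broadcast rebinding (REBINDING)
    expiry: float  # lease ends, address must be dropped


def default_timers(lease_seconds: float) -> LeaseTimers:
    """RFC 2131 default timers for a lease, without randomisation."""
    if lease_seconds <= 0:
        raise ValueError("lease duration must be positive")
    return LeaseTimers(
        t1=0.5 * lease_seconds,
        t2=0.875 * lease_seconds,
        expiry=float(lease_seconds),
    )


def expected_state(elapsed: float, timers: LeaseTimers) -> str:
    """Client state the RFC expects at `elapsed` seconds into the lease,
    assuming no renewal has succeeded so far."""
    if elapsed < timers.t1:
        return "BOUND"
    if elapsed < timers.t2:
        return "RENEWING"
    if elapsed < timers.expiry:
        return "REBINDING"
    return "EXPIRED"


timers = default_timers(3600)  # hypothetical one-hour lease
```

With these defaults, a client whose renewal failed at T1 would be expected to be in REBINDING between 3150 and 3600 seconds, which is the attempt that appears to be missing here.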

Steps to reproduce the problem

Run high-IOPS operations in the background on your system at the time the lease is due to be renewed through the DHCP server.

Additional program output to the terminal or log subsystem illustrating the issue

No response

khhizr007 avatar Aug 04 '24 13:08 khhizr007

Deleting interface #3 ens5, 172.o0.1.xx6#123, interface stats: received=4556, sent=6992, dropped=0, active_time=1045188 secs

This is not from networkd. How do you get this?

yuwata avatar Aug 05 '24 00:08 yuwata

Sorry for the late reply @yuwata, this log is from the ntpd service. The other log that I shared is from systemd. I just wanted to provide a clearer picture for a better understanding of the problem.

khhizr007 avatar Aug 06 '24 18:08 khhizr007

Hi @yuwata, would you need any other logs or details for this? I would be more than happy to help.

khhizr007 avatar Aug 22 '24 14:08 khhizr007

We experience the same issue in Flatcar: https://github.com/flatcar/Flatcar/issues/1736 Is there a possible workaround?

kayrus avatar May 05 '25 12:05 kayrus

Hi @kayrus , thanks for chiming in. I believe this issue hasn’t been fully addressed yet.

From our experience, this problem tends to manifest under system load. In our case, we traced it to suboptimal SSD performance on AWS, specifically volumes with low IOPS. This led to increased I/O wait times and overall system stress, during which the DHCP lease renewal fails. But networkd should still retry at the T2 rebinding time; it does not, so the system loses the lease and goes offline without an IP address.

What I would suggest as a more immediate remedy is to look for similar issues in your system that might be causing the lease renewal to fail in the first place.

That said, this does appear to be an unresolved issue within systemd, especially considering the behavior we’ve observed.
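One rough way to check for the I/O pressure described above is to compare two samples of the aggregate `cpu` line from `/proc/stat` and see what share of the elapsed CPU time was spent in iowait. This is only a sketch: the helper names are hypothetical, and the sample lines below are made-up numbers for illustration, not measurements from our instances.

```python
# Order of the first eight counters on the aggregate "cpu" line of /proc/stat.
FIELDS = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]


def parse_cpu_line(line: str) -> dict:
    """Parse the aggregate 'cpu' line of /proc/stat into named counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    return dict(zip(FIELDS, values))


def iowait_fraction(before: str, after: str) -> float:
    """Share of CPU time spent waiting on I/O between two samples."""
    first, second = parse_cpu_line(before), parse_cpu_line(after)
    deltas = {name: second[name] - first[name] for name in first}
    total = sum(deltas.values())
    return deltas["iowait"] / total if total > 0 else 0.0


def read_cpu_line(path: str = "/proc/stat") -> str:
    """Read the first line of /proc/stat (Linux only)."""
    with open(path) as stat_file:
        return stat_file.readline()


# Made-up samples for illustration only.
sample_before = "cpu  100 0 50 800 50 0 0 0 0 0"
sample_after = "cpu  150 0 70 900 130 0 0 0 0 0"
fraction = iowait_fraction(sample_before, sample_after)
```

On a live system, call `read_cpu_line()` twice a few seconds apart and pass both lines to `iowait_fraction`. A consistently high value around the time the lease is due for renewal would match what we saw.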

khhizr007 avatar May 21 '25 08:05 khhizr007

Correct. In both occurrences the I/O operations had increased.

[Two images attached]

kayrus avatar May 21 '25 08:05 kayrus

Just to add a data point: I have observed a similar issue (high I/O due to swapping). https://github.com/systemd/systemd/issues/32045#issuecomment-2933000042

caizixian avatar Jun 03 '25 00:06 caizixian