dhcpcd
dhcpcd copied to clipboard
dhcpcd expires v6 delegated prefix at t2 instead of at lease end
When the DHCPv6 server is not responding then dhcpcd is expiring the prefix lease at t2 instead of at the lease expiration time.
In the most recent example these are events: Sun Jan 22 23:46:19 EST 2023: BOUND6 n:'2600:4041:xxxx:xx00::/56' o:'' Mon Jan 23 01:22:29 EST 2023: EXPIRE6 n:'' o:'2600:4041:xxxx:xx00::/56'
t1 was 3600, t2 was 5760. So dhcpcd immediately expires the lease once the first rebind fails.
There was no link drop, and the router is continuously announces (i.e., the v6 address is still working). So I would not expect this premature expiration.
I have not yet tested the behavior for a v4 lease.
The same happens after dhcpcd -n is run while the DHCPv6 server is not responding, independent of the age of the lease.
What version are you using?
This happens with 9.4.1.
As far as I read the code it's because if PD is true dhcp6_startrebind has a timeout with dhcp6_failrebind which then calls dhcp6_fail which runs EXPIRE6 and starts a new discover unless DHCPCD_LASTLEASE_EXTEND is set. That option should not be necessary as the lease is not yet expired.
Earlier versions (e.g., 7.1.0) avoided parts of that if DHCPCD_LASTLEASE is set. But even that seems not correct as expiration should not happen even it -E is not set.
I'd be happy to dig into this a bit more maybe next weekend and see if I can come up with a patch (I have one for the different but related issues with 7.1.0).
My current thoughts: Best guess is the dhcp6_fail code only expects to be called if the lease is actually expired, but for PD that is not true. And I assume dhcp6_startrebind is adding the timeout for dhcp6_failrebind with PD because of the possibility of arriving in dhcp6_startrebind with some of the conditions of RFC3633, section 12.1 met. The revalidation should not be needed if the state while entering dhcp6_startrebind is either DH6S_BOUND or DH6S_RENEW (the network connection can be assumed to have been stable as a link flap would have caused expiration immediately). In these cases dhcp6_failrebind should not be added as a timeout. It would only be used on restart of dhcpcd, just in case a section 12.1 condition is true. Though I would consider an exception for DHCPCD_LASTLEASE being set, but I am not sure that is rally aligned with the spirit of the option.
I have a similar issue with IPv6 completely dropping whenever the lifetime expires. Looking at what's received (renew in 1800, rebind in 2880, expire in 3600 seconds) suggests that preferred_lft should be 2880 and valid_lft 3600, but both are assigned 3600 by dhcpcd, which leaves no leaway for phasing out old addresses, instead resulting in a complete drop of IPv6 every 3600 seconds.
The lifetime values seem correct to me as that is the lifetime of the lease. The lease should be renewed every 1800s, which should reset the lifetime values to 3600. Otherwise IPv6 traffic would stop working before the lifetime ends, or an address would be use past the valid lifetime. If the dhcp server does not respond within 3600s then dropping the address seems correct as you no longer have a valid lease. Is the renew/rebind properly starting at 1800s/2880s?
The basic idea is supposed to be that an IPv6 address remains valid for longer than is it preferred. This is what prevents total drops and allows to start negotiating renewals early enough. This fails here, because the delegated interface gets 3600 for both valid and preferred lifetimes, even though the message received (renew in 1800, rebind in 2880, expire in 3600 seconds) suggests that there would be plenty of leeway before it completely expires.
What I currently get:
Jan 29 16:37:33 p8h61 dhcpcd[588]: enp4s0: ADV 2001:14ba:a016:6f00::/56 from fe80::bb Jan 29 16:37:33 p8h61 dhcpcd[588]: enp4s0: REPLY6 received from fe80::bb Jan 29 16:37:33 p8h61 dhcpcd[588]: enp4s0: renew in 1800, rebind in 2880, expire in 3600 seconds Jan 29 16:37:33 p8h61 dhcpcd[588]: lo: adding reject route to 2001:14ba:a016:6f00::/56 Jan 29 16:37:33 p8h61 dhcpcd[588]: enp4s0: delegated prefix 2001:14ba:a016:6f00::/56 Jan 29 16:37:33 p8h61 dhcpcd[588]: br0: adding address 2001:14ba:a016:6f07::1/64 Jan 29 16:37:33 p8h61 dhcpcd[588]: br0: adding route to 2001:14ba:a016:6f07::/64 Jan 29 16:37:41 p8h61 dhcpcd[588]: enp4s0: adding default route via fe80::bb Jan 29 17:37:39 p8h61 dhcpcd[588]: br0: pid 0 deleted route to 2001:14ba:a016:6f07::/64 Jan 29 17:52:07 p8h61 dhcpcd[588]: br0: adding route to 2001:14ba:a016:6f07::/64 Jan 29 18:38:04 p8h61 dhcpcd[588]: br0: pid 0 deleted route to 2001:14ba:a016:6f07::/64 Jan 29 18:43:06 p8h61 dhcpcd[588]: br0: adding route to 2001:14ba:a016:6f07::/64
The problem seems to me that the route is preemptively deleted, then added again, whenever the prefix is renewed. IMHO, the route should not get deleted unless an entirely different prefix is received at renewal.
The log message suggests that dhcpcd will try to renew after 1800s. Address lifetimes should not be related to the first renew, but to the lease length. I see the preferred/valid lifetimes for all delegations updated with every renew/rebind dhcpcd receives (e.g., around 1800s I would see them reset to 3600 for both valid and preferred lifetime as expected). Do you know if dhcpcd receives the renew messages from the DHCPDv6 server? Do you see the expected RENEW6 and DELEGATED6 events (e.g., via hooks)?
I created the following hook:
case "$reason" in BOUND6|REBIND6|RENEW6|DELEGATED6) if command -v logger >/dev/null 2>&1; then logger -p daemon.info -t "dhcpcd" "Info: Logged IPv6 event. Reason: $reason $new_delegated_dhcp6_prefix" fi ;; esac
I sometimes see logs about delegated, but nothing else.
I see BOUND6/REBIND6/RENEW6 right before/after seeing the DELEGATED6. If you do not see these events something seems off.
Jan 30 12:46:53 p8h61 dhcpcd[595]: enp4s0: Router Advertisement from fe80::bb Jan 30 12:46:53 p8h61 dhcpcd[595]: enp4s0: no global addresses for default route Jan 30 12:46:54 p8h61 dhcpcd[595]: enp4s0: ADV 2001:14ba:a020:aa00::/56 from fe80::bb Jan 30 12:46:54 p8h61 dhcpcd[595]: enp4s0: REPLY6 received from fe80::bb Jan 30 12:46:54 p8h61 dhcpcd[595]: enp4s0: renew in 1800, rebind in 2880, expire in 3600 seconds Jan 30 12:46:54 p8h61 dhcpcd[595]: lo: adding reject route to 2001:14ba:a020:aa00::/56 Jan 30 12:46:54 p8h61 dhcpcd[595]: enp4s0: delegated prefix 2001:14ba:a020:aa00::/56 Jan 30 12:46:54 p8h61 dhcpcd[595]: br0: adding address 2001:14ba:a020:aa07::1/64 Jan 30 12:46:54 p8h61 dhcpcd[595]: br0: adding route to 2001:14ba:a020:aa07::/64 Jan 30 12:46:54 p8h61 dhcpcd[1130]: Info: Logged IPv6 event BOUND6 Jan 30 12:46:56 p8h61 dhcpcd[1188]: Info: Logged IPv6 event DELEGATED6 2001:14ba:a020:aa07::1/64 Jan 30 13:10:56 p8h61 dhcpcd[595]: enp4s0: adding default route via fe80::bb Jan 30 13:16:54 p8h61 dhcpcd[9124]: Info: Logged IPv6 event DELEGATED6 2001:14ba:a020:aa07::1/64 Jan 30 13:16:54 p8h61 dhcpcd[9142]: Info: Logged IPv6 event RENEW6 Jan 30 13:46:54 p8h61 dhcpcd[14025]: Info: Logged IPv6 event DELEGATED6 2001:14ba:a020:aa07::1/64 Jan 30 13:46:54 p8h61 dhcpcd[14043]: Info: Logged IPv6 event RENEW6 Jan 30 13:47:01 p8h61 dhcpcd[595]: br0: pid 0 deleted route to 2001:14ba:a020:aa07::/64 Jan 30 13:55:14 p8h61 dhcpcd[595]: br0: adding route to 2001:14ba:a020:aa07::/64 Jan 30 14:16:54 p8h61 dhcpcd[15864]: Info: Logged IPv6 event DELEGATED6 2001:14ba:a020:aa07::1/64 Jan 30 14:16:54 p8h61 dhcpcd[15882]: Info: Logged IPv6 event RENEW6 Jan 30 14:46:54 p8h61 dhcpcd[15957]: Info: Logged IPv6 event DELEGATED6 2001:14ba:a020:aa07::1/64 Jan 30 14:46:54 p8h61 dhcpcd[15975]: Info: Logged IPv6 event RENEW6 Jan 30 14:47:09 p8h61 dhcpcd[595]: br0: pid 0 deleted route to 2001:14ba:a020:aa07::/64 Jan 30 14:52:17 p8h61 dhcpcd[595]: br0: adding route to 2001:14ba:a020:aa07::/64 Jan 30 15:16:54 p8h61 dhcpcd[16124]: Info: Logged IPv6 event DELEGATED6 2001:14ba:a020:aa07::1/64 Jan 30 15:16:54 p8h61 dhcpcd[16142]: Info: Logged IPv6 event RENEW6
One thing I find odd is that, after a few hours, I end up with 2 routes and 2 defaults for IPv6:
$ ip -6 r ::1 dev lo proto kernel metric 256 pref medium unreachable 2001:14ba:a020:aa00::/56 dev lo proto dhcp metric 1001 pref medium 2001:14ba:a020:aa07::/64 dev br0 proto kernel metric 256 expires 3053sec pref medium 2001:14ba:a020:aa07::/64 dev br0 proto dhcp metric 1007 pref medium fe80::/64 dev br0 proto kernel metric 256 pref medium fe80::/64 dev enp4s0 proto kernel metric 256 pref medium default via fe80::bb dev enp4s0 proto ra metric 1002 pref medium default via fe80::bb dev enp4s0 proto ra metric 1024 expires 3591sec hoplimit 64 pref medium
Seems to me like you have something else messing with the routes ("br0: pid 0 deleted route to 2001:14ba:a020:aa07::/64" I think means dhcpcd noticed something else removed the routes).
In my config dhcpcd does not set up routes with expiration, but that may be due to different config/Os on your side. If the initial routes are set up with expiration (or are being replaced with routes with expiration) that may explain why they are then removed after the initial IP validity expires. The "RENEW6" events should have updated the IP validity, but not sure dhcpcd would touch the expiration of the routes.
Given the timing I see in the logs I'd check the IP validity and routing table right after the initial bind, a few minutes later (just to check if something modified the routes after detecting the new IPv6 address), after the first RENEW, just after the 2nd RENEW, and then a few minutes after the second RENEW (as that seems to be then the route disappears).
Do you see any other log entries for something else changing routes around 12:46-12:47 or 13:46-13:55?
The above routing table answers that one. There are two routes for the same prefix: one added by dhcp and one by the kernel (presumably in response to router advert).
I was talking about the routes on br0. That interface should not accept any RAs...
When the DHCPv6 server is not responding then dhcpcd is expiring the prefix lease at t2 instead of at the lease expiration time.
How are you judging this? dhcpcd just uses T1 and T2 timers for renew and rebind. Expiration is managed of the address lifetimes which are different from T1 and T2.
Expiration happened around 5760s (which is t2) while lifetime is 7200. So basically expiration happens right after the rebind attempt even if the lifetime of the lease is still ~1440s out (technically it's 5760+the timeout).
See also pull request #180 for a fix (or the start of one).
But the lifetime of what exactly? Unlike DHCPv4, DHCPv6 has no exact expiration. T1 and T2 are also not indicators of expiration - they are just for renew and rebind. Even then, just indicators. Defaults if you will. The exact renew, rebind and expiration times are derived from addresses, generally the smallest. https://github.com/NetworkConfiguration/dhcpcd/blob/master/src/dhcp6.c#L2197 https://github.com/NetworkConfiguration/dhcpcd/blob/master/src/dhcp6.c#L2279
Yes, I am aware. What I am saying is that dhcpcd expires the prefix before the lifetime given, and that happens to be at t2 when rebind starts.
From the lease: dhcp6_ia_pd1_prefix1_pltime='7200' dhcp6_ia_pd1_prefix1_vltime='7200'
The code originally does set the expiration callback with a timer of 7200, but when the rebind starts a new short timer is set (due to the interface having delegates) that calls dhcp6_failrebind, which then calls dhcp6_fail, which expires the prefix for all delegates and then goes to the init state.
Given there is a valid lease and there was no carrier drop it seems that the fall back to init state should not happen after the short timeout (i.e., the special handling mentioned in RFC 3633 section 12.1 should not apply).
So are we saying this is only an issue because the pltime is close (in your case exact) to the vltime?
I don't think the relationship between pltime and vltime changes anything here. This happens any time the REBIND state is reached and the interface has delegates.
If pltime were significantly shorter than the vltime then this behavior would still be wrong as the IP should not be removed before the vltime is reached.
Has there been any progress on this?
I think this conversation has stalled. My PR was not pulled, not sure if any other change was made that fixes this (in my opinion) wrong behavior.
@wteiken Can you test that branch please? It should fix the issue in a better way than just not failing - now we enter the discover process again.
I still need to work it so that expire is setup correctly again though.
Here, commit 5d372e8ac63a53aa1038655e91970d9d85fed277 seemingly resolves the issue of IPv6 randomly dropping completely whenever valid_lft=preferred_lft and PD expires.