dhcpcd icon indicating copy to clipboard operation
dhcpcd copied to clipboard

Failure to renew with "ps_root_recvmsg: Network is unreachable"

Open JognSmit opened this issue 11 months ago • 5 comments

dhcpcd 10.1.0

I have 2 interfaces. eno5 is connected to fiber provider and eno8.2 to starlink. Once in a while, "something" happens dhcpcd during renewal process of eno5 address drops default route through it, which is followed by a number of "ps_root_recvmsg: Network is unreachable" messages and failure to get a new lease. Insisting on renewing lease by running dhcpcd -n eno5 resolves problem

2025-01-22T12:26:13.832108-08:00 router dhcpcd[974484]: eno8.2: renewing lease of 100.xx.xxx.xxx
2025-01-22T12:26:13.832984-08:00 router dhcpcd[974484]: eno8.2: sending REQUEST (xid 0x788daebc), next in 4.2 seconds
2025-01-22T12:26:13.833329-08:00 router dhcpcd[974484]: eno8.2: acknowledged 100.xx.xxx.xxx from 100.64.0.1
2025-01-22T12:26:13.833561-08:00 router dhcpcd[974484]: eno8.2: leased 100.xx.xxx.xxx for 300 seconds
2025-01-22T12:26:13.833915-08:00 router dhcpcd[974484]: eno8.2: renew in 150 seconds, rebind in 262 seconds
2025-01-22T12:26:13.834133-08:00 router dhcpcd[974484]: eno8.2: writing lease: /var/lib/dhcpcd/eno8.2.lease
2025-01-22T12:26:13.834336-08:00 router dhcpcd[974484]: eno8.2: adding IP address 100.xx.xxx.xxx/10 broadcast 100.127.255.255
2025-01-22T12:26:13.834547-08:00 router dhcpcd[974484]: eno8.2: executing: /usr/lib/dhcpcd/dhcpcd-run-hooks RENEW
2025-01-22T12:28:43.455886-08:00 router dhcpcd[974484]: eno5: renewing lease of 47.yyy.yyy.78
2025-01-22T12:28:43.456488-08:00 router dhcpcd[974484]: eno5: sending REQUEST (xid 0xcc8b9d2d), next in 4.4 seconds
2025-01-22T12:28:43.834049-08:00 router dhcpcd[974484]: eno8.2: renewing lease of 100.xx.xxx.xxx
2025-01-22T12:28:43.840894-08:00 router dhcpcd[974484]: eno8.2: sending REQUEST (xid 0x56b9244), next in 3.0 seconds
2025-01-22T12:28:43.841049-08:00 router dhcpcd[974484]: eno8.2: acknowledged 100.xx.xxx.xxx from 100.64.0.1
2025-01-22T12:28:43.841186-08:00 router dhcpcd[974484]: eno8.2: leased 100.xx.xxx.xxx for 300 seconds
2025-01-22T12:28:43.841315-08:00 router dhcpcd[974484]: eno8.2: renew in 150 seconds, rebind in 262 seconds
2025-01-22T12:28:43.841448-08:00 router dhcpcd[974484]: eno8.2: writing lease: /var/lib/dhcpcd/eno8.2.lease
2025-01-22T12:28:43.841591-08:00 router dhcpcd[974484]: eno8.2: adding IP address 100.xx.xxx.xxx/10 broadcast 100.127.255.255
2025-01-22T12:28:43.841714-08:00 router dhcpcd[974484]: eno5: deleting route to 47.yyy.yyy.0/23
2025-01-22T12:28:43.841889-08:00 router dhcpcd[974484]: eno5: deleting default route via 47.yyy.yyy.1
2025-01-22T12:28:43.842030-08:00 router dhcpcd[974484]: eno8.2: executing: /usr/lib/dhcpcd/dhcpcd-run-hooks RENEW
2025-01-22T12:28:47.831178-08:00 router dhcpcd[974484]: eno5: sending REQUEST (xid 0xcc8b9d2d), next in 7.9 seconds
2025-01-22T12:28:47.831530-08:00 router dhcpcd[974484]: ps_root_recvmsg: Network is unreachable
2025-01-22T12:28:55.695001-08:00 router dhcpcd[974484]: eno5: sending REQUEST (xid 0xcc8b9d2d), next in 15.7 seconds
2025-01-22T12:28:55.695600-08:00 router dhcpcd[974484]: ps_root_recvmsg: Network is unreachable
2025-01-22T12:29:11.354362-08:00 router dhcpcd[974484]: eno5: sending REQUEST (xid 0xcc8b9d2d), next in 31.5 seconds
2025-01-22T12:29:11.354965-08:00 router dhcpcd[974484]: ps_root_recvmsg: Network is unreachable
2025-01-22T12:29:42.858769-08:00 router dhcpcd[974484]: eno5: sending REQUEST (xid 0xcc8b9d2d), next in 63.8 seconds
2025-01-22T12:29:42.859655-08:00 router dhcpcd[974484]: ps_root_recvmsg: Network is unreachable
2025-01-22T12:29:43.462247-08:00 router dhcpcd[974484]: control command: dhcpcd -n eno5
2025-01-22T12:29:43.463884-08:00 router dhcpcd[974484]: eno5: executing: /usr/lib/dhcpcd/dhcpcd-run-hooks RECONFIGURE
2025-01-22T12:29:43.476871-08:00 router dhcpcd[974484]: eno5: IAID ef:0c:9f:dc
2025-01-22T12:29:43.477024-08:00 router dhcpcd[974484]: eno5: delaying IPv4 for 0.3 seconds
2025-01-22T12:29:43.764028-08:00 router dhcpcd[974484]: eno5: reading lease: /var/lib/dhcpcd/eno5.lease
2025-01-22T12:29:43.764338-08:00 router dhcpcd[974484]: eno5: rebinding lease of 47.yyy.yyy.78
2025-01-22T12:29:43.764462-08:00 router dhcpcd[974484]: eno5: sending REQUEST (xid 0x164683cc), next in 3.6 seconds
2025-01-22T12:29:43.764784-08:00 router dhcpcd[974484]: eno5: spawned BPF BOOTP on PID 1473641
2025-01-22T12:29:44.054601-08:00 router dhcpcd[974484]: eno5: acknowledged 47.yyy.yyy.78 from 74.42.148.118
2025-01-22T12:29:44.054879-08:00 router dhcpcd[974484]: eno5: leased 47.yyy.yyy.78 for 1800 seconds
2025-01-22T12:29:44.055092-08:00 router dhcpcd[974484]: eno5: renew in 900 seconds, rebind in 1575 seconds
2025-01-22T12:29:44.055273-08:00 router dhcpcd[974484]: eno5: writing lease: /var/lib/dhcpcd/eno5.lease
2025-01-22T12:29:44.055444-08:00 router dhcpcd[974484]: eno5: adding IP address 47.yyy.yyy.78/23 broadcast 47.yyy.zzz.255
2025-01-22T12:29:44.055566-08:00 router dhcpcd[974484]: eno5: adding route to 47.yyy.yyy.0/23
2025-01-22T12:29:44.055694-08:00 router dhcpcd[974484]: eno5: adding default route via 47.yyy.yyy.1
2025-01-22T12:29:44.055809-08:00 router dhcpcd[974484]: eno5: executing: /usr/lib/dhcpcd/dhcpcd-run-hooks RENEW
2025-01-22T12:29:44.094381-08:00 router dhcpcd[974484]: eno5: BPF BOOTP exited from PID 1473641

config is:

allowinterfaces eno8.2 eno5
debug
#hostname
duid
vendorclassid
option domain_name_servers, domain_name, domain_search
option classless_static_routes
option interface_mtu
option host_name
nogateway
noipv6rs
noipv4ll
option rapid_commit
require dhcp_server_identifier
nohook resolv.conf
interface eno5
        gateway
        noarp

JognSmit avatar Jan 22 '25 21:01 JognSmit

Here's the interesting section:

2025-01-22T12:28:43.834049-08:00 router dhcpcd[974484]: eno8.2: renewing lease of 100.xx.xxx.xxx
2025-01-22T12:28:43.840894-08:00 router dhcpcd[974484]: eno8.2: sending REQUEST (xid 0x56b9244), next in 3.0 seconds
2025-01-22T12:28:43.841049-08:00 router dhcpcd[974484]: eno8.2: acknowledged 100.xx.xxx.xxx from 100.64.0.1
2025-01-22T12:28:43.841186-08:00 router dhcpcd[974484]: eno8.2: leased 100.xx.xxx.xxx for 300 seconds
2025-01-22T12:28:43.841315-08:00 router dhcpcd[974484]: eno8.2: renew in 150 seconds, rebind in 262 seconds
2025-01-22T12:28:43.841448-08:00 router dhcpcd[974484]: eno8.2: writing lease: /var/lib/dhcpcd/eno8.2.lease
2025-01-22T12:28:43.841591-08:00 router dhcpcd[974484]: eno8.2: adding IP address 100.xx.xxx.xxx/10 broadcast 100.127.255.255
2025-01-22T12:28:43.841714-08:00 router dhcpcd[974484]: eno5: deleting route to 47.yyy.yyy.0/23
2025-01-22T12:28:43.841889-08:00 router dhcpcd[974484]: eno5: deleting default route via 47.yyy.yyy.1
2025-01-22T12:28:43.842030-08:00 router dhcpcd[974484]: eno8.2: executing: /usr/lib/dhcpcd/dhcpcd-run-hooks RENEW
2025-01-22T12:28:47.831178-08:00 router dhcpcd[974484]: eno5: sending REQUEST (xid 0xcc8b9d2d), next in 7.9 seconds
2025-01-22T12:28:47.831530-08:00 router dhcpcd[974484]: ps_root_recvmsg: Network is unreachable

eno8.2 starts a renew and when the lease is applied the subnet and default routs are deleted from eno5 for reasons unknown - that is the bug. This causes a renew failure on eno5. eno5 should recover when T2 is reached as that will trigger a REBIND which the OP shows works when done manually. However this results in a window of 600 seconds without a useable config which isn't ideal.

rsmarples avatar Jan 26 '25 16:01 rsmarples

is there a way that I can collect more information that will clarify situation ? I had hopes that enabling debug will be more informative in order to understand what happens.

JognSmit avatar Jan 26 '25 17:01 JognSmit

It’s pretty clear what is happening, just not the why. Is one of the interfaces a VLAN and is it on the other interface or a different one?

rsmarples avatar Jan 26 '25 18:01 rsmarples

different interfaces. one with vlan (eno8.2) connected via switch to starlink eno5 going directly to ONT

by "clarify" i meant to see "why". some magical extra-extra-debug setting that dumps some decision making log messages...

JognSmit avatar Jan 26 '25 18:01 JognSmit

it seems to happen every 2-3 days. is there anything i can do in order to gather some additional information that could help to resolve it ?

JognSmit avatar Feb 03 '25 17:02 JognSmit

Here's the interesting section:

2025-01-22T12:28:43.834049-08:00 router dhcpcd[974484]: eno8.2: renewing lease of 100.xx.xxx.xxx
2025-01-22T12:28:43.840894-08:00 router dhcpcd[974484]: eno8.2: sending REQUEST (xid 0x56b9244), next in 3.0 seconds
2025-01-22T12:28:43.841049-08:00 router dhcpcd[974484]: eno8.2: acknowledged 100.xx.xxx.xxx from 100.64.0.1
2025-01-22T12:28:43.841186-08:00 router dhcpcd[974484]: eno8.2: leased 100.xx.xxx.xxx for 300 seconds
2025-01-22T12:28:43.841315-08:00 router dhcpcd[974484]: eno8.2: renew in 150 seconds, rebind in 262 seconds
2025-01-22T12:28:43.841448-08:00 router dhcpcd[974484]: eno8.2: writing lease: /var/lib/dhcpcd/eno8.2.lease
2025-01-22T12:28:43.841591-08:00 router dhcpcd[974484]: eno8.2: adding IP address 100.xx.xxx.xxx/10 broadcast 100.127.255.255
2025-01-22T12:28:43.841714-08:00 router dhcpcd[974484]: eno5: deleting route to 47.yyy.yyy.0/23
2025-01-22T12:28:43.841889-08:00 router dhcpcd[974484]: eno5: deleting default route via 47.yyy.yyy.1
2025-01-22T12:28:43.842030-08:00 router dhcpcd[974484]: eno8.2: executing: /usr/lib/dhcpcd/dhcpcd-run-hooks RENEW
2025-01-22T12:28:47.831178-08:00 router dhcpcd[974484]: eno5: sending REQUEST (xid 0xcc8b9d2d), next in 7.9 seconds
2025-01-22T12:28:47.831530-08:00 router dhcpcd[974484]: ps_root_recvmsg: Network is unreachable

eno8.2 starts a renew and when the lease is applied the subnet and default routs are deleted from eno5 for reasons unknown - that is the bug. This causes a renew failure on eno5. eno5 should recover when T2 is reached as that will trigger a REBIND which the OP shows works when done manually. However this results in a window of 600 seconds without a useable config which isn't ideal.

you actually missed a couple of lines above

2025-01-22T12:28:43.455886-08:00 router dhcpcd[974484]: eno5: renewing lease of 47.yyy.yyy.78 2025-01-22T12:28:43.456488-08:00 router dhcpcd[974484]: eno5: sending REQUEST (xid 0xcc8b9d2d), next in 4.4 second

eno5 also starts renew half a second before. it's just eno8.2 lease is renewed quicker even though it starts later (in all failures that i have eno5 starts renewing within second or so of eno8.2).

i tried to go through code, but my C is very-very rusty. can there be case that because eno5 is in process of renewal when rt_build runs after completion of lease renewal of eno8.2 (if i correctly understand the code), it decides to clean up some of it routes ?

JognSmit avatar Aug 15 '25 04:08 JognSmit

Hi, As I wrote in previous comment problem happens when there are 2 renew processes run within same time interval (i have 150sec renew on startlink and 900sec on frontier. so they frequently overlap).

I tried to dig through code. Can there be case that when rt_build runs for eno8.2, when it collects current routes inet_getroutes won't return routes belonging to eno5 because it's in DHS_RENEW state and because of it they will be removed ?

JognSmit avatar Aug 16 '25 03:08 JognSmit

@rsmarples any chance my theory from above is right and fix will be to add interface in DHS_RENEW state to be collected ?

JognSmit avatar Nov 19 '25 19:11 JognSmit

For the IPv4 case routes are generated based on the STATE_ADDED flag which is separate from the DHCP engine state.

I've had another quick look through the code and it's still not obvious to me how it's being lost from the logs you've provided.

rsmarples avatar Nov 20 '25 13:11 rsmarples

@rsmarples you obviously familiar with code better.

so i'll just explain what i think i see in more details, to see if it makes any sense for you

I went back from "deleting route" messages which are logged in "rt_delete". The only place that rt_delete is called from is from rt_build where during comparison of newly build tables of routes (after lease was obtained or renewed) and old one (the code under the comment "/* Remove old routes we used to manage. */")

New list of routes generate in the beginning by calling inet_getroutes which calls inet_dhcproutes per interface

	TAILQ_FOREACH(ifp, ctx->ifaces, next) {
		if (!ifp->active)
			continue;
		if (inet_dhcproutes(routes, ifp, &have_default) == -1)
			return false;

In the beginning of the inet_dhcproutes there is code

	if (state == NULL || state->state != DHS_BOUND || !state->added)
		return 0;

at this point of time system is in state (if i understand correctly) where

eno8.2 state is DHS_BOUND. it routes are collected eno5 state is DHS_RENEW. it routes are not collected and nuked later by rt_build.

the check for STATE_ADDED is a couple of functions later in dhcp_get_routes > STATE_ADDED, but eno5 interface never gets there as inet_dhcproutes exist in the beginning.

JognSmit avatar Nov 20 '25 21:11 JognSmit

@JognSmit thanks for pointing out how dumb I was for so many months! We should not be checking DHCP state there at all. Hopefully the above patch/branch fixes it for you.

rsmarples avatar Nov 21 '25 11:11 rsmarples