Kea DHCP HA failover for "sync-timeout": 60000 doesn't occur
Important notices
Before you add a new report, we ask you kindly to acknowledge the following:
- [x] I have read the contributing guide lines at https://github.com/opnsense/core/blob/master/CONTRIBUTING.md
- [x] I am convinced that my issue is new after having checked both open and closed issues at https://github.com/opnsense/core/issues?q=is%3Aissue
Describe the bug
OPNsense version: OPNsense 24.1.6-amd64
We have a straightforward setup with CARP configured for WAN and LAN, which is working fine. We also set up Kea DHCP with HA. When the master/primary is switched off, failover to the backup only occurs after 5-6 unacked clients and never occurs after "sync-timeout": 60000. This was tested several times.
To Reproduce
Steps to reproduce the behavior:
- Enable Kea DHCP and set up HA.
- Turn off the "primary" Kea DHCP server. The "standby" server should take over after 60 seconds (60000 milliseconds); this doesn't happen in our tests. Failover happens only after 5-6 unacked clients.
kea-ctrl-agent.conf on both servers:
{
"Control-agent": {
"http-host": "127.0.0.1",
"http-port": 8000,
"control-sockets": {
"dhcp4": {
"socket-type": "unix",
"socket-name": "/var/run/kea4-ctrl-socket"
},
"dhcp6": {
"socket-type": "unix",
"socket-name": "/var/run/kea6-ctrl-socket"
},
"d2": {
"socket-type": "unix",
"socket-name": "/var/run/kea-ddns-ctrl-socket"
}
},
"loggers": [
{
"name": "kea-ctrl-agent",
"output_options": [
{
"output": "syslog"
}
],
"severity": "INFO",
"debuglevel": 0
}
]
}
}
kea-dhcp4.conf on "primary" / master:
{
"Dhcp4": {
"valid-lifetime": 1800,
"interfaces-config": {
"interfaces": ["em0"]
},
"lease-database": {
"type": "memfile",
"persist": true
},
"control-socket": {
"socket-type": "unix",
"socket-name": "/var/run/kea4-ctrl-socket"
},
"loggers": [
{
"name": "kea-dhcp4",
"output_options": [
{
"output": "syslog"
}
],
"severity": "INFO"
}
],
"subnet4": [
{
"id": 1,
"subnet": "192.168.222.0/24",
"option-data": [
{
"name": "domain-name-servers",
"data": "192.168.222.1"
},
{
"name": "routers",
"data": "192.168.222.1"
},
{
"name": "ntp-servers",
"data": "192.168.222.1"
},
{
"name": "domain-name",
"data": "citi.intranet"
}
],
"pools": [
{ "pool": "192.168.222.20 - 192.168.222.245" }
],
"reservations": [
{
"hw-address": "[mac]",
"ip-address": "192.168.222.2",
"hostname": "OPNsense1.citi.intranet"
},
{
"hw-address": "[mac]",
"ip-address": "192.168.222.3",
"hostname": "OPNsense2.citi.intranet"
},
{
"hw-address": "[mac]",
"ip-address": "192.168.222.6",
"hostname": "srvr-2.citi.intranet"
},
{
"hw-address": "[mac]",
"ip-address": "192.168.222.5",
"hostname": "srvr-1.citi.intranet"
},
{
"hw-address": "[mac]",
"ip-address": "192.168.222.7",
"hostname": "srvr-3.citi.intranet"
}
]
}
]
,"hooks-libraries": [
{
"library": "/usr/local/lib/kea/hooks/libdhcp_lease_cmds.so",
"parameters": { }
},
{
"library": "/usr/local/lib/kea/hooks/libdhcp_ha.so",
"parameters": {
"high-availability": [ {
"this-server-name": "OPNsense1",
"mode": "hot-standby",
"heartbeat-delay": 10000,
"max-response-delay": 60000,
"max-ack-delay": 5000,
"max-unacked-clients": 5,
"sync-timeout": 60000,
"peers": [
{
"name": "OPNsense1",
"role": "primary",
"url": "http://192.168.222.2:8001/"
},
{
"name": "OPNsense2",
"role": "standby",
"url": "http://192.168.222.3:8001/"
}
]
} ]
}
}
]
}
}
kea-dhcp4.conf on "standby" / backup:
{
"Dhcp4": {
"valid-lifetime": 1800,
"interfaces-config": {
"interfaces": ["em0"]
},
"lease-database": {
"type": "memfile",
"persist": true
},
"control-socket": {
"socket-type": "unix",
"socket-name": "/var/run/kea4-ctrl-socket"
},
"loggers": [
{
"name": "kea-dhcp4",
"output_options": [
{
"output": "syslog"
}
],
"severity": "INFO"
}
],
"subnet4": [
{
"id": 1,
"subnet": "192.168.222.0/24",
"option-data": [
{
"name": "domain-name-servers",
"data": "192.168.222.1"
},
{
"name": "routers",
"data": "192.168.222.1"
},
{
"name": "ntp-servers",
"data": "192.168.222.1"
},
{
"name": "domain-name",
"data": "citi.intranet"
}
],
"pools": [
{ "pool": "192.168.222.20 - 192.168.222.245" }
],
"reservations": [
{
"hw-address": "[mac]",
"ip-address": "192.168.222.2",
"hostname": "OPNsense1.citi.intranet"
},
{
"hw-address": "[mac]",
"ip-address": "192.168.222.3",
"hostname": "OPNsense2.citi.intranet"
},
{
"hw-address": "[mac]",
"ip-address": "192.168.222.6",
"hostname": "srvr-2.citi.intranet"
},
{
"hw-address": "[mac]",
"ip-address": "192.168.222.5",
"hostname": "srvr-1.citi.intranet"
},
{
"hw-address": "[mac]",
"ip-address": "192.168.222.7",
"hostname": "srvr-3.citi.intranet"
}
]
}
]
,"hooks-libraries": [
{
"library": "/usr/local/lib/kea/hooks/libdhcp_lease_cmds.so",
"parameters": { }
},
{
"library": "/usr/local/lib/kea/hooks/libdhcp_ha.so",
"parameters": {
"high-availability": [ {
"this-server-name": "OPNsense2",
"mode": "hot-standby",
"heartbeat-delay": 10000,
"max-response-delay": 60000,
"max-ack-delay": 5000,
"max-unacked-clients": 5,
"sync-timeout": 60000,
"peers": [
{
"name": "OPNsense1",
"role": "primary",
"url": "http://192.168.222.2:8001/"
},
{
"name": "OPNsense2",
"role": "standby",
"url": "http://192.168.222.3:8001/"
}
]
} ]
}
}
]
}
}
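The two kea-dhcp4.conf files above are meant to differ only in "this-server-name". A minimal sketch for checking that, assuming both files have been loaded into Python dicts (for example with json.load); the helper names ha_settings and ha_differences are hypothetical and not part of Kea or OPNsense:

```python
def ha_settings(config):
    """Return the first HA relationship from a kea-dhcp4.conf dict, or None."""
    for hook in config["Dhcp4"].get("hooks-libraries", []):
        # Only the HA hook carries a "high-availability" parameter list.
        if hook.get("library", "").endswith("libdhcp_ha.so"):
            relationships = hook.get("parameters", {}).get("high-availability", [])
            if relationships:
                return relationships[0]
    return None


def ha_differences(primary, standby, ignore=("this-server-name",)):
    """List HA keys whose values differ between two configs.

    "this-server-name" is expected to differ, so it is ignored by default.
    """
    a = ha_settings(primary) or {}
    b = ha_settings(standby) or {}
    keys = set(a) | set(b)
    return sorted(k for k in keys if k not in ignore and a.get(k) != b.get(k))
```

An empty list from ha_differences(primary_cfg, standby_cfg) would suggest that the timers and the peer list match on both nodes.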
Expected behavior
If the "primary" Kea DHCP server is unavailable, then after 60000 milliseconds (the default "sync-timeout": 60000) failover should occur: the "standby" DHCP server should take over and start serving leases.
Relevant log files
Environment
OPNsense version: OPNsense 24.1.6-amd64
not sure if this is new, but looking at https://kea.readthedocs.io/en/latest/arm/hooks.html#hot-standby-configuration "auto-failover": true might be missing. should be rather easy to test locally.
Just tried and updated config for "peers" section on both servers to:
"peers": [
{
"name": "OPNsense1",
"role": "primary",
"url": "http://192.168.222.2:8001/",
"auto-failover": true
},
{
"name": "OPNsense2",
"role": "standby",
"url": "http://192.168.222.3:8001/",
"auto-failover": true
}
]
Then the service was restarted on both (configs were checked afterwards to ensure the OPNsense UI hadn't replaced the changes) and the "primary" machine was switched off. Unfortunately this doesn't solve the problem: failover still doesn't occur automatically on the "standby", even after waiting 10 minutes. It does occur if I restart the Kea service on the "standby" machine or if 5-6 clients are unacked.
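To watch whether the "standby" actually changes state during such a test, the HA state can be polled through the control agent. This is only a sketch under assumptions: it assumes the control agent is reachable at http://127.0.0.1:8000/ (as in the kea-ctrl-agent.conf above), that the HA hook's ha-heartbeat command is available, and that the reply carries the state under "arguments". The function names are hypothetical.

```python
import json
import urllib.request

# Assumption: control agent address taken from the kea-ctrl-agent.conf above.
CTRL_AGENT_URL = "http://127.0.0.1:8000/"


def build_heartbeat_request():
    """Build the JSON body of an ha-heartbeat command for the DHCPv4 server."""
    return {"command": "ha-heartbeat", "service": ["dhcp4"]}


def parse_ha_state(response):
    """Extract the HA state string from a control agent reply.

    The control agent is assumed to wrap answers in a list, one entry per
    service. Returns None if the command failed or no state was reported.
    """
    if not response:
        return None
    first = response[0]
    if first.get("result") != 0:
        return None
    return first.get("arguments", {}).get("state")


def query_ha_state(url=CTRL_AGENT_URL, timeout=5):
    """POST the heartbeat command and return the reported HA state."""
    body = json.dumps(build_heartbeat_request()).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return parse_ha_state(json.load(reply))
```

Calling query_ha_state() on the "standby" every few seconds while the "primary" is off should show whether the state ever leaves "hot-standby" (for example for "partner-down").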
I did some research online, and some people suggest using "max-unacked-clients": 0, but this doesn't seem like a good solution to me: you risk the "standby" taking over when the "primary" isn't truly unavailable, which might result in duplicate leases.
Unfortunately, Kea seems to be challenging, to say the least. If you have an idea of which options to add or change, just ping me.
There's not much we can do at this stage, I'm afraid (the last feature we tried to add didn't appear to be working either, for "reasons"). Kea's feature set looks large at first glance, but the functional part appears to be much smaller.
I got it working. In addition to "auto-failover": true on each peer, max-unacked-clients also needs to be set to 0. After the 60-second sync-timeout, the standby host takes over.
"parameters": {
"high-availability": [
{
"this-server-name": "opn2",
"mode": "hot-standby",
"heartbeat-delay": 10000,
"max-response-delay": 60000,
"max-ack-delay": 5000,
"max-unacked-clients": 0,
"sync-timeout": 60000,
"peers": [
{
"name": "opn1",
"role": "primary",
"url": "http:\/\/10.24.10.251:8001",
"auto-failover": true
},
{
"name": "opn2",
"role": "standby",
"url": "http:\/\/10.24.10.252:8001",
"auto-failover": true
}
]
}
]
}
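For anyone who wants to try a different "max-unacked-clients" value before it is exposed in the UI, here is a rough sketch of patching a loaded kea-dhcp4.conf. It assumes the file is plain JSON without comments (as in the dumps in this thread) and that OPNsense may regenerate the file and overwrite manual edits; set_max_unacked_clients is a hypothetical helper, not an OPNsense or Kea function.

```python
import copy

HA_LIBRARY_SUFFIX = "libdhcp_ha.so"


def set_max_unacked_clients(config, value):
    """Return a copy of a kea-dhcp4.conf dict with max-unacked-clients changed.

    The input dict is left untouched so the original can be kept as a backup.
    """
    if value < 0:
        raise ValueError("max-unacked-clients must not be negative")
    patched = copy.deepcopy(config)
    for hook in patched["Dhcp4"].get("hooks-libraries", []):
        # Skip other hooks such as libdhcp_lease_cmds.so.
        if not hook.get("library", "").endswith(HA_LIBRARY_SUFFIX):
            continue
        for relationship in hook.get("parameters", {}).get("high-availability", []):
            relationship["max-unacked-clients"] = value
    return patched
```

The result can be written back with json.dump on both nodes, followed by a Kea restart. The same value should be used on "primary" and "standby".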
@reneschuster we can add those options. Let me assign myself to this ticket and fix that when I have a bit of time available. By the way, how did you find the working combination? Just curious.
I was searching for a solution to this problem and found this article: https://byte-sized.de/linux-unix/hot-standby-mode-des-kea-dhcp-unter-freebsd-einrichten/
thanks for sharing, we can probably change the defaults on our end, but we should read into the parameters a bit before doing so.
I have known about it for a while. Let me add that if you set "max-unacked-clients": 0, any network "hiccup" will cause failover, which might not be desired. It might be much better to let users set the desired value from the UI, defaulting to 5 (the current value). For example, I would set it to 2, as I have 60 or so clients on my network and 5 is too large for me (in the majority of tests it takes too long for failover to occur; I never managed to get it working with 5). On very large networks with hundreds of clients, 5 might be OK (failover would kick in once more than 5 clients contact DHCP to renew leases). Note that no failover occurs before 5 clients contact Kea, so leases would need to expire shortly one after another for this to work, as probably no one wants to wait 30 or more minutes for failover to kick in. That's why I focused on "sync-timeout", but this parameter doesn't seem to work at all in my tests.
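To make the behaviour described in this thread easier to reason about, here is a simplified model of the takeover decision. It is not Kea's code: the names are hypothetical, the defaults are the values from the configs above, and the exact comparisons (strictly "more than" max-unacked-clients, and a client counting as unacked when its DHCP "secs" field exceeds max-ack-delay) are assumptions based on the observations reported here.

```python
from dataclasses import dataclass


@dataclass
class HAConfig:
    # Defaults mirror the values in the kea-dhcp4.conf dumps in this thread.
    max_response_delay_ms: int = 60000
    max_ack_delay_ms: int = 5000
    max_unacked_clients: int = 5


def count_unacked(client_secs, cfg):
    """Count clients whose largest seen 'secs' value exceeds max-ack-delay.

    client_secs maps a client identifier to its 'secs' value in seconds.
    """
    limit_s = cfg.max_ack_delay_ms / 1000
    return sum(1 for secs in client_secs.values() if secs > limit_s)


def should_fail_over(ms_since_partner_response, client_secs, cfg):
    """Decide whether the surviving server would take over in this model."""
    # The partner must first be silent for at least max-response-delay.
    if ms_since_partner_response < cfg.max_response_delay_ms:
        return False
    # With 0, no client evidence is required (the working setup above).
    if cfg.max_unacked_clients == 0:
        return True
    # Otherwise enough clients must have gone unanswered.
    return count_unacked(client_secs, cfg) > cfg.max_unacked_clients
```

In this model, ten minutes of silence with no retrying clients never triggers a takeover at the value 5, which matches the tests reported above, while a lower value such as 2 needs only three unanswered clients.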
I don't think we need the auto-failover parameter as it seems to be set to true by default
https://github.com/isc-projects/kea/blob/39625d2b58139f7c5e6eb7634ce314d7a2de2f6d/src/hooks/dhcp/high_availability/ha_config_parser.cc#L55
For some reason I can't find a clear list of options and their defaults at Kea's end, which would be extremely useful to have (to avoid searching through lots of source every time).
I have known about it for a while. Let me add that if you set "max-unacked-clients": 0 any network "hiccups" will cause failover which might not be desired. It might be much better to maybe allow users to set the desired value from the UI, defaulting to 5 (current value). So for example I would set it to 2 as I have 60 or so clients on my network and 5 is too large for me (in majority of tests it takes too long for failover to occur, I never managed to get it working with 5).
Since most of our deployments are smaller, it might be better to choose a lower default in that case, such as 2, while keeping it configurable so you can tweak it for your needs.
If that's the case, "2" is indeed a good default value to go with for "max-unacked-clients".