
Kea DHCP HA failover for "sync-timeout": 60000 doesn't occur

Open tom-citizencard opened this issue 1 year ago • 3 comments

Important notices

Before you add a new report, we ask you kindly to acknowledge the following:

  • [x] I have read the contributing guidelines at https://github.com/opnsense/core/blob/master/CONTRIBUTING.md
  • [x] I am convinced that my issue is new after having checked both open and closed issues at https://github.com/opnsense/core/issues?q=is%3Aissue

Describe the bug

OPNsense version: OPNsense 24.1.6-amd64

We have a straightforward setup with CARP configured for WAN and LAN, which is working fine. We also set up Kea DHCP with HA. When the master/primary is switched off, failover to the backup occurs only after 5-6 unacked clients and never after the "sync-timeout": 60000 interval elapses. This was tested a few times.

To Reproduce

Steps to reproduce the behavior:

  1. Enable Kea DHCP and set up HA.
  2. Turn off the "primary" Kea DHCP server. The "standby" server should take over after 60 seconds (60000 milliseconds), but this doesn't happen in our tests. The failover happens only after 5-6 unacked clients.
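
To watch what the standby does during step 2, its HA state can be polled through the control agent. The following is a minimal Python sketch, assuming the agent listens on 127.0.0.1:8000 as in the kea-ctrl-agent.conf from this report. `build_command` and `send_command` are hypothetical helper names; "status-get" and "ha-heartbeat" are Kea commands, but the layout of their replies depends on the Kea version, so the sketch returns the decoded JSON without interpreting it.

```python
import json
import urllib.request

# Assumption: the control agent address from kea-ctrl-agent.conf in this report.
CTRL_AGENT_URL = "http://127.0.0.1:8000/"


def build_command(command: str, service: str = "dhcp4") -> bytes:
    """Encode a Kea control command as a JSON request body."""
    return json.dumps({"command": command, "service": [service]}).encode("utf-8")


def send_command(command: str, url: str = CTRL_AGENT_URL, timeout: float = 5.0):
    """POST a command to the control agent and return the decoded JSON reply."""
    request = urllib.request.Request(
        url,
        data=build_command(command),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Running `print(send_command("status-get"))` (or `"ha-heartbeat"`) on the standby every few seconds while the primary is off should show whether, and when, the reported state changes.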

kea-ctrl-agent.conf on both servers:

{
"Control-agent": {
    "http-host": "127.0.0.1",
    "http-port": 8000,
    "control-sockets": {
        "dhcp4": {
            "socket-type": "unix",
            "socket-name": "/var/run/kea4-ctrl-socket"
        },
        "dhcp6": {
            "socket-type": "unix",
            "socket-name": "/var/run/kea6-ctrl-socket"
        },
        "d2": {
            "socket-type": "unix",
            "socket-name": "/var/run/kea-ddns-ctrl-socket"
        }
    },
    "loggers": [
    {
        "name": "kea-ctrl-agent",
        "output_options": [
            {
                "output": "syslog"
            }
        ],
        "severity": "INFO",
        "debuglevel": 0
    }
  ]
}
}

kea-dhcp4.conf on "primary" / master:

{
    "Dhcp4": {
        "valid-lifetime": 1800,
        "interfaces-config": {
            "interfaces": ["em0"]
        },
        "lease-database": {
            "type": "memfile",
            "persist": true
        },
        "control-socket": {
            "socket-type": "unix",
            "socket-name": "/var/run/kea4-ctrl-socket"
        },
        "loggers": [
            {
                "name": "kea-dhcp4",
                "output_options": [
                    {
                        "output": "syslog"
                    }
                ],
                "severity": "INFO"
            }
        ],
        "subnet4": [
            {
                "id": 1,
                "subnet": "192.168.222.0/24",
                "option-data": [
                    {
                        "name": "domain-name-servers",
                        "data": "192.168.222.1"
                    },
                    {
                        "name": "routers",
                        "data": "192.168.222.1"
                    },
                    {
                        "name": "ntp-servers",
                        "data": "192.168.222.1"
                    },
                    {
                        "name": "domain-name",
                        "data": "citi.intranet"
                    }
                ],
                "pools": [
                    { "pool": "192.168.222.20 - 192.168.222.245" }
                ],
                "reservations": [
                    {
                        "hw-address": "[mac]",
                        "ip-address": "192.168.222.2",
                        "hostname": "OPNsense1.citi.intranet"
                    },
                    {
                        "hw-address": "[mac]",
                        "ip-address": "192.168.222.3",
                        "hostname": "OPNsense2.citi.intranet"
                    },
                    {
                        "hw-address": "[mac]",
                        "ip-address": "192.168.222.6",
                        "hostname": "srvr-2.citi.intranet"
                    },
                    {
                        "hw-address": "[mac]",
                        "ip-address": "192.168.222.5",
                        "hostname": "srvr-1.citi.intranet"
                    },
                    {
                        "hw-address": "[mac]",
                        "ip-address": "192.168.222.7",
                        "hostname": "srvr-3.citi.intranet"
                    }
                ]
            }
        ]
        ,"hooks-libraries": [
            {
                "library": "/usr/local/lib/kea/hooks/libdhcp_lease_cmds.so",
                "parameters": { }
            },
            {
                "library": "/usr/local/lib/kea/hooks/libdhcp_ha.so",
                "parameters": {
                    "high-availability": [ {
                        "this-server-name": "OPNsense1",
                        "mode": "hot-standby",
                        "heartbeat-delay": 10000,
                        "max-response-delay": 60000,
                        "max-ack-delay": 5000,
                        "max-unacked-clients": 5,
                        "sync-timeout": 60000,
                        "peers": [
                            {
                                "name": "OPNsense1",
                                "role": "primary",
                                "url": "http://192.168.222.2:8001/"
                            },
                            {
                                "name": "OPNsense2",
                                "role": "standby",
                                "url": "http://192.168.222.3:8001/"
                            }
                        ]
                    } ]
                }
            }
        ]
    }
}

kea-dhcp4.conf on "standby" / backup:

{
    "Dhcp4": {
        "valid-lifetime": 1800,
        "interfaces-config": {
            "interfaces": ["em0"]
        },
        "lease-database": {
            "type": "memfile",
            "persist": true
        },
        "control-socket": {
            "socket-type": "unix",
            "socket-name": "/var/run/kea4-ctrl-socket"
        },
        "loggers": [
            {
                "name": "kea-dhcp4",
                "output_options": [
                    {
                        "output": "syslog"
                    }
                ],
                "severity": "INFO"
            }
        ],
        "subnet4": [
            {
                "id": 1,
                "subnet": "192.168.222.0/24",
                "option-data": [
                    {
                        "name": "domain-name-servers",
                        "data": "192.168.222.1"
                    },
                    {
                        "name": "routers",
                        "data": "192.168.222.1"
                    },
                    {
                        "name": "ntp-servers",
                        "data": "192.168.222.1"
                    },
                    {
                        "name": "domain-name",
                        "data": "citi.intranet"
                    }
                ],
                "pools": [
                    { "pool": "192.168.222.20 - 192.168.222.245" }
                ],
                "reservations": [
                    {
                        "hw-address": "[mac]",
                        "ip-address": "192.168.222.2",
                        "hostname": "OPNsense1.citi.intranet"
                    },
                    {
                        "hw-address": "[mac]",
                        "ip-address": "192.168.222.3",
                        "hostname": "OPNsense2.citi.intranet"
                    },
                    {
                        "hw-address": "[mac]",
                        "ip-address": "192.168.222.6",
                        "hostname": "srvr-2.citi.intranet"
                    },
                    {
                        "hw-address": "[mac]",
                        "ip-address": "192.168.222.5",
                        "hostname": "srvr-1.citi.intranet"
                    },
                    {
                        "hw-address": "[mac]",
                        "ip-address": "192.168.222.7",
                        "hostname": "srvr-3.citi.intranet"
                    }
                ]
            }
        ]
        ,"hooks-libraries": [
            {
                "library": "/usr/local/lib/kea/hooks/libdhcp_lease_cmds.so",
                "parameters": { }
            },
            {
                "library": "/usr/local/lib/kea/hooks/libdhcp_ha.so",
                "parameters": {
                    "high-availability": [ {
                        "this-server-name": "OPNsense2",
                        "mode": "hot-standby",
                        "heartbeat-delay": 10000,
                        "max-response-delay": 60000,
                        "max-ack-delay": 5000,
                        "max-unacked-clients": 5,
                        "sync-timeout": 60000,
                        "peers": [
                            {
                                "name": "OPNsense1",
                                "role": "primary",
                                "url": "http://192.168.222.2:8001/"
                            },
                            {
                                "name": "OPNsense2",
                                "role": "standby",
                                "url": "http://192.168.222.3:8001/"
                            }
                        ]
                    } ]
                }
            }
        ]
    }
}

Expected behavior

If the "primary" Kea DHCP server is unavailable, failover should occur after 60000 milliseconds (the default "sync-timeout": 60000), and the "standby" server should take over and start serving leases.
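
As far as the Kea Administrator Reference Manual describes it, "sync-timeout" only bounds lease-database synchronization between the peers, while partner failure detection is driven by "max-response-delay", "max-ack-delay" and "max-unacked-clients". The Python sketch below is a simplified model of that decision, not Kea's actual code. The `HAConfig` class, the `partner_down` function and the exact boundary comparisons are assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class HAConfig:
    # Values from the kea-dhcp4.conf in this report, in milliseconds.
    max_response_delay: int = 60000
    max_ack_delay: int = 5000
    max_unacked_clients: int = 5


def partner_down(cfg: HAConfig, ms_since_last_contact: int, client_secs: list[int]) -> bool:
    """Return True if this simplified model would move to partner-down.

    client_secs holds, per distinct client, the "secs" value (in seconds)
    from the latest DHCP message the surviving server saw from that client.
    """
    # Step 1: communication only counts as interrupted after max-response-delay.
    if ms_since_last_contact <= cfg.max_response_delay:
        return False
    # Step 2: with max-unacked-clients 0, client traffic is not examined at all.
    if cfg.max_unacked_clients == 0:
        return True
    # Step 3: a client is "unacked" once it has been retrying longer than max-ack-delay.
    unacked = sum(1 for secs in client_secs if secs * 1000 > cfg.max_ack_delay)
    # Assumption: the count has to exceed the threshold, not merely reach it.
    return unacked > cfg.max_unacked_clients
```

Under this model, `partner_down(HAConfig(), 600_000, [])` stays false after ten minutes of silence from the partner, which matches the behavior reported here, and it turns true only once six clients have been retrying for more than five seconds.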

Relevant log files

log.txt

Environment

OPNsense version: OPNsense 24.1.6-amd64

May 15 '24 13:05 tom-citizencard

Not sure if this is new, but looking at https://kea.readthedocs.io/en/latest/arm/hooks.html#hot-standby-configuration, "auto-failover": true might be missing. It should be rather easy to test locally.

May 15 '24 13:05 AdSchellevis

> Not sure if this is new, but looking at https://kea.readthedocs.io/en/latest/arm/hooks.html#hot-standby-configuration, "auto-failover": true might be missing. It should be rather easy to test locally.

I just tried this and updated the "peers" section of the config on both servers to:

"peers": [
    {
        "name": "OPNsense1",
        "role": "primary",
        "url": "http://192.168.222.2:8001/",
        "auto-failover": true
    },
    {
        "name": "OPNsense2",
        "role": "standby",
        "url": "http://192.168.222.3:8001/",
        "auto-failover": true
    }
]

The service was then restarted on both servers (the configs were checked afterwards to ensure the OPNsense UI hadn't overwritten the changes), and the "primary" machine was switched off. Unfortunately, this doesn't solve the problem: failover still doesn't occur automatically on the "standby", even after waiting 10 minutes. It does occur if I restart the Kea service on the "standby" machine or once 5-6 clients are unacked.

I did some research online, and some people suggest using "max-unacked-clients": 0, but this doesn't seem like a good solution to me: you risk the "standby" taking over when the "primary" isn't truly unavailable, which might result in duplicate leases.

May 15 '24 15:05 tom-citizencard

Unfortunately, Kea seems to be challenging, to say the least. If you have an idea of which options to add or change, just ping me.

There's not much we can do at this stage, I'm afraid (the last feature we tried to add didn't appear to work either, for "reasons"). Kea's feature set looks large at first glance, but the functional part appears to be much smaller.

May 15 '24 15:05 AdSchellevis

I got it working. In addition to "auto-failover": true on each peer, max-unacked-clients also needs to be set to 0. After the 60-second sync-timeout, the standby host takes over.

                "parameters": {
                    "high-availability": [
                        {
                            "this-server-name": "opn2",
                            "mode": "hot-standby",
                            "heartbeat-delay": 10000,
                            "max-response-delay": 60000,
                            "max-ack-delay": 5000,
                            "max-unacked-clients": 0,
                            "sync-timeout": 60000,
                            "peers": [
                                {
                                    "name": "opn1",
                                    "role": "primary",
                                    "url": "http:\/\/10.24.10.251:8001",
                                    "auto-failover": true
                                },
                                {
                                    "name": "opn2",
                                    "role": "standby",
                                    "url": "http:\/\/10.24.10.252:8001",
                                    "auto-failover": true
                                }
                            ]
                        }
                    ]
                }

Sep 23 '24 13:09 reneschuster

@reneschuster we can add those options. Let me assign myself to this ticket and fix that when I have a bit of time available. By the way, how did you find the working combination? Just curious.

Sep 23 '24 14:09 AdSchellevis

I was searching for a solution to this problem and found this article: https://byte-sized.de/linux-unix/hot-standby-mode-des-kea-dhcp-unter-freebsd-einrichten/

Sep 23 '24 14:09 reneschuster

Thanks for sharing. We can probably change the defaults on our end, but we should read into the parameters a bit before doing so.

Sep 23 '24 14:09 AdSchellevis

I have known about it for a while. Let me add that if you set "max-unacked-clients": 0, any network "hiccup" will cause failover, which might not be desired. It might be much better to let users set the desired value from the UI, defaulting to 5 (the current value). For example, I would set it to 2, as I have 60 or so clients on my network and 5 is too large for me (in the majority of tests it takes too long for failover to occur; I never managed to get it working with 5). On very large networks with hundreds of clients, 5 might be OK (failover would kick in once more than 5 clients contact DHCP to renew leases). Note that no failover occurs before 5 clients contact Kea, so leases would need to expire shortly one after another for this to work, as probably no one wants to wait 30 or more minutes for failover to kick in. That's why I focused on "sync-timeout", but this parameter doesn't seem to work at all in my tests.
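
The dependence on client count and threshold can be put into rough numbers. The sketch below is a back-of-the-envelope estimate, not a description of Kea's behavior: it assumes, purely for illustration, that each client's next broadcast to the DHCP server falls uniformly at random within one period (here the "valid-lifetime" of 1800 seconds), independently of the other clients, and it ignores "max-response-delay" and "max-ack-delay". `expected_wait_seconds` is a hypothetical helper name.

```python
def expected_wait_seconds(clients: int, period_s: float, needed: int) -> float:
    """Expected time until `needed` of `clients` have broadcast at least once.

    Uses the mean of the `needed`-th order statistic of `clients`
    independent uniform arrival times on [0, period_s].
    """
    if not 1 <= needed <= clients:
        raise ValueError("needed must be between 1 and the number of clients")
    return period_s * needed / (clients + 1)


# 60 clients, 1800 s period; "needed" is max-unacked-clients + 1 under the
# assumption that the unacked count must exceed the threshold.
wait_with_5 = expected_wait_seconds(60, 1800, 6)
wait_with_2 = expected_wait_seconds(60, 1800, 3)
```

Under these assumptions, 60 clients need roughly 177 seconds on average to produce six unacked clients and roughly 89 seconds to produce three, so lowering the threshold from 5 to 2 about halves the wait, and a smaller network waits proportionally longer.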

Sep 23 '24 14:09 tom-citizencard

I don't think we need the auto-failover parameter as it seems to be set to true by default

https://github.com/isc-projects/kea/blob/39625d2b58139f7c5e6eb7634ce314d7a2de2f6d/src/hooks/dhcp/high_availability/ha_config_parser.cc#L55

For some reason I can't find a clear list of options and their defaults at Kea's end, which would be extremely useful to have (to avoid searching through lots of source every time).

Sep 26 '24 08:09 AdSchellevis

> I have known about it for a while. Let me add that if you set "max-unacked-clients": 0, any network "hiccup" will cause failover, which might not be desired. It might be much better to let users set the desired value from the UI, defaulting to 5 (the current value). For example, I would set it to 2, as I have 60 or so clients on my network and 5 is too large for me (in the majority of tests it takes too long for failover to occur; I never managed to get it working with 5).

Since most of our deployments are smaller, it might be better to choose a lower default in that case, such as 2, while keeping it configurable so you can tweak it for your needs.
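
A configurable value with a default of 2 could look roughly like the following Python sketch. It is only an illustration of the idea, not OPNsense's actual configuration generator: `build_ha_parameters` and `PEERS` are hypothetical names, and the remaining values are copied from the configs posted in this issue.

```python
from __future__ import annotations

import json


def build_ha_parameters(this_server: str, peers: list[dict], max_unacked_clients: int = 2) -> dict:
    """Build the "parameters" object for libdhcp_ha.so in hot-standby mode."""
    if max_unacked_clients < 0:
        raise ValueError("max-unacked-clients must be 0 or greater")
    if this_server not in {peer["name"] for peer in peers}:
        raise ValueError("this-server-name must match one of the peers")
    return {
        "high-availability": [
            {
                "this-server-name": this_server,
                "mode": "hot-standby",
                "heartbeat-delay": 10000,
                "max-response-delay": 60000,
                "max-ack-delay": 5000,
                # User-tunable; 2 is the default proposed in this thread.
                "max-unacked-clients": max_unacked_clients,
                "sync-timeout": 60000,
                "peers": peers,
            }
        ]
    }


# Peers taken from the configs posted in this issue.
PEERS = [
    {"name": "OPNsense1", "role": "primary", "url": "http://192.168.222.2:8001/"},
    {"name": "OPNsense2", "role": "standby", "url": "http://192.168.222.3:8001/"},
]
print(json.dumps(build_ha_parameters("OPNsense2", PEERS), indent=4))
```

Passing `max_unacked_clients=0` reproduces the combination reported as working above, while larger values keep the client-traffic check for bigger networks.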

Sep 26 '24 08:09 AdSchellevis

> Since most of our deployments are smaller, it might be better to choose a lower default in that case, such as 2, while keeping it configurable so you can tweak it for your needs.

If that's the case, "2" is indeed a good default value to go with for "max-unacked-clients".

Sep 26 '24 08:09 tom-citizencard