talos
talos copied to clipboard
feat: dhcpv4: send current hostname, fix spec compliance of renewals
This adds support for automatically registering node hostnames in DNS by
sending the current hostname to DHCP via option 12. If the current hostname is
updated, issue a new DISCOVER to propagate the update to DHCP (updating the
hostname on lease renewals is not universally supported by DHCP servers). This
addition maintains the previous functionality where the node can also request
its hostname from the DHCP server. The received hostname will be processed and
prioritized as usual by the network.HostnameSpecController.
This change set also contains fixes to make DHCP renewals compliant with RFC
2131, specifically avoiding sending the server identifier and requested IP
address when issuing renewals using a previous offer. This also uncovered an
issue in the upstream insomniacslk/dhcp library, for which a fix is now
pending at https://github.com/insomniacslk/dhcp/pull/469. As upstream
contributions seem to have stalled, I have temporarily added a replacement for
the library to my own fork that carries the fix.
Sending hostname updates have been tested against dnsmasq and the built-in
DHCP + DNS services in Windows Server. Hostname retrieval from DHCP and edge
cases with overridden hostnames from different configuration layers have been
extensively tested against dnsmasq.
Fixes https://github.com/siderolabs/talos/issues/4536.
Acceptance
Please use the following checklist:
- [x] you linked an issue (if applicable)
- [ ] you included tests (if applicable)
- [x] you ran conformance (
make conformance) - [x] you formatted your code (
make fmt) - [x] you linted your code (
make lint) - [x] you generated documentation (
make docs) - [x] you ran unit-tests (
make unit-tests)
/ok-to-test
The CI build job seems to have failed due to a temporary issue in cloning the repo, it needs a restart.
The original (pre-review) version of the sources can be found here: https://github.com/twelho/talos/tree/dhcpv4-send-hostname-v1
The functionality of https://github.com/siderolabs/talos/pull/5897/commits/e85233f7f63036eb887766802b7dc7cb7e5e97e1 has now been tested against dnsmasq, including edge cases. No issues found.
I've split the hostname sending and requesting operations into two separate invocations to completely sidestep the DHCP server "parroting" issue, where the DHCP server just automatically replies with the hostname sent by the client without verifying that it is actually registered on the server.
The current hostname is always sent when acquiring a new lease (which happens also when the local hostname changes, e.g. due to a machine configuration edit), and the DHCP-served hostname is requested right after with an INFORM request. This avoids needing to wait for the first RENEW, which may also never happen depending on the server configuration. RENEWs will still always request the hostname to detect and apply changes done on the DHCP server side. The INFORM support additionally requires some upstream improvements, which I've submitted at https://github.com/insomniacslk/dhcp/pull/470.
I've identified a race condition when testing the INFORM request build, which has now been fixed in the latest commit. Any hostname request (or lease renewal for that matter) is performed using unicast, which requires the address sourced from DHCP to be available on the host (i.e. bound to an interface). This does not happen immediately after notifying the controller about the new address, and thus any immediate INFORM hostname requests will fail to bind to an address. After trying out a couple different approaches, I settled on a context-cancellable wait for both the address assignment and network upon lease negotiation to safely continue with unicast operations. This resolves the random failures I've been seeing when testing this at a large-ish scale.
Another issue I've spotted and subsequently fixed is the machine hostnames disappearing from DNS after a renewal. Turns out that at least Windows Server notices if the hostname is only sent during the initial negotiation and removes it from DNS. The most robust solution I came up with is always sending the hostname during request/renew operations, and extending the use of the INFORM hostname requests to lease renewals as well. So far so good.
It looks like things fail with our "test" DHCP server, but we can fix that of course. This test DHCP server was never supposed to be complete implementation.
It looks like upstreaming changes didn't quite work yet? Anything we can do to help?
It looks like things fail with our "test" DHCP server, but we can fix that of course. This test DHCP server was never supposed to be complete implementation.
It should keep working even if the server doesn't support or can't process DHCPINFORM requests, hostname request errors are non-fatal on purpose. There are some intermittent CI failures, but I haven't been able to pin-point then to this DHCP rewrite. Of course if you have found some specific failure cases I would be happy to try to fix them.
It looks like upstreaming changes didn't quite work yet? Anything we can do to help?
I will ping the maintainers once more with a message in both https://github.com/insomniacslk/dhcp/pull/469 and https://github.com/insomniacslk/dhcp/pull/470. Of course if you can voice the importance of these changes there might be higher chance of people finally reviewing and merging them, as I think I have now addressed all of the feedback (at least in the former PR) and I don't plan on doing any more changes in either.
As we're going to branch off v1.2.0 stable branch today, I moved it to the next iteration - v1.3.0
Update: https://github.com/insomniacslk/dhcp/pull/469 has been merged and I've now refactored https://github.com/insomniacslk/dhcp/pull/470 which might take one more review round. During that work I located and fixed the final long-standing bug that was causing INFORM requests to time out in Talos, so after rebasing this PR on top of master everything is now working correctly as far as I can see.
Update: insomniacslk/dhcp#469 has been merged and I've now refactored insomniacslk/dhcp#470 which might take one more review round. During that work I located and fixed the final long-standing bug that was causing INFORM requests to time out in Talos, so after rebasing this PR on top of master everything is now working correctly as far as I can see.
Thank you, this looks really promising!
@twelho I'm also very grateful for your work. I hope you find time to finish it soon. 🙂
Thanks for the reminders. There's still one more PR to expose an interface we need at https://github.com/insomniacslk/dhcp/pull/470, but seems to have completely stalled. I'll try to refresh it once more in the coming days to hopefully get a response from the maintainers and finally get this work merged.
Great job ! Any plans to finish this one? Thanks
@twelho thanks for your work, we plan to cut Talos beta release next week, and I want this PR to be merged before that.
If the upstream changes doesn't get forward in this time, I'm looking towards forking the DHCP library with your change (or using your fork) until your PR gets merged.
Alright, I feel like the DHCP library work is very close. I've think I've satisfied all of the maintainer's constraints from my side, so it's essentially down to them to just decide how INFORM requests should be constructed. I believe it would help if you could also notify them about the urgency one more time in this thread: https://github.com/insomniacslk/dhcp/pull/470#discussion_r1141479576.
If they can still make up their mind this week and get the work merged, I can rebase/update this PR right away. However, if it looks like it won't happen before next week, I can also finish up this work using my fork for the time being, but avoiding keeping the fork would be my preference as well.
With https://github.com/insomniacslk/dhcp/pull/470 (and https://github.com/insomniacslk/dhcp/pull/498 as a minor fix) finally merged (yay :tada:), this work does not need to rely on my own fork anymore, as all of my changes have now now landed upstream.
With that, I've gone through this PR one more time to remove the replace directive and clean up both the comments and logic. One more mistake was spotted and fixed, namely avoiding sending the server identifier together with INFORM requests as that is against the spec.
Note, however, that I've not yet tested this in its final form. As it's been 8 months since starting this work, I don't have access to the original test setup anymore (which also covered Windows environments), so I will need to create a new test setup for this, but I will only be able to test against dnsmasq. I would appreciate if others would also be able to confirm that this works as intended using their infrastructure.
I did some (small) changes:
- reworked our test
dhcpdto be compliant with the new renewal protocol - did some changes around the hostname change detection logic, there were some changes to Talos since the time you started your work @twelho (default hostnames), in my testing it seems to work for all cases reasonably well
- some cosmetic changes
The question of detecting "is it DHCP hostname", "should I send the hostname" is still not 100% reliable, but seems to be better now.
peer review in bullpen?
/promote integration-misc
/promote e2e
/promote e2e-aws
/promote integration-qemu
/promote e2e-aws
/promote e2e-nothing
/m