talos icon indicating copy to clipboard operation
talos copied to clipboard

feat: dhcpv4: send current hostname, fix spec compliance of renewals

Open twelho opened this issue 3 years ago • 12 comments

This adds support for automatically registering node hostnames in DNS by sending the current hostname to DHCP via option 12. If the current hostname is updated, issue a new DISCOVER to propagate the update to DHCP (updating the hostname on lease renewals is not universally supported by DHCP servers). This addition maintains the previous functionality where the node can also request its hostname from the DHCP server. The received hostname will be processed and prioritized as usual by the network.HostnameSpecController.

This change set also contains fixes to make DHCP renewals compliant with RFC 2131, specifically avoiding sending the server identifier and requested IP address when issuing renewals using a previous offer. This also uncovered an issue in the upstream insomniacslk/dhcp library, for which a fix is now pending at https://github.com/insomniacslk/dhcp/pull/469. As upstream contributions seem to have stalled, I have temporarily added a replacement for the library to my own fork that carries the fix.

Sending hostname updates have been tested against dnsmasq and the built-in DHCP + DNS services in Windows Server. Hostname retrieval from DHCP and edge cases with overridden hostnames from different configuration layers have been extensively tested against dnsmasq.

Fixes https://github.com/siderolabs/talos/issues/4536.

Acceptance

Please use the following checklist:

  • [x] you linked an issue (if applicable)
  • [ ] you included tests (if applicable)
  • [x] you ran conformance (make conformance)
  • [x] you formatted your code (make fmt)
  • [x] you linted your code (make lint)
  • [x] you generated documentation (make docs)
  • [x] you ran unit-tests (make unit-tests)

twelho avatar Jul 12 '22 12:07 twelho

/ok-to-test

smira avatar Jul 12 '22 12:07 smira

The CI build job seems to have failed due to a temporary issue in cloning the repo, it needs a restart.

twelho avatar Jul 13 '22 11:07 twelho

The original (pre-review) version of the sources can be found here: https://github.com/twelho/talos/tree/dhcpv4-send-hostname-v1

twelho avatar Jul 13 '22 11:07 twelho

The functionality of https://github.com/siderolabs/talos/pull/5897/commits/e85233f7f63036eb887766802b7dc7cb7e5e97e1 has now been tested against dnsmasq, including edge cases. No issues found.

twelho avatar Jul 13 '22 12:07 twelho

I've split the hostname sending and requesting operations into two separate invocations to completely sidestep the DHCP server "parroting" issue, where the DHCP server just automatically replies with the hostname sent by the client without verifying that it is actually registered on the server.

The current hostname is always sent when acquiring a new lease (which happens also when the local hostname changes, e.g. due to a machine configuration edit), and the DHCP-served hostname is requested right after with an INFORM request. This avoids needing to wait for the first RENEW, which may also never happen depending on the server configuration. RENEWs will still always request the hostname to detect and apply changes done on the DHCP server side. The INFORM support additionally requires some upstream improvements, which I've submitted at https://github.com/insomniacslk/dhcp/pull/470.

twelho avatar Jul 19 '22 14:07 twelho

I've identified a race condition when testing the INFORM request build, which has now been fixed in the latest commit. Any hostname request (or lease renewal for that matter) is performed using unicast, which requires the address sourced from DHCP to be available on the host (i.e. bound to an interface). This does not happen immediately after notifying the controller about the new address, and thus any immediate INFORM hostname requests will fail to bind to an address. After trying out a couple different approaches, I settled on a context-cancellable wait for both the address assignment and network upon lease negotiation to safely continue with unicast operations. This resolves the random failures I've been seeing when testing this at a large-ish scale.

Another issue I've spotted and subsequently fixed is the machine hostnames disappearing from DNS after a renewal. Turns out that at least Windows Server notices if the hostname is only sent during the initial negotiation and removes it from DNS. The most robust solution I came up with is always sending the hostname during request/renew operations, and extending the use of the INFORM hostname requests to lease renewals as well. So far so good.

twelho avatar Jul 27 '22 13:07 twelho

It looks like things fail with our "test" DHCP server, but we can fix that of course. This test DHCP server was never supposed to be complete implementation.

smira avatar Aug 08 '22 13:08 smira

It looks like upstreaming changes didn't quite work yet? Anything we can do to help?

smira avatar Aug 08 '22 13:08 smira

It looks like things fail with our "test" DHCP server, but we can fix that of course. This test DHCP server was never supposed to be complete implementation.

It should keep working even if the server doesn't support or can't process DHCPINFORM requests, hostname request errors are non-fatal on purpose. There are some intermittent CI failures, but I haven't been able to pin-point then to this DHCP rewrite. Of course if you have found some specific failure cases I would be happy to try to fix them.

It looks like upstreaming changes didn't quite work yet? Anything we can do to help?

I will ping the maintainers once more with a message in both https://github.com/insomniacslk/dhcp/pull/469 and https://github.com/insomniacslk/dhcp/pull/470. Of course if you can voice the importance of these changes there might be higher chance of people finally reviewing and merging them, as I think I have now addressed all of the feedback (at least in the former PR) and I don't plan on doing any more changes in either.

twelho avatar Aug 08 '22 15:08 twelho

As we're going to branch off v1.2.0 stable branch today, I moved it to the next iteration - v1.3.0

smira avatar Aug 15 '22 13:08 smira

Update: https://github.com/insomniacslk/dhcp/pull/469 has been merged and I've now refactored https://github.com/insomniacslk/dhcp/pull/470 which might take one more review round. During that work I located and fixed the final long-standing bug that was causing INFORM requests to time out in Talos, so after rebasing this PR on top of master everything is now working correctly as far as I can see.

twelho avatar Sep 11 '22 12:09 twelho

Update: insomniacslk/dhcp#469 has been merged and I've now refactored insomniacslk/dhcp#470 which might take one more review round. During that work I located and fixed the final long-standing bug that was causing INFORM requests to time out in Talos, so after rebasing this PR on top of master everything is now working correctly as far as I can see.

Thank you, this looks really promising!

smira avatar Sep 12 '22 13:09 smira

@twelho I'm also very grateful for your work. I hope you find time to finish it soon. 🙂

sanmai-NL avatar Feb 02 '23 12:02 sanmai-NL

Thanks for the reminders. There's still one more PR to expose an interface we need at https://github.com/insomniacslk/dhcp/pull/470, but seems to have completely stalled. I'll try to refresh it once more in the coming days to hopefully get a response from the maintainers and finally get this work merged.

twelho avatar Feb 11 '23 15:02 twelho

Great job ! Any plans to finish this one? Thanks

dimatha avatar Feb 26 '23 18:02 dimatha

@twelho thanks for your work, we plan to cut Talos beta release next week, and I want this PR to be merged before that.

If the upstream changes doesn't get forward in this time, I'm looking towards forking the DHCP library with your change (or using your fork) until your PR gets merged.

smira avatar Mar 23 '23 12:03 smira

Alright, I feel like the DHCP library work is very close. I've think I've satisfied all of the maintainer's constraints from my side, so it's essentially down to them to just decide how INFORM requests should be constructed. I believe it would help if you could also notify them about the urgency one more time in this thread: https://github.com/insomniacslk/dhcp/pull/470#discussion_r1141479576.

If they can still make up their mind this week and get the work merged, I can rebase/update this PR right away. However, if it looks like it won't happen before next week, I can also finish up this work using my fork for the time being, but avoiding keeping the fork would be my preference as well.

twelho avatar Mar 23 '23 13:03 twelho

With https://github.com/insomniacslk/dhcp/pull/470 (and https://github.com/insomniacslk/dhcp/pull/498 as a minor fix) finally merged (yay :tada:), this work does not need to rely on my own fork anymore, as all of my changes have now now landed upstream.

With that, I've gone through this PR one more time to remove the replace directive and clean up both the comments and logic. One more mistake was spotted and fixed, namely avoiding sending the server identifier together with INFORM requests as that is against the spec.

Note, however, that I've not yet tested this in its final form. As it's been 8 months since starting this work, I don't have access to the original test setup anymore (which also covered Windows environments), so I will need to create a new test setup for this, but I will only be able to test against dnsmasq. I would appreciate if others would also be able to confirm that this works as intended using their infrastructure.

twelho avatar Mar 27 '23 20:03 twelho

I did some (small) changes:

  • reworked our test dhcpd to be compliant with the new renewal protocol
  • did some changes around the hostname change detection logic, there were some changes to Talos since the time you started your work @twelho (default hostnames), in my testing it seems to work for all cases reasonably well
  • some cosmetic changes

The question of detecting "is it DHCP hostname", "should I send the hostname" is still not 100% reliable, but seems to be better now.

smira avatar Mar 29 '23 13:03 smira

peer review in bullpen?

frezbo avatar Mar 29 '23 13:03 frezbo

/promote integration-misc

smira avatar Mar 29 '23 14:03 smira

/promote e2e

smira avatar Mar 29 '23 14:03 smira

/promote e2e-aws

smira avatar Mar 29 '23 15:03 smira

/promote integration-qemu

smira avatar Mar 29 '23 17:03 smira

/promote e2e-aws

smira avatar Mar 29 '23 17:03 smira

/promote e2e-nothing

smira avatar Mar 29 '23 18:03 smira

/m

smira avatar Mar 29 '23 18:03 smira