netbird icon indicating copy to clipboard operation
netbird copied to clipboard

Apply host DNS settings on peer state change

Open hurricanehrndz opened this issue 1 year ago • 22 comments

Describe your changes

This is a PoC and a potential fix #2002. This patch is not full fleshed out and their are some unattended side effects, but I wanted to get your thoughts and opinions @mlsmaycon and @pascal-fischer and @lixmal.

More Info:

This essential tries to ensure that host dns changes aren't applied til at least one routing peer is connected and DNS servers are reachable. It does so by tying deactivation and reactivation of upstream servers to Peerlist state changes in the status recorder.

As it stands, now status -d doesn't accurately reflect the status on first connection. That is because I haven't figure out an effective method to deactivate the servers without more intense changes. Primarily because of how the deactivate and reactivate callbacks are structure and their dependence of the removeIndex. One potential workaround is to modify deactivate to skip applying host config, so that we can fix this side effect.

Thoughts and discussion please

Issue ticket number and link

Checklist

  • [ ] Is it a bug fix
  • [ ] Is a typo/documentation fix
  • [x] Is a feature enhancement
  • [x] It is a refactor
  • [ ] Created tests that fail without the change (if possible)
  • [ ] Extended the README / documentation, if necessary

hurricanehrndz avatar Jul 19 '24 14:07 hurricanehrndz

This essential tries to ensure that host dns changes aren't applied til at least one routing peer is connected and DNS servers are reachable.

This approach would break non-routed dns servers (e.g. 8.8.8.8 without exit node), correct?

lixmal avatar Aug 02 '24 08:08 lixmal

This essential tries to ensure that host dns changes aren't applied til at least one routing peer is connected and DNS servers are reachable.

This approach would break non-routed dns servers (e.g. 8.8.8.8 without exit node), correct?

Not at all, it would just not apply settings until the client is connected to one peer. I am going to try and actually formulate this to a PR

I may update the activate and deactivate upstream handlers to take a flag to skip applying system settings so as fix the DNS updates from management.

hurricanehrndz avatar Aug 07 '24 14:08 hurricanehrndz

@lixmal this is ready not sure if perhaps this should be gated under a config option and the default would be maintain the current behavior. Test and let me know what you think

hurricanehrndz avatar Aug 15 '24 22:08 hurricanehrndz

Actually I got an idea to make a hybrid model, let me get that going instead

hurricanehrndz avatar Aug 16 '24 00:08 hurricanehrndz

Looking forward for this fix.

If I could help in any way to test this (for example on a setup I described in https://github.com/netbirdio/netbird/issues/2002#issuecomment-2293128357) let me know.

LeszekBlazewski avatar Aug 16 '24 09:08 LeszekBlazewski

Okay so this will waitforresponse and trigger probes base when a peer connstatus changes

hurricanehrndz avatar Aug 16 '24 16:08 hurricanehrndz

@lixmal and @mlsmaycon can we get a build so @LeszekBlazewski can provide some feedback

hurricanehrndz avatar Aug 16 '24 16:08 hurricanehrndz

You can find the binaries in the artifacts in this PR: https://github.com/netbirdio/netbird/actions/runs/10422794281

lixmal avatar Aug 16 '24 16:08 lixmal

Hey guys,

I have run some tests with the latest binaries built in this PR.

My setup:

  1. private R53 hosted zone with a configured inbound resolver (3 ips in range 10.3.X.X, only configured to match specific domains). Those 3 IP addresses are only accessible within VPC which runs a netbird peer. For the whole 10.3.0.0/16 range I have a network route configured which points all that traffic to the peer which runs inside that VPC and therefore is allowed to use the resolver.
  2. Imaginary DNS nameserver (ip 1.2.3.4, match all domains) that won't ever resolve
  3. Just for testing sake, the cloudflare and google DNS resolver (match all domains)
  4. Exit node setup

After comparing the current behaviour (netbird 0.28.7) vs the built binaries, I have found out that indeed the introduced changes postpone DNS servers evaluation until a routing peer is connected which does potentially address the issue mentioned in https://github.com/netbirdio/netbird/issues/2002 ... unless the nameserver is not supposed to be handled by that connected peer. Based on the logs, I have observed that nameserver probing is triggered regardless of the peer that got just connected ( I guess it's assumed that all configured netbird DNS servers can be handled by all connected peers or there are other technical limitations which prevent figuring out such information, as described below).

For example in my described setup, it would make sense to probe the nameservers from point 1 only if the routing peer which handles 10.3.X.X/16 would be connected. I am aware that such condition might be challenging to meet since there is no peer assignment for DNS resolution handling (only for distribution). Also the public DNS resolution (Google and Cloudflare) make the decision wether to probe / don't probe DNS even more harder since no one configures network routes for those DNS servers (there is a reason why they are public).

But still, I think that not applying the config when the DNS server is not reachable is key here and that seems to work. Moreover the backoff over time will fix the above mentioned issue and all the DNS servers who are reachable will be connected.

LeszekBlazewski avatar Aug 19 '24 21:08 LeszekBlazewski

Hey guys,

I have run some tests with the latest binaries built in this PR.

My setup:

1. private R53 hosted zone with a configured inbound resolver (3 ips in range `10.3.X.X`, only configured to match specific domains). Those 3 IP addresses are only accessible within VPC which runs a netbird peer. For the whole 10.3.0.0/16 range I have a network route configured which points all that traffic to the peer which runs inside that VPC and therefore is allowed to use the resolver.

2. Imaginary DNS nameserver (ip 1.2.3.4, match all domains) that won't ever resolve

3. Just for testing sake, the cloudflare and google DNS resolver (match all domains)

4. Exit node setup

After comparing the current behaviour (netbird 0.28.7) vs the built binaries, I have found out that indeed the introduced changes postpone DNS servers evaluation until a routing peer is connected which does potentially address the issue mentioned in #2002 ... unless the nameserver is not supposed to be handled by that connected peer. Based on the logs, I have observed that nameserver probing is triggered regardless of the peer that got just connected ( I guess it's assumed that all configured netbird DNS servers can be handled by all connected peers or there are other technical limitations which prevent figuring out such information, as described below).

For example in my described setup, it would make sense to probe the nameservers from point 1 only if the routing peer which handles 10.3.X.X/16 would be connected. I am aware that such condition might be challenging to meet since there is no peer assignment for DNS resolution handling (only for distribution). Also the public DNS resolution (Google and Cloudflare) make the decision wether to probe / don't probe DNS even more harder since no one configures network routes for those DNS servers (there is a reason why they are public).

But still, I think that not applying the config when the DNS server is not reachable is key here and that seems to work. Moreover the backoff over time will fix the above mentioned issue and all the DNS servers who are reachable will be connected.

I made this patch to show what is one potential fix, probing base on routes when a peer connects will get more complicated, but it is possible.

This patch still includes the old logic as fallback, so triggering if public DNS servers are configured they should get apply very early on

hurricanehrndz avatar Aug 19 '24 22:08 hurricanehrndz

Hey guys,

I have run some tests with the latest binaries built in this PR.

My setup:

1. private R53 hosted zone with a configured inbound resolver (3 ips in range `10.3.X.X`, only configured to match specific domains). Those 3 IP addresses are only accessible within VPC which runs a netbird peer. For the whole 10.3.0.0/16 range I have a network route configured which points all that traffic to the peer which runs inside that VPC and therefore is allowed to use the resolver.

2. Imaginary DNS nameserver (ip 1.2.3.4, match all domains) that won't ever resolve

3. Just for testing sake, the cloudflare and google DNS resolver (match all domains)

4. Exit node setup

After comparing the current behaviour (netbird 0.28.7) vs the built binaries, I have found out that indeed the introduced changes postpone DNS servers evaluation until a routing peer is connected which does potentially address the issue mentioned in #2002 ... unless the nameserver is not supposed to be handled by that connected peer. Based on the logs, I have observed that nameserver probing is triggered regardless of the peer that got just connected ( I guess it's assumed that all configured netbird DNS servers can be handled by all connected peers or there are other technical limitations which prevent figuring out such information, as described below).

For example in my described setup, it would make sense to probe the nameservers from point 1 only if the routing peer which handles 10.3.X.X/16 would be connected. I am aware that such condition might be challenging to meet since there is no peer assignment for DNS resolution handling (only for distribution). Also the public DNS resolution (Google and Cloudflare) make the decision wether to probe / don't probe DNS even more harder since no one configures network routes for those DNS servers (there is a reason why they are public).

But still, I think that not applying the config when the DNS server is not reachable is key here and that seems to work. Moreover the backoff over time will fix the above mentioned issue and all the DNS servers who are reachable will be connected.

I will try and add the ability to trigger a peer with route. What I gather though was that the App was performing and behaving better apply DNS changes base on when peers connect, is that right?

hurricanehrndz avatar Aug 20 '24 00:08 hurricanehrndz

Yeah, for sure. With the patch from this PR the amount of timed out calls to private NS is minimised.

One thing I wanted to figure out and I am wondering if this could be related to my setup somehow is: Based on the logs, even after the peer which handles 10.3.X.X traffic connects (logs indicate that), the first few calls to probe the private NS still fail and only after few seconds they come back online. I will try to investigate this since I am confident that those NS are available at all times and it could be related to the exit node node or the routing peer connecting as relay.

LeszekBlazewski avatar Aug 20 '24 05:08 LeszekBlazewski

Yeah, for sure. With the patch from this PR the amount of timed out calls to private NS is minimised.

One thing I wanted to figure out and I am wondering if this could be related to my setup somehow is: Based on the logs, even after the peer which handles 10.3.X.X traffic connects (logs indicate that), the first few calls to probe the private NS still fail and only after few seconds they come back online. I will try to investigate this since I am confident that those NS are available at all times and it could be related to the exit node node or the routing peer connecting as relay.

Do you have any system extensions that filter DNS requests, such as Cisco Umbrella? Assuming this is a macOS device

hurricanehrndz avatar Aug 20 '24 13:08 hurricanehrndz

The only thing running in the background is eset with the Network Access Protection enabled which might do some DNS filtering (I am not sure since I don't have access to the management of this antivirus but I was told it does not). I will make sure to disable it when testing.

What is interesting is that the NS requests fail only initially and then just work without me doing any changes.

LeszekBlazewski avatar Aug 21 '24 06:08 LeszekBlazewski

The only thing running in the background is eset with the Network Access Protection enabled which might do some DNS filtering (I am not sure since I don't have access to the management of this antivirus but I was told it does not). I will make sure to disable it when testing.

What is interesting is that the NS requests fail only initially and then just work without me doing any changes.

This patch now support enabling publicly accessible servers at processing the DNS update time and/or start

hurricanehrndz avatar Aug 22 '24 18:08 hurricanehrndz

@LeszekBlazewski

Minor updates to behavior in order to ensure publicly accessible servers work from the start without a peer. I unfortunately didn't filter probing base on network. So test this one out. I will see what I can do about that front, but this would need to be more heavily refactored since I would need to start another network monitor

hurricanehrndz avatar Aug 22 '24 18:08 hurricanehrndz

Indeed, did a quick test and the public nameservers were enabled almost instantly whereas the rest got processed later on. I am doing some checks on why the first initial few nameserver requests after the connection to the peer which is supposed to handle that traffic still fail.

LeszekBlazewski avatar Aug 23 '24 05:08 LeszekBlazewski

So this is ready for review, we have been running this for about two weeks with some excellent results.

@LeszekBlazewski I will need to fix some golang bugs before we can have DNS probing on routes. After fixing that refactoring of the networkmonitor would be necessary. I am not sure if all these changes can be implemented in a timely manner, so for now this patch is a good stop gap. I am going to go fix 44740 for now

https://github.com/golang/go/issues/44740

hurricanehrndz avatar Aug 29 '24 13:08 hurricanehrndz

Thanks for the update @hurricanehrndz. I have been running the patch for a while as well (the one before latest few commits) and it has been working good (does not fully solve my case but still is better than the current behaviour).

Thanks for the tremendous work on this! I think that going with the approach you suggested makes perfect sense and this PR should be considered to be merged.

LeszekBlazewski avatar Aug 29 '24 16:08 LeszekBlazewski

@LeszekBlazewski can you test the latest build, the patch has been updated to work with the latest 29.x releases. I believe @mlsmaycon will be reviewing this soon

https://github.com/netbirdio/netbird/actions/runs/10997072510?pr=2291

hurricanehrndz avatar Sep 23 '24 15:09 hurricanehrndz

Hi, sorry but I won't be able to test this until the 7th of October.

On Mon, 23 Sept 2024, 17:16 Carlos Hernandez, @.***> wrote:

@LeszekBlazewski https://github.com/LeszekBlazewski can you test the latest build, the patch has been updated to work with the latest 29.x releases. I believe @mlsmaycon https://github.com/mlsmaycon will be reviewing this soon

https://github.com/netbirdio/netbird/actions/runs/10997072510?pr=2291

— Reply to this email directly, view it on GitHub https://github.com/netbirdio/netbird/pull/2291#issuecomment-2368607283, or unsubscribe https://github.com/notifications/unsubscribe-auth/AIKPEJU26ZGWPW6TVGSJ7I3ZYAWDXAVCNFSM6AAAAABLEXUR26VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGNRYGYYDOMRYGM . You are receiving this because you were mentioned.Message ID: @.***>

LeszekBlazewski avatar Sep 23 '24 18:09 LeszekBlazewski

@LeszekBlazewski this now applies settings base on routing table

hurricanehrndz avatar Nov 15 '24 16:11 hurricanehrndz

i have same problem, like #2002, and this PR fix my problem.

But, it's possible to be rebase to last version ?

Eu-Arthur avatar Nov 28 '24 15:11 Eu-Arthur

Yeah, I will rebase it and see if I can fix some tests.

Get Outlook for iOShttps://aka.ms/o0ukef


From: Eustache Arthur @.> Sent: Thursday, November 28, 2024 8:02:26 AM To: netbirdio/netbird @.> Cc: Carlos Hernandez @.>; Mention @.> Subject: Re: [netbirdio/netbird] Apply host DNS settings on peer state change (PR #2291)

i have same problem, like #2002https://github.com/netbirdio/netbird/issues/2002, and this PR fix my problem.

But, it's possible to be rebase to last version ?

— Reply to this email directly, view it on GitHubhttps://github.com/netbirdio/netbird/pull/2291#issuecomment-2506318287, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ABMJBTKCQ4TGDXQHY3L55TT2C4WAFAVCNFSM6AAAAABLEXUR26VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDKMBWGMYTQMRYG4. You are receiving this because you were mentioned.Message ID: @.***>

hurricanehrndz avatar Nov 29 '24 13:11 hurricanehrndz

i have same problem, like #2002, and this PR fix my problem.

But, it's possible to be rebase to last version ?

I have updated this PR, but it is a WIP, please test. I need update this PR to support other platforms

hurricanehrndz avatar Nov 29 '24 21:11 hurricanehrndz

Yes, i don't see problem, for me that work on latest version

Eu-Arthur avatar Nov 30 '24 10:11 Eu-Arthur

@hurricanehrndz let's close this PR for now. We can open again once you are ready.

mlsmaycon avatar Feb 03 '25 08:02 mlsmaycon