Broken DNS resolution of *.our.domain on Windows client

Open tomashora opened this issue 1 year ago • 26 comments

Describe the problem

Resolution of all subdomains under *.our.domain fails for certain applications (for example, any web browser or the ping tool), although nslookup resolves the IP correctly. This used to happen in the past when NetBird was shut down incorrectly, as discussed on Slack. It now seems to happen the same way: hard laptop shutdown, system boots up, DNS does not resolve.

As a result, the clients cannot connect.

To Reproduce

Steps to reproduce the behavior: TBD

Expected behavior

All DNS records should be resolved correctly.

Are you using NetBird Cloud?

Self-hosted (v0.31.1 incl. relay as well as coturn)

NetBird version

0.31.1

NetBird status -dA output:

X

Do you face any (non-mobile) client issues?

2024-11-15T08:27:16+01:00 ERRO util/grpc/dialer.go:38: Failed to dial: dial: dial tcp: lookup netbird.our.domain: no such host

Screenshots

X

Additional context

The easiest way to fix it is to connect to the NetBird Cloud instance, which somehow resets the Windows DNS configuration so that *.our.domain is immediately resolved correctly.

Output of: Resolve-DnsName -Name www.our.domain

Resolve-DnsName: www.unipi.technology : This operation returned because the timeout period expired.

Output of: ping www.our.domain

Ping request could not find host www.our.domain. Please check the name and try again.

Output of: nslookup www.our.domain

Server:  dns.google
Address:  8.8.8.8

Non-authoritative answer:
Name:    our.domain
Address:  correct IP address
Aliases:  www.our.domain

Output of Get-DnsClientNrptPolicy

Namespace                        : .ourdomain.local
QueryPolicy                      :
SecureNameQueryFallback          :
DirectAccessIPsecCARestriction   :
DirectAccessProxyName            :
DirectAccessDnsServers           :
DirectAccessEnabled              :
DirectAccessProxyType            : NoProxy
DirectAccessQueryIPsecEncryption :
DirectAccessQueryIPsecRequired   : False
NameServers                      : 10.220.255.254
DnsSecIPsecCARestriction         :
DnsSecQueryIPsecEncryption       :
DnsSecQueryIPsecRequired         : False
DnsSecValidationRequired         : False
NameEncoding                     : Utf8WithoutMapping

Namespace                        : .our.domain
QueryPolicy                      :
SecureNameQueryFallback          :
DirectAccessIPsecCARestriction   :
DirectAccessProxyName            :
DirectAccessDnsServers           :
DirectAccessEnabled              :
DirectAccessProxyType            : NoProxy
DirectAccessQueryIPsecEncryption :
DirectAccessQueryIPsecRequired   : False
NameServers                      : 10.220.255.254
DnsSecIPsecCARestriction         :
DnsSecQueryIPsecEncryption       :
DnsSecQueryIPsecRequired         : False
DnsSecValidationRequired         : False
NameEncoding                     : Utf8WithoutMapping

tomashora avatar Nov 15 '24 08:11 tomashora

Try deleting this registry key when this happens: Computer\HKEY_LOCAL_MACHINE\SYSTEM\ControlSet001\Services\Dnscache\Parameters\DnsPolicyConfig\NetBird-Match

Can you also confirm that the domain used for your NetBird controller is also supposed to be routed through the WireGuard tunnel after the connection is established?

roberthase avatar Nov 18 '24 19:11 roberthase

@roberthase thanks, this did the trick. Could the NetBird service do this itself, since it requires elevated permissions? It behaved rather weirdly before I applied your trick, but I am not an expert on how networking works in Windows: nslookup returned a proper result from the main DNS resolver, but ping and traceroute failed.

I am not sure I understand your question, but the management and admin domains are subdomains (management.example.com) of the domain that was stuck in the registry (.example.com).

cleveHEX avatar Nov 22 '24 08:11 cleveHEX

The registry key I posted gets created when NetBird successfully connects.

All match domains you configured in your controller under DNS -> Nameservers are listed there.

With match domains configured, every domain you entered is only reachable over the WireGuard tunnel/interface.

When NetBird is gracefully shut down or disconnected, the registry key gets deleted.

There can be instances where Windows could not shut down correctly, and that's where things get ugly if the domain of your controller is also a match domain.

You boot your system, the registry key is still there, and now your NetBird client can't reach your controller at netbird.example.com, because example.com is supposed to go through your WireGuard tunnel/interface.

nslookup should still give you the right result, because it uses the DNS server configured on your PC or your router, but the routing is wrong.
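
To illustrate the explanation above, here is a minimal Python sketch of suffix-based rule matching. It is a simplified model of how a name resolution policy rule such as .example.com captures every name below it, not the actual Windows or NetBird implementation; the function name match_rule and the sample rule table are hypothetical.

```python
from typing import Optional


def match_rule(hostname: str, rules: dict[str, str]) -> Optional[str]:
    """Return the nameserver of the most specific suffix rule covering hostname.

    `rules` maps a namespace with a leading dot (e.g. ".example.com") to the
    nameserver that should answer for it. Simplified model: only names strictly
    below the suffix are treated as matches, and the longest suffix wins.
    Whether the bare apex name also matches is not modelled here.
    """
    name = "." + hostname.lower().rstrip(".")
    best = None
    for namespace in rules:
        suffix = namespace.lower()
        if name.endswith(suffix) and len(name) > len(suffix):
            if best is None or len(suffix) > len(best):
                best = namespace
    return rules[best] if best is not None else None


# Hypothetical leftover rule after an unclean shutdown (the address is taken
# from the Get-DnsClientNrptPolicy output earlier in this thread).
stale_rules = {".example.com": "10.220.255.254"}

# The controller's own name is captured by the stale rule, so the lookup is
# sent to a resolver that is only reachable through the tunnel.
print(match_rule("netbird.example.com", stale_rules))  # 10.220.255.254
print(match_rule("www.other.org", stale_rules))        # None
```

In this model, routing only narrower subdomains means the controller's name no longer falls under any rule, so it keeps resolving through the normal system DNS servers.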

roberthase avatar Nov 22 '24 20:11 roberthase

Thank you for the explanation. I have noticed that NetBird removes those entries on graceful deactivation. I wondered whether NetBird could also try to delete this entry on start (a pre-start cleanup, before it actually starts doing anything).

Anyway, I will propose moving the management out of the domains that go via WireGuard. Unfortunately this will mean a lot of changes (the management URL will stay the same, but domains and services will be buried one level lower under one more subdomain).

cleveHEX avatar Nov 24 '24 12:11 cleveHEX

It's still happening randomly also with client 0.35.2

tomashora avatar Jan 06 '25 08:01 tomashora

We route only the relevant subdomains now, so netbird.example.com is not affected. Maybe this is also possible for your environment.

roberthase avatar Jan 06 '25 08:01 roberthase

It's still happening randomly also with client 0.35.2

We are experiencing the same issue with the latest client version. Could this be corrected so that the NetBird client takes care of the cleanup each time the host is started or, as @cleveHEX has suggested, on client start?

jakovnikolic avatar Jan 08 '25 08:01 jakovnikolic

@jakovnikolic both of these are implemented. Can you share more about the issue you have?

lixmal avatar Jan 08 '25 11:01 lixmal

This morning I got support requests from 2 of my colleagues telling me they are not able to connect or access any of our internal domains. They are using Windows and have just updated their clients to the latest version, v0.35.2.

After manually removing HKEY_LOCAL_MACHINE\SYSTEM\ControlSet001\Services\Dnscache\Parameters\DnsPolicyConfig\NetBird-Match they managed to connect and access everything as expected.

We tested nslookup, and the domain of our VPN server resolves without any issues, but ping gets no response. This is consistent with nslookup opening a Winsock connection on the DNS port and issuing the query itself, whereas ping uses the DNS Client service.
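
The difference between the two lookup paths can be sketched in Python. This is an illustration under assumptions, not how nslookup or ping are implemented: build_query, query_direct and query_system are hypothetical helper names, and the sketch only covers a plain A-record query over UDP. socket.getaddrinfo goes through the operating system's resolver (on Windows that is the DNS Client service, which applies the name resolution policy rules), while query_direct talks to a chosen server itself, in the same spirit as nslookup.

```python
import socket
import struct


def build_query(name: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query packet for an A record (RFC 1035 wire format)."""
    # Header: ID, flags (recursion desired), 1 question, 0 answer/authority/additional.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A), QCLASS=1 (IN).
    return header + qname + struct.pack("!HH", 1, 1)


def query_direct(name: str, server: str, timeout: float = 3.0) -> bytes:
    """Send the query straight to `server` on UDP port 53 and return the raw reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name), (server, 53))
        reply, _ = sock.recvfrom(512)
        return reply


def query_system(name: str) -> list[str]:
    """Resolve through the operating system's resolver, as most applications do."""
    infos = socket.getaddrinfo(name, None, family=socket.AF_INET)
    return sorted({info[4][0] for info in infos})
```

With a stale rule in place, one would expect query_direct("www.our.domain", "8.8.8.8") to receive a reply while query_system("www.our.domain") raises socket.gaierror, which mirrors the nslookup and ping results reported in this thread.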

jakovnikolic avatar Jan 08 '25 11:01 jakovnikolic

Exactly the same for us

tomashora avatar Jan 08 '25 11:01 tomashora

Can you share the contents of %PROGRAMDATA%\Netbird\state.json the next time this happens, before starting NetBird?

Does this happen on boot or when waking up from sleep? If after sleep, does rebooting fix the issue (netbird cleans up on service start)?

Does netbird state clean dns_state fix it? Does netbird service restart fix it?

Also logs would be helpful, e.g. with netbird debug bundle -A once netbird is running again

lixmal avatar Jan 08 '25 11:01 lixmal

@roberthase I am seeing the same issue on macOS. What is the equivalent of clearing the registry on macOS?

Amit-Tomar-Livspace avatar Jan 09 '25 04:01 Amit-Tomar-Livspace

Sorry, I do not know where these settings are stored on macOS.

To follow up with a new issue we experienced after upgrading clients, routing peers and the controller from 0.31.0 to 0.35.2:

On some Windows 11 clients, the registry key for the routes is created after the connection is established and is then immediately deleted.

So these clients can connect to the controller but not to the routes.

Downgrading, or uninstalling and reinstalling the client at 0.31.0, has no effect. The new issue persists.

Any advice on how to remedy this issue?

roberthase avatar Jan 15 '25 14:01 roberthase

For anyone's future reference, this is how I temporarily resolved it on macOS:

  1. Open the hosts file: sudo vim /etc/hosts.
  2. Manually find the IP address of the company domain that is not being resolved.
  3. At the end of the file, add an entry for that domain so that DNS resolution can happen, e.g.

99.22.11.33 foo.mydomain.com

Once this was done, NetBird started working properly. After this I removed the entry from the hosts file and NetBird continues to work fine. Adding this entry anywhere else in the DNS resolution settings did not work. I believe the hosts file takes precedence over everything else, and hence it worked.

Amit-Tomar-Livspace avatar Jan 16 '25 06:01 Amit-Tomar-Livspace

Can you share the contents of %PROGRAMDATA%\Netbird\state.json the next time this happens before starting netbird?

The file did not exist

Does this happen on boot or when waking up from sleep? If after sleep, does rebooting fix the issue (netbird cleans up on service start)?

The person did not know how this happened.

Does netbird state clean dns_state fix it? Does netbird service restart fix it?

None of these fixed the issue, only the manual registry edit.

Also logs would be helpful, e.g. with netbird debug bundle -A once netbird is running again

netbird.debug.1270581651.zip

cleveHEX avatar Jan 22 '25 14:01 cleveHEX

@cleveHEX thank you.

The state file seems to be corrupted, hence the cleanup fails:

2025-01-21T11:23:10+01:00 WARN client/internal/statemanager/manager.go:307: State file appears to be corrupted, attempting to delete itinvalid character '\x00' looking for beginning of value
2025-01-21T11:23:10+01:00 INFO client/internal/statemanager/manager.go:311: State file deleted
2025-01-21T11:23:10+01:00 WARN client/server/server.go:109: failed to restore residual state: 1 error occurred:
	* perform cleanup: load state file: unmarshal states: invalid character '\x00' looking for beginning of value
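
The parser error in this log is what a JSON decoder reports when the file starts with NUL bytes instead of JSON. As a rough illustration (a Python sketch, not NetBird's Go code; load_state and its return shape are hypothetical), a loader can report corruption explicitly, so that the caller can decide to run a cleanup that does not depend on the saved state:

```python
import json
import os
import tempfile


def load_state(path: str) -> tuple[dict, bool]:
    """Return (state, corrupted).

    A missing file yields an empty state; unreadable JSON (for example a file
    filled with NUL bytes) yields an empty state with corrupted=True.
    """
    try:
        with open(path, "rb") as fh:
            raw = fh.read()
    except FileNotFoundError:
        return {}, False
    try:
        state = json.loads(raw.decode("utf-8"))
    except ValueError:  # covers JSONDecodeError and UnicodeDecodeError
        return {}, True
    if not isinstance(state, dict):
        return {}, True
    return state, False


# Demonstration with a temporary file that contains only NUL bytes.
with tempfile.TemporaryDirectory() as tmp:
    broken = os.path.join(tmp, "state.json")
    with open(broken, "wb") as fh:
        fh.write(b"\x00" * 16)
    print(load_state(broken))  # ({}, True)
```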

I'll see if I can reproduce the issue

lixmal avatar Jan 22 '25 15:01 lixmal

It happened to me today: when I got to the computer, I found that my PC had hit a BSOD overnight, with no dump available.

cleveHEX avatar Jan 23 '25 07:01 cleveHEX

@lixmal happened again after windows update

netbird.debug.373164297.zip state.json

tomashora avatar Jan 31 '25 13:01 tomashora

@lixmal today again also after updating Windows

netbird.debug.1700204602.zip - before fix by removing registry

netbird.debug.1700204602 1.zip - after fix and reconnect

tomashora avatar Feb 05 '25 13:02 tomashora

UPDATE: After a reboot, things seem to be back to normal.

I have the same problem with 0.37.2 as described in #3468. After updating to 0.38.0 today, I got this issue again. What is weird is that I can't find the Computer\HKEY_LOCAL_MACHINE\SYSTEM\ControlSet001\Services\Dnscache\Parameters\DnsPolicyConfig\NetBird-Match key in the registry editor, and the GPO DNS policy doesn't have any related record either. Only Get-DnsClientNrptPolicy shows the policy.

state.json netbird.debug.820797045.zip

powershell output

PS C:\Users\kortan> Get-DnsClientGlobalSetting


UseSuffixSearchList : False
SuffixSearchList    : {}
UseDevolution       : True
DevolutionLevel     : 0



PS C:\Users\kortan> Get-DnsClientNrptGlobal

EnableDAForAllNetworks QueryPolicy SecureNameQueryFallback
---------------------- ----------- -----------------------
Disable                Disable     Disable


PS C:\Users\kortan> Get-DnsClientNrptRule


Name                             : {C43E1699-C69D-4DFB-9737-AF855565626D}
Version                          : 1
Namespace                        : {.85.100.in-addr.arpa, .86.100.in-addr.arpa, .87.100.in-addr.arpa, .88.100.in-addr.arpa...}
IPsecCARestriction               :
DirectAccessDnsServers           :
DirectAccessEnabled              : False
DirectAccessProxyType            :
DirectAccessProxyName            :
DirectAccessQueryIPsecEncryption :
DirectAccessQueryIPsecRequired   :
NameServers                      : 100.100.100.100
DnsSecEnabled                    : False
DnsSecQueryIPsecEncryption       :
DnsSecQueryIPsecRequired         :
DnsSecValidationRequired         :
NameEncoding                     : Disable
DisplayName                      :
Comment                          :



PS C:\Users\kortan> Get-DnsClientNrptPolicy


Namespace                        : .65.100.in-addr.arpa
QueryPolicy                      :
SecureNameQueryFallback          :
DirectAccessIPsecCARestriction   :
DirectAccessProxyName            :
DirectAccessDnsServers           :
DirectAccessEnabled              :
DirectAccessProxyType            : NoProxy
DirectAccessQueryIPsecEncryption :
DirectAccessQueryIPsecRequired   : False
NameServers                      : 100.65.255.254
DnsSecIPsecCARestriction         :
DnsSecQueryIPsecEncryption       :
DnsSecQueryIPsecRequired         : False
DnsSecValidationRequired         : False
NameEncoding                     : Utf8WithoutMapping

Namespace                        : .netbird.some.domain
QueryPolicy                      :
SecureNameQueryFallback          :
DirectAccessIPsecCARestriction   :
DirectAccessProxyName            :
DirectAccessDnsServers           :
DirectAccessEnabled              :
DirectAccessProxyType            : NoProxy
DirectAccessQueryIPsecEncryption :
DirectAccessQueryIPsecRequired   : False
NameServers                      : 100.65.255.254
DnsSecIPsecCARestriction         :
DnsSecQueryIPsecEncryption       :
DnsSecQueryIPsecRequired         : False
DnsSecValidationRequired         : False
NameEncoding                     : Utf8WithoutMapping

Namespace                        : .some.domain
QueryPolicy                      :
SecureNameQueryFallback          :
DirectAccessIPsecCARestriction   :
DirectAccessProxyName            :
DirectAccessDnsServers           :
DirectAccessEnabled              :
DirectAccessProxyType            : NoProxy
DirectAccessQueryIPsecEncryption :
DirectAccessQueryIPsecRequired   : False
NameServers                      : 100.65.255.254
DnsSecIPsecCARestriction         :
DnsSecQueryIPsecEncryption       :
DnsSecQueryIPsecRequired         : False
DnsSecValidationRequired         : False
NameEncoding                     : Utf8WithoutMapping

Namespace                        : .relay.some.domain
QueryPolicy                      :
SecureNameQueryFallback          :
DirectAccessIPsecCARestriction   :
DirectAccessProxyName            :
DirectAccessDnsServers           :
DirectAccessEnabled              :
DirectAccessProxyType            : NoProxy
DirectAccessQueryIPsecEncryption :
DirectAccessQueryIPsecRequired   : False
NameServers                      : 100.65.255.254
DnsSecIPsecCARestriction         :
DnsSecQueryIPsecEncryption       :
DnsSecQueryIPsecRequired         : False
DnsSecValidationRequired         : False
NameEncoding                     : Utf8WithoutMapping

Namespace                        : .management.some.domain
QueryPolicy                      :
SecureNameQueryFallback          :
DirectAccessIPsecCARestriction   :
DirectAccessProxyName            :
DirectAccessDnsServers           :
DirectAccessEnabled              :
DirectAccessProxyType            : NoProxy
DirectAccessQueryIPsecEncryption :
DirectAccessQueryIPsecRequired   : False
NameServers                      : 100.65.255.254
DnsSecIPsecCARestriction         :
DnsSecQueryIPsecEncryption       :
DnsSecQueryIPsecRequired         : False
DnsSecValidationRequired         : False
NameEncoding                     : Utf8WithoutMapping



PS C:\Users\kortan> Get-ChildItem -Path "HKLM:\SYSTEM\CurrentControlSet\Services\Dnscache\Parameters\DnsPolicyConfig"


    Hive: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Dnscache\Parameters\DnsPolicyConfig


Name                           Property
----                           --------
{C43E1699-C69D-4DFB-9737-AF855 Version           : 1
565626D}                       Name              : {.85.100.in-addr.arpa, .86.100.in-addr.arpa, .87.100.in-addr.arpa, .
                               88.100.in-addr.arpa...}
                               GenericDNSServers : 100.100.100.100
                               ConfigOptions     : 8


PS C:\Users\kortan> Get-ChildItem -Path "HKLM:\SOFTWARE\Policies\Microsoft\Windows NT\DNSClient\DnsPolicyConfig"
PS C:\Users\kortan>

KortanZ avatar Mar 11 '25 15:03 KortanZ

@lixmal This did not happen for some time on 0.39.1 and 0.39.2, but with 0.40 it happened again. It usually happens after waking up the laptop (after opening the lid). Fortunately, probably thanks to https://github.com/netbirdio/netbird/pull/3614, it fixed itself after netbird down & up.

2025-04-08T14:34:53+02:00 INFO client/internal/statemanager/manager.go:412: cleaning up state dns_state
2025-04-08T14:34:53+02:00 WARN client/server/server.go:590: failed to restore residual state: 1 error occurred:
	* perform cleanup: 1 error occurred:
	* dns_state: cleanup state: restore unclean shutdown dns: remove interface registry key: get interface registry key: open HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\Interfaces\{F2F29E61-D91F-4D76-8151-119B20C4BDEB}: The system cannot find the file specified.
2025-04-08T14:34:53+02:00 INFO client/internal/connect.go:122: starting NetBird client version 0.40.0 on windows/amd64
2025-04-08T14:34:54+02:00 INFO client/internal/engine.go:320: stopped Netbird Engine

netbird.debug.2492282057.zip

tomashora avatar Apr 08 '25 21:04 tomashora

@lixmal It already happened twice today. Down & up fixed it. By the way, what is the preferred way to access private DNS: via a peer or via a network route (currently used)? Debug was enabled after this dump.

netbird.debug.1355838872.zip

tomashora avatar Apr 09 '25 07:04 tomashora

@lixmal With 0.40.+ the resolution of *.our.domain is broken after every laptop hibernation/sleep and can be fixed by disconnect/connect. It's really frustrating and practically unusable for our clients.

tomashora avatar Apr 17 '25 07:04 tomashora

Can you avoid using your top-level domain in your match domains and use subdomains instead? That way netbird.your.domain is not routed through the WireGuard tunnel. That's what I did, and I have never had an issue since.

roberthase avatar Apr 17 '25 07:04 roberthase

The issue was in the configuration of the nameserver in NetBird, where the match domain was *.our.domain but the NetBird management is running on netbird.our.domain. After removing *.our.domain from the match domains, it never happened again. I tested for two days; before the change it happened instantly every time the laptop lid was closed and opened.
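
A check along these lines could catch this misconfiguration before it reaches clients. This is a hypothetical Python sketch, not a NetBird feature: conflicting_match_domains is an invented name, and it conservatively assumes that a match domain, written with or without a leading *., covers the domain itself and every name below it.

```python
from urllib.parse import urlparse


def conflicting_match_domains(management_url: str, match_domains: list[str]) -> list[str]:
    """Return the match domains that would also capture the management host."""
    host = (urlparse(management_url).hostname or "").lower().rstrip(".")
    conflicts = []
    for domain in match_domains:
        # Normalise "*.our.domain" and ".our.domain" to "our.domain".
        base = domain.lower().lstrip("*").strip(".")
        if base and (host == base or host.endswith("." + base)):
            conflicts.append(domain)
    return conflicts


print(conflicting_match_domains(
    "https://netbird.our.domain:443",
    ["*.our.domain", "internal.our.domain"],
))  # ['*.our.domain']
```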

tomashora avatar Apr 22 '25 21:04 tomashora

I experienced this too, and while removing *.our.domain from "Match Domains" works, I think it is only a temporary solution. I think the NetBird client should remove the name resolution policy when it disconnects and respect the system DNS settings. I am not sure whether this is how other ZTNA products work.

laichenkang-cathay avatar Jun 04 '25 02:06 laichenkang-cathay