LXD zone transfer configuration breaks after reboot.

Open ilyaredis opened this issue 1 year ago • 2 comments

Required information

  • Distribution version: 5.20
  • The output of "snap list --all lxd core20 core22 core24 snapd": see attached snap_list_all.txt
  • The output of "lxc info" or if that fails: see attached lxc_info.txt
  • Kernel version: Linux 6.1.0-1033-oem x86_64
  • LXC version: 5.20
  • LXD version: 5.20
  • Storage backend in use: local SSD

Issue description

I consistently run into an issue after configuring the zone transfer feature in LXD. Following the guide and demo at https://documentation.ubuntu.com/lxd/en/stable-5.0/howto/network_zones/ I set the listener on port 8853 like so: lxc config set core.dns_address 10.157.229.1:8853

My DNS server (I also use NSD) is deployed locally as a VM on the host machine. Once configured, it all works as described. However, once I reboot the host machine, the NSD server fails to connect on the same port. The issue is unrecoverable and requires setting everything up anew.

Here is the journalctl -f log fragment from the NSD server:

Feb 23 01:54:05 ns1 nsd[397]: redistest.com: Could not tcp connect to 10.157.229.1@8853: Connection refused
Feb 23 01:54:05 ns1 nsd[397]: 229.157.10.in-addr.arpa: Could not tcp connect to 10.157.229.1@8853: Connection refused

Please advise whether this is a known issue and whether a workaround is available.

Thanks!

Attachments:

snap_list_all.txt lxc_info.txt

ilyaredis, Feb 24 '24 02:02

I could reproduce this. It seems that unsetting the core.dns_address config key and setting it again is needed to restart the listener.
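A minimal sketch of that unset-and-set workaround, for anyone who wants to script it. The function name rebind_dns_listener is hypothetical, and the default address is simply the one from this report; only the lxc config unset / lxc config set calls on core.dns_address come from the thread.

```shell
#!/bin/sh
# Hypothetical helper: re-apply core.dns_address so that LXD restarts
# its DNS listener. The default address is the one used in this report.
rebind_dns_listener() {
    addr="${1:-10.157.229.1:8853}"
    # Unset the key first, then set it again to the same value.
    lxc config unset core.dns_address &&
        lxc config set core.dns_address "$addr"
}
```

After a reboot it could be called as rebind_dns_listener 10.157.229.1:8853, once LXD and its networks are up.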

escabo, Feb 28 '24 21:02

Thank you for the prompt update! I will use that as a workaround.

ilyaredis, Feb 29 '24 02:02

@ilyaredis, is 10.157.229.1 an IP configured on one of the networks managed by LXD, like lxdbr0?

I was able to reproduce this issue by binding the DNS listener on lxdbr0, which presumably causes LXD to try to start the DNS listener before lxdbr0's IP is configured.
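To check that hypothesis from the host side, something like the following sketch could be used to see whether the address is present on the bridge yet. The function names iface_has_addr and wait_for_addr are hypothetical, and the 30-attempt default is an arbitrary assumption; the only external command used is ip -o -4 addr show.

```shell
#!/bin/sh
# Hypothetical helper: succeed if IPv4 address $2 is currently
# configured on interface $1.
iface_has_addr() {
    ip -o -4 addr show dev "$1" 2>/dev/null | grep -qF "inet $2/"
}

# Hypothetical helper: poll once per second until the address appears,
# giving up after $3 attempts (default 30, an arbitrary choice).
wait_for_addr() {
    tries="${3:-30}"
    while [ "$tries" -gt 0 ]; do
        iface_has_addr "$1" "$2" && return 0
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}
```

For example, wait_for_addr lxdbr0 10.40.36.1 returns success as soon as the bridge has its IP, and returns failure if the address never shows up within the attempt limit.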

simondeziel, Mar 12 '24 20:03

@simondeziel, Yes. The 10.157.229.1 network is managed. Below is the edited output of the lxc network list command.

+-----------------+----------+---------+-----------------+---------------------------+-------------+---------+---------+
|      NAME       |   TYPE   | MANAGED |      IPV4       |           IPV6            | DESCRIPTION | USED BY |  STATE  |
+-----------------+----------+---------+-----------------+---------------------------+-------------+---------+---------+
...
+-----------------+----------+---------+-----------------+---------------------------+-------------+---------+---------+
| lxdbr0          | bridge   | YES     | 10.40.36.1/24   | fd42:61ce:b91e:6a82::1/64 |             | 1       | CREATED |
+-----------------+----------+---------+-----------------+---------------------------+-------------+---------+---------+
| redistestnet    | bridge   | YES     | 10.157.229.1/24 | none                      |             | 2       | CREATED |
+-----------------+----------+---------+-----------------+---------------------------+-------------+---------+---------+

...

ilyaredis, Mar 13 '24 18:03

@tomponline I don't know what the current recommendation is on binding LXD services on IPs assigned to devices it manages itself. Should we bind our services later, or do a few bind retries? Or maybe we should document that limitation?

@ilyaredis another workaround is to bind core.dns_address on all IPs (lxc config set core.dns_address '[::]:8853' or lxc config set core.dns_address '0.0.0.0:8853') and maybe add a firewall rule to ensure only connections arriving on lxdbr0 are accepted. Such a firewall rule would look like: iptables -A INPUT -p tcp --dport 8853 -i lxdbr0 -j ACCEPT # LXD auth DNS. This rule assumes the INPUT policy is to DROP traffic, of course.
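The two steps of this workaround can be sketched as a pair of shell functions. The function names bind_dns_all_addresses and allow_dns_from_bridge are hypothetical; the lxc and iptables invocations are the ones quoted in the comment above, with the port and bridge name turned into parameters. As stated there, the firewall rule only restricts access if the INPUT policy already drops other traffic.

```shell
#!/bin/sh
# Hypothetical helper: make the LXD DNS listener bind on all addresses
# instead of a bridge IP that may not exist yet when LXD starts.
# $1 is the port (default 8853, as used in this report).
bind_dns_all_addresses() {
    lxc config set core.dns_address "[::]:${1:-8853}"
}

# Hypothetical helper: accept TCP connections to the DNS listener only
# when they arrive on the given bridge.
# $1 is the bridge (default lxdbr0), $2 is the port (default 8853).
# Assumes the INPUT chain policy is DROP.
allow_dns_from_bridge() {
    iptables -A INPUT -p tcp --dport "${2:-8853}" -i "${1:-lxdbr0}" -j ACCEPT
}
```

In the setup from this report the listener address lives on redistestnet rather than lxdbr0, so the second call would presumably be allow_dns_from_bridge redistestnet.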

simondeziel, Mar 13 '24 18:03

Thank you, @simondeziel, for the update! I will use this workaround.

ilyaredis, Mar 13 '24 23:03

> @tomponline I don't know what the current recommendation is on binding LXD services on IPs assigned to devices it manages itself. Should we bind our services later, or do a few bind retries? Or maybe we should document that limitation?

We should probably ensure the listeners are started after the networks. We've had some similar issues in the past with the metrics listener, I think.

Ah yes here we go, we should move the BGP and DNS listener setup after networks are initialised, just like we do for the metrics and S3 buckets listener:

https://github.com/canonical/lxd/blob/main/lxd/daemon.go#L1447-L1501
https://github.com/canonical/lxd/blob/main/lxd/daemon.go#L1503-L1517

However, there may be an issue with that, as the BGP and DNS servers are used/modified by the networks, so some refactoring may be needed to allow the networks to configure the BGP/DNS servers before their listeners actually start.

tomponline, Mar 15 '24 16:03

@escabo this sounds like a similar class of bug to https://github.com/canonical/lxd/issues/12185 so perhaps we can fix them together?

tomponline, Apr 11 '24 10:04