avahi-browse bug on a large LAN: with --terminate or --cache it randomly never terminates
Hello,
Reporting an issue discussed in #avahi on freenode with lathiat:
I am experiencing an issue where the avahi-browse command never terminates when it should, randomly, on a large network.
I have a network with ~300 publishing devices. With ~100 devices it is fine; around ~150 we start noticing the issue; with ~300+ it is quite noticeable and easily reproducible, though it doesn't always happen.
My automation software regularly runs avahi-browse to pull detected node information, then connects to devices and performs various operations. As a work-around I currently have it time out after a few seconds, but with this many devices (this much mDNS traffic?) hangs happen often enough that the web UI for the software becomes noticeably slow while detection attempts time out and retry.
The detection software I wrote is fairly basic: it calls avahi-browse from Python with a timeout, and runs on a CentOS 7 server. The server is always up, and regarding the known time-sync-related bugs, it uses chronyd with CentOS Internet time sources, so I think that cause is highly unlikely.
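The "call avahi-browse from Python with a timeout" pattern might look something like this minimal sketch (the service type and 15 s timeout are assumptions mirroring the report; a hang is treated as "no data" so the caller can retry):

```python
import subprocess

def browse_services(timeout_s=15.0,
                    cmd=("avahi-browse", "-ltrp", "_myservice._tcp")):
    """Run avahi-browse with a hard timeout; return its output lines,
    or None if the command never terminated in time (the bug here)."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=timeout_s, check=False)
    except subprocess.TimeoutExpired:
        return None  # treat a hang as "no data"; caller retries later
    return result.stdout.splitlines()
```

When the bug triggers, every call burns the full timeout, which is exactly why the web UI stalls.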
The command I am using is: avahi-browse -ltrp ._
This times out (against my 15 s timeout) about 3% of the time. If I exchange the -t option (terminate) for --cache it does the same thing. I believe the -r resolving action is probably what has the issue.
When the issue occurs, I see output that just continually re-resolves things it has already displayed, as if -t were not used.
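Until the underlying bug is fixed, the duplicate re-resolves can be filtered out on the consumer side. A small sketch, assuming the semicolon-separated `avahi-browse -p` output where `=` lines carry interface, protocol, name, type, and domain in fields 1-5:

```python
def dedupe_resolved(lines):
    """Keep only the first '=' (resolved) record per
    (interface, protocol, name, type, domain) key."""
    seen, out = set(), []
    for line in lines:
        fields = line.split(";")
        if fields[0] != "=":
            out.append(line)       # pass '+'/'-' events through untouched
            continue
        key = tuple(fields[1:6])
        if key not in seen:        # drop repeated re-resolutions
            seen.add(key)
            out.append(line)
    return out
```

This only hides the symptom (repeated output), not the failure to terminate.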
This is on a large, fully 10GigE network, if that matters; the server has 16 cores and 32 GiB of memory. Probably not relevant.
The avahi-daemon is started with -s and --debug currently.
The config file:
```
[server]
use-ipv4=no
use-ipv6=yes
allow-interfaces=ens256
deny-interfaces=ens192,ens224
enable-dbus=yes
disallow-other-stacks=yes
objects-per-client-max=2048
ratelimit-interval-usec=1000000
ratelimit-burst=1000
cache-entries-max=2048

[wide-area]
enable-wide-area=no

[publish]

[reflector]

[rlimits]
rlimit-core=0
rlimit-data=4194304
rlimit-fsize=0
rlimit-nofile=768
rlimit-stack=4194304
rlimit-nproc=3
```
@bwfisher82 - did you ever find a way to better tune avahi for a large network?
Nope. Not sure if it's multicast in general or this mDNS implementation, because when pinging the group address you stop getting replies from everything as you add more and more devices: at just 200 devices it takes a 30-second ping to ff02::fb to be reasonably sure I got them all. I doubt it's an IPv6 issue, btw. We had to stop using this altogether and require users to provide static IPs for the automation program. We briefly considered multiple network segments but never made it that far before we just went with static addressing and no automatic detection on the network, basically completely removing avahi/bonjour/mdns. :/
We are seeing that hosts don't show up via avahi-browse for a very long time (up to an hour). However, if we restart avahi-daemon on the remote host, or on the one running avahi-browse, the host shows up right away. This is on a network with only ~50 devices. Not a lot of leads out there on Google.
I reproduced this issue by announcing a service pointing to an unresolvable host name and then sending a goodbye packet before the resolver timed out. Could anyone apply the following patch to see if it helps:
```diff
diff --git a/avahi-utils/avahi-browse.c b/avahi-utils/avahi-browse.c
index 4028ca0..f7542ff 100644
--- a/avahi-utils/avahi-browse.c
+++ b/avahi-utils/avahi-browse.c
@@ -284,8 +284,10 @@ static void remove_service(Config *c, ServiceInfo *i) {
     AVAHI_LLIST_REMOVE(ServiceInfo, info, services, i);
 
-    if (i->resolver)
+    if (i->resolver) {
         avahi_service_resolver_free(i->resolver);
+        n_resolving--;
+    }
 
     avahi_free(i->name);
     avahi_free(i->type);
@@ -331,6 +333,7 @@ static void service_browser_callback(
             return;
 
         remove_service(c, info);
+        check_terminate(c);
 
         print_service_line(c, '-', interface, protocol, name, type, domain, 1);
         break;
```
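The failure mode the patch addresses can be sketched in a few lines of Python (names mirror avahi-browse.c, but the model is my own simplification): each resolver started bumps a pending counter, and the browser only terminates once browsing has settled and the counter reaches zero. Before the patch, removing a service mid-resolve freed the resolver without decrementing the counter, so the terminate check could never fire.

```python
class BrowserState:
    """Simplified model of avahi-browse's termination bookkeeping."""

    def __init__(self):
        self.n_resolving = 0      # resolutions still in flight
        self.all_for_now = False  # browser has reported ALL_FOR_NOW
        self.terminated = False

    def start_resolver(self):
        self.n_resolving += 1

    def check_terminate(self):
        # terminate only when browsing settled and nothing is pending
        if self.all_for_now and self.n_resolving <= 0:
            self.terminated = True

    def remove_service(self, patched=True):
        if patched:
            self.n_resolving -= 1  # the patch: account for the freed resolver
            self.check_terminate()
        # unpatched path: resolver freed, counter leaks, never terminates
```

With an unresolvable service that sends a goodbye packet mid-resolve, the unpatched path leaves `n_resolving` stuck above zero forever.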
@evverx I'm not sure exactly how to apply a patch. We just install avahi from the RHEL / CentOS repos. I can grab this repo, check out a given branch / tag (latest branch?), and apply the patch if I have instructions for that part. It may take some time because we don't currently have the relevant devices in our data centers, but we should soonish.
@fisherbe I opened https://github.com/avahi/avahi/pull/583 so it should be possible to get that patch by running the following commands:
```
git clone https://github.com/avahi/avahi
cd avahi
git fetch origin pull/583/head:browse-cache-terminate
git checkout browse-cache-terminate
```

After installing the build dependencies and running

```
./bootstrap.sh
make
```

`avahi-browse` can be run directly from the `avahi-utils` directory without having to install anything:

```
./avahi-utils/avahi-browse -arpt
```
As mentioned there, the PR fixes one particular issue, and there can be other issues preventing `avahi-browse` from stopping. I found another way to trigger it, but it's unlikely to happen in practice unless something really malfunctions somewhere (or does that deliberately). If the patch doesn't help, it would be great if you could attach the output of `avahi-browse` when it happens, and also the output of tcpdump/wireshark showing incoming/outgoing mDNS packets.
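For the capture, something along these lines should work (the interface name `ens256` is taken from the config earlier in this thread; adjust to the interface avahi is actually bound to):

```shell
# Capture mDNS traffic (UDP port 5353) on the browsing interface
# and save it for later inspection in wireshark.
tcpdump -i ens256 -w mdns.pcap udp port 5353
```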