igmpproxy icon indicating copy to clipboard operation
igmpproxy copied to clipboard

igmpproxy sometimes removes the mroute if "Membership Report" message is received close to "Max Resp Time" value specified in the "Membership Query" message

Open tonusoo opened this issue 1 year ago • 2 comments

I have noticed that occasionally igmpproxy removes the multicast route if "Membership Report" message is received close to "Max Resp Time" value specified in the "Membership Query" message. For example, please see the packet capture below:

root@r1:~# tshark -Y 'frame.number in {25..27}' -r br0_igmp.pcap
Running as user "root" and group "root". This could be dangerous.
   25 190.655057  192.168.0.1 → 224.0.0.1    IGMPv2 46 Membership Query, general
   26 196.659055  192.168.0.1 → 224.0.0.1    IGMPv2 46 Membership Query, general
   27 200.665434 192.168.0.12 → 239.3.1.106  IGMPv2 60 Membership Report group 239.3.1.106
root@r1:~#

This packet capture was made on a LAN-facing interface named br0 in a router running igmpproxy. As seen above, the "Membership Report" message was received 10.010377 seconds after the initial "Membership Query" message while the "Max Resp Time" value in the "Membership Query" messages was 10.0 seconds. This caused igmpproxy to remove the multicast route:

Sent membership query from 192.168.0.1 ifIndex 11 to 224.0.0.1. Delay: 10
Created timeout 11 (#0) - delay 10 secs
(Id:11, Time:10)
Created timeout 12 (#1) - delay 115 secs
(Id:11, Time:10)
(Id:12, Time:115)
About to call timeout 11 (#0)
Aging routes in table.
Removing group 239.3.1.106. Died of old age.
Removed route entry for 239.3.1.106 from table.
Vif bits : 0x00000001
Setting TTL for Vif 0 to 1
Removing MFC: 10.0.2.28 -> 239.3.1.106, InpVIf: 1

One might think that this is perhaps an expected behavior as "Membership Report" message was received later than the "Max Resp Time" of 10.0 seconds, but according to rfc2236 Section 3 the host receiving the "General Membership Query" is allowed to answer at 10.0th second and when adding network/processing delays, then it is IMHO expected that occasionally the "Membership Report" messages will arrive tiny bit later than 10.0 seconds.

In order to troubleshoot this better, I built a small lab with following topology:

h3[ens3] <-> [ens3]gw1[ens4] <-> [ens3]mr1

h3 is a virtual machine in multicast receiver role running socat -u UDP4-RECV:10,ip-add-membership=239.3.5.3:ens3 -. gw1 is a multicast router running igmpproxy version 0.4 with configuration below:

quickleave
phyint ens4 upstream  ratelimit 0  threshold 1
    altnet 10.0.0.0/8
phyint ens3 downstream  ratelimit 0  threshold 1

mr1 is in the multicast sender role and running send(IP(src='10.0.2.28', dst='239.3.5.3')/UDP(sport=10, dport=10), iface='ens3', inter=1./2, loop=1) in scapy which sends UDP datagrams addressed to 239.3.5.3 twice a second via ens3 interface in an endless loop. All three virtual machines are running Debian 10.

I have observed the similar issue on this lab setup. One occurrence can be seen on the screenshot below: igmpproxy_mroute_died_of_old_age

Upper tmux panes are h3 running tcpdump and socat, respectively. Middle panes are src/igmpproxy -d -vv /etc/igmpproxy.conf 2>&1 piped through timestamp input utility and custom packet capture utility. Lower right pane is running the scapy command mentioned above in mr1 and lower left pane is printing a timestamp in gw1 if the multicast route is missing:

while true; do ip mroute | grep -q '(10.0.2.28,239.3.5.3) *Iif: ens4 *Oifs: ens3  State: resolved' || date +'%T.%3N'; done

As highlighted on the screenshot, the multicast route was removed 0.9 seconds before the "Membership Report" message was received. The "Membership Report" message was received 10.153 seconds after the initial "Memberhip Query". However, I have seen this happen even when the delay between the query and the report has been 9.6 seconds. In addition, the multicast route is not removed each time when the delay is >10.0 seconds.

In order to avoid the removal of the multicast route, I relaxed the pselect() timeout:

root@gw1:~/igmpproxy# diff -u src/igmpproxy.c{~,}
--- src/igmpproxy.c~    2023-05-05 01:09:08.551415847 +0300
+++ src/igmpproxy.c     2023-05-05 01:09:27.418989595 +0300
@@ -296,7 +296,7 @@
     int     MaxFD, Rt, secs;
     fd_set  ReadFDS;
     socklen_t dummy = 0;
-    struct  timespec  curtime, lasttime, difftime, tv;
+    struct  timespec  curtime, lasttime, difftime, tv, pselect_timeout;
     // The timeout is a pointer in order to set it to NULL if nessecary.
     struct  timespec  *timeout = &tv;

@@ -339,8 +339,12 @@
         FD_ZERO( &ReadFDS );
         FD_SET( MRouterFD, &ReadFDS );

+       memcpy(&pselect_timeout, timeout, sizeof(struct timespec));
+       pselect_timeout.tv_sec *= 2;
+       my_log(LOG_DEBUG, 0, "Calling pselect() with %ld.%09ld seconds timeout", pselect_timeout.tv_sec, pselect_timeout.tv_nsec);
+
         // wait for input
-        Rt = pselect( MaxFD +1, &ReadFDS, NULL, NULL, timeout, NULL );
+        Rt = pselect( MaxFD +1, &ReadFDS, NULL, NULL, &pselect_timeout, NULL );

         // log and ignore failures
         if( Rt < 0 ) {
root@gw1:~/igmpproxy#

Alternatively, at least in a lab environment, one could send frequent "Membership Report" messages from the multicast receiver host, e.g send(IP(dst="239.3.5.3")/IGMP(type=0x16, gaddr="239.3.5.3"), iface="ens3", inter=1./1, loop=1) in scapy.

Has anyone else observed this behavior? Please let me know if I can provide additional details.

tonusoo avatar May 04 '23 23:05 tonusoo