
pimd on a nat router

ruckc opened this issue 6 years ago • 38 comments

We're trying to use pimd to route multicast into and out of a NAT'd subnet. We are successfully getting the data into the subnet through the NAT with no issues, but on egress the data gets out with the NAT'd, non-routable RFC1918 address as the source IP in the IGMP payload on the public network.

Is there any way to manipulate the PIM/IGMP payloads to masquerade the private IP addresses?

ruckc avatar Oct 20 '18 16:10 ruckc

I'm not really sure I fully understand your setup. Maybe you could provide a network topology of sender and receiver?

Generally speaking, there are lots of trap doors and booby traps surrounding multicast and NAT. For instance, IGMP is supposed to be link-local only, not to be routed. PIM routers peer using PIM messages on a shared LAN, or over a tunnel (e.g. GRE). IGMP is only used at layer 2, on the sender and receiver side.

troglobit avatar Oct 21 '18 10:10 troglobit

[mc diagram]

Alright, I tried to make a diagram. For the sake of completeness I included the firewalls, but for all intents they are transparent for our multicast/igmp/pim traffic.

Essentially the MC data comes from the WAN, through the security layers in the DMZ, gets modified there, then goes back out of the DMZ into the User LAN for consumption.

So the issue is that the MC Sender inside the DMZ, behind the IPTables/NAT/PIMD router, has an RFC1918 address. The MC Receivers in the User LAN receive the IGMP membership query, which contains the RFC1918 private IP address as the source address. Since they don't have a unicast route to that address, they fail.

Ideally, I'd like to convince pimd (or use iptables) to modify the IGMP packets leaving the IPTables/NAT/PIMD router for the larger network, masquerading the RFC1918 address embedded in the source address field.

ruckc avatar Oct 22 '18 12:10 ruckc

Also, it appears the PIM Register message contains the private IP address when being sent to the core router.

ruckc avatar Oct 22 '18 13:10 ruckc

So, I've created a simulation of the graphic above, with 4 hosts and with the core router and NAT router both running pimd. In cleanroom testing, the RFC1918 address shows up on the core router in both the "ip mroute" output and the "pimd -r" output.

This is set up by running "sockperf server" on the MC server and "sockperf ping-pong" on the MC receiver workstation. It appears to function properly, but I believe having the RFC1918 address exposed in the multicast traffic is confusing our more complicated multicast receivers, since they don't have a unicast route to the RFC1918 address.

ruckc avatar Oct 22 '18 18:10 ruckc

I've never attempted anything like what you're trying to do, so I honestly don't know if I can help you. Sorry!

However, personally I'd start by simplifying the problem.

  1. Have a separate pimd router in the User LAN to take care of that LAN's IGMP
  2. Set up static multicast routes + source NAT on the Cisco for the multicast data that's supposed to pass through the DMZ to be modified
  3. Have the Cisco and the User LAN PIM routers talk PIM

That way the local PIM routers can take care of each of their respective LANs, acting as IGMP queriers etc.

troglobit avatar Oct 23 '18 20:10 troglobit

Our biggest complexity is that the multicast groups are externally (by another organization) defined, user defined (by non-administrator users), and change on a regular basis, which is why we are trying to go dynamic.

It appears in my cleanroom lab environment, that everything actually works, pimd on the NAT router is just leaking the RFC1918 addresses. I'm currently trying to understand pimd to see if I could patch it to masquerade the payloads on egress.

ruckc avatar Oct 24 '18 13:10 ruckc

If I understand correctly, the software in charge of multicast forwarding is the kernel, not pimd.

So, tweaking the addresses is probably a job for netfilter and iptables or some such module (though the iptables rules could be installed by pimd, but maybe doing this is outside of its specifications?)

Regards


jp-t avatar Oct 24 '18 14:10 jp-t

@jp-t my understanding was that pimd is the network service actually receiving and transmitting the PIM messages. (Ref pim.c send_pim())

I'm researching the various PIM payloads. I know the PIM Register payload contains the source address, and it appears send_pim() sets the address in "sin.sin_addr.s_addr = dst;"

ruckc avatar Oct 24 '18 16:10 ruckc

Additionally, the kernel appears to be getting a little confused: "ip mroute" shows Iif=pimreg for the multicast group coming from inside the NAT, and instead adds the inbound interface to the Oif list.

ruckc avatar Oct 24 '18 18:10 ruckc

Address translation was really never meant to be done inside of each routing application, like pimd. I've been trying to read up on how Cisco handles cases like this, with a DMZ, but every application I've seen so far has a dedicated PIM router for each LAN.

The kernel is not confused. The 'pimreg' interface is an actual interface created for PIM-SM as the register tunnel. It is used between PIM routers to forward multicast streams to each respective Rendez-vous Point in the network. I'd suggest, from your topology, to set up the Cisco as a static RP for your network.

troglobit avatar Oct 25 '18 06:10 troglobit

We are setting the Cisco as a static RP, but in our test environments we are using a linux based router w/ pimd as the static RP as our "core" router.

I'm going through the process of getting a repeatable build, and I'm adding a ton of logging to pimd for various odd things we are seeing.

ruckc avatar Oct 25 '18 13:10 ruckc

Ok, so progress (I think).

In send_pim_register, since my pimd router is dual-homed, when it tries to send a PIM Register to the RP it chooses the wrong source IP. It pulls the source IP from reg_src = uvifs[vifi].uv_lcl_addr; which contains the wrong network's IP.

[image]

When pimd, running on mc-dmz-nat, tries to send a PIM Register to the RP (mc-core), the source IP it chooses is 10.0.1.1, not 192.168.10.2. The kernel then returns EPERM (-1) on the sendto syscall. I have no firewall DENY/REJECTs in my iptables configuration at this point, and I've turned on Linux's auditd to trace the syscalls. The green multicast stream is getting to the mc-server, but the red stream isn't making it to the user.

I'm slowly peeling back the onion on this... if you (@troglobit) don't mind, I'd like to keep this going/open as I'm learning the process, in case it jogs something that may help us.

ruckc avatar Oct 25 '18 21:10 ruckc

@ruckc peel back all the layers of the onion, I don't mind keeping the discussion open here. It would be great if we could use this issue to document caveats around NAT and PIM. Good luck! :)

troglobit avatar Oct 26 '18 18:10 troglobit

So, I have the PIM Register packets getting masqueraded. I added a new configuration parameter (can be used multiple times):

private-network 10.0.0.0/24 masquerade 192.168.10.2

This currently modifies the ip struct in pim_proto.c's send_pim_register: if ip->ip_src.s_addr is inside the configured CIDR subnet, it is changed to the provided masquerade IP.

The next layer was modifying the resultant PIM Register Stop from the RP before processing it, since mc-dmz-nat was ignoring the "invalid" Stop message because it didn't have an accurate (S,G) pairing. So I added a very, very poor implementation of a connection tracking table: send_pim_register records an entry in the mapping table, and receive_pim_register_stop rewrites the S in the (S,G) pair in the Stop message back to the original source value. I have a TODO to improve tracking table management; currently it just adds a new entry to an array for each new PIM Register (S,G) set.

Now the packets are getting received by the mc-core router (RP/BSR), but somehow the pimd on the mc-dmz-nat machine is sending a PRUNE to mc-core (RP/BSR) with the real 10.0.0.0 source IP in it. In trying to trace this down, I can't figure out how these prunes are getting generated inside pimd, especially since grep and cscope are failing me at finding the definition of send_pim_join_prune.

*edit: I changed DR to RP... I've got mc-core configured as an RP/BSR.

ruckc avatar Oct 27 '18 01:10 ruckc

So I found send_jp_message, but my C is not very strong and the C voodoo is fairly heavy. Can anyone explain how pimd knows when to send a PRUNE to the DR?

Also, I forgot to mention: when mc-core gets the PRUNE, it breaks, because it can't find a unicast route back to mc-dmz-nat.

Namely the debug on mc-core looks like this:

Received PIM JOIN/PRUNE from 192.168.10.2 on eth1
Received PIM JOIN from 192.168.10.2 to group 224.1.1.5 for multicast source 192.168.23.81 on eth1
Received PIM PRUNE from 192.168.10.2 to group 224.1.1.5 for multicast source 10.0.1.44 on eth1
find_route: No (S,G) entry. Return the (*,G) entry for 224.1.1.5
find_route:(S,G) entry not found for source 10.0.1.44 and group 224.1.1.5
find_route: No SG|WC, return NULL
find_route: No (S,G) entry. Return the (*,G) entry for 224.1.1.5
NETLINK: ask path to 10.0.1.44
NETLINK: vif 0, ifindex=2
NETLINK: gateway is 192.168.23.1
For src 10.0.1.44, iff is 0, next hop router is 192.168.23.1: NOT A PIM ROUTER

Which basically tells me that, since the PRUNE contains 10.0.1.44 instead of my masqueraded source, the RP can't locate the (S,G) and gives up on whatever it needs to do after receiving the PRUNE.

I just can't figure out how that source (10.0.1.44) is making its way into the JOIN/PRUNE messages.

ruckc avatar Oct 27 '18 01:10 ruckc

A DR sends a prune towards the RP when there are no more local receivers. I think the code path you're looking for is here https://github.com/troglobit/pimd/blob/master/src/pim_proto.c#L2529

The source 10.0.1.44 is added there.

troglobit avatar Oct 27 '18 07:10 troglobit

That looks very promising, logit call first...

ruckc avatar Oct 27 '18 16:10 ruckc

So, that is now wired up to masquerade the source field in the J/P messages, and the JOIN/PRUNE messages are coming across properly.

On to the next layer of the onion: send_pim_null_register was emitting the 10.0.1.44 IP address. Patched that hole; now mc-core isn't seeing any 10.0.1.44 addresses, but it also isn't registering the multicast route for it. I have the mc-user machine sending an IGMPv3 join to 224.1.1.5, but since pimd on mc-core isn't showing the multicast route in ip mroute, I believe it isn't forwarding the multicast traffic, even though it is receiving the full PIMREG unicast stream from mc-dmz-nat with no errors being reported on mc-core.

ruckc avatar Oct 27 '18 17:10 ruckc

So before starting to modify pimd, mc-core was seeing the multicast route, with an "Iif: pimreg". Now it's not registering the multicast route. I'm guessing this is because the new "source" of the multicast route is 192.168.10.2 which is "local" to the mc-core.

Interesting to note: when it receives the PIM Register (w/ payload), find_route succeeds at finding the (S,G) for (192.168.10.2,224.1.1.5), but it logs "No output interfaces found for group 224.1.1.5 source 192.168.10.2", and then sends a PIM Register Stop back to mc-dmz-nat.

ruckc avatar Oct 27 '18 18:10 ruckc

So, final thoughts for the day: how does the multicast traffic embedded in a PIM Register message get sent back out to multicast consumers that have joined the multicast group?

ruckc avatar Oct 27 '18 18:10 ruckc

The PIM register tunnel is only an affair between PIM routers; on the edge towards consumers, the kernel forwards multicast based on the IGMP join received on the interface. So you should be able to handle NAT'ing of the multicast data in the kernel.
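Handling the NAT of the multicast data itself in the kernel might look roughly like the netfilter rule below. This is a hypothetical sketch only: the interface name and addresses are assumptions based on the thread's examples, and NAT of forwarded multicast can be sensitive to conntrack behavior, so treat it as a starting point rather than a known-good configuration.

```shell
# Hypothetical sketch: SNAT the forwarded multicast data itself on
# the egress interface, so receivers only ever see the public
# source.  eth1 and both addresses are assumed, not confirmed.
iptables -t nat -A POSTROUTING -o eth1 \
         -s 10.0.1.44 -d 224.0.0.0/4 \
         -j SNAT --to-source 192.168.10.2
```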

troglobit avatar Oct 28 '18 14:10 troglobit

So it looks then like I have to figure out why receive_pim_register in the mc-core pimd doesn't output the packet on the Oif interfaces...

ruckc avatar Oct 28 '18 17:10 ruckc

Well, pimd only decapsulates the frames, think of it as a tunnel endpoint. It's up to the kernel to actually forward the multicast to the Oif. Have you checked the rp_filter setting in the kernel?

troglobit avatar Oct 28 '18 18:10 troglobit

So, looking into the receive_pim_register method, it appears mc-core is sending a PIM Register Stop message back to mc-dmz-nat because the SPT bit is set, which would be why the traffic isn't getting sent out the Oif.

I re-encountered the EPERM (-1) result on the sendto syscall in send_pim_register, so I'm now masquerading the reg_src value to the same masquerade IP as for the private network. The reg_src value was getting set to 10.0.1.1, on the private side of mc-dmz-nat. I don't think this should be necessary, as Linux is supposed to route and NAT it properly, but making the change got the packets flowing out of mc-dmz-nat without the EPERM errors.

ruckc avatar Oct 29 '18 16:10 ruckc

So, yes. The PIM REGISTER packets are coming in to mc-core from mc-dmz-nat, and the mrtentry created has flags 0x2001, which includes both the MRTF_SG and MRTF_SPT flags; that causes receive_pim_register to send the PIM REGISTER STOP back to mc-dmz-nat.

When I turn on the mc-user's subscription to the 224.1.1.5 group, I see the IGMP membership report hit mc-core and it creates the vif for the group, but the route never shows up in ip mroute.

I also see mc-dmz-nat send PIM JOIN/PRUNE messages that log a JOIN and then a PRUNE to group 224.1.1.5. I'm checking the order of operations, to see whether the JOIN happens first or the PRUNE happens first.

ruckc avatar Oct 29 '18 16:10 ruckc

Odd; today the pimreg tunnel comes up, and the PIM REGISTER packets find an mrtentry with 0x0206 flags, which would be MRTF_WC, MRTF_RP, and MRTF_KERNEL_CACHE.

It looks like the packets are picking up a (*,G) route, and ip mroute is showing both 192.168.10.1 and 192.168.11.1 interfaces as Oif.

How does the PIM Register payload get from receive_pim_register to the kernel for routing to the multicast consumers? I've traced through the entire receive_pim_register, but I'm not seeing it hand the IP payload off to anything.

Should the kernel be receiving the PIM register messages also, and all pimd is doing is configuring the routing of those messages?

So I've got rp_filter=0, ip_forward=1, all ethX/mc_forwarding=1, ethX/forwarding=1.

I have no clue how to debug/troubleshoot the kernel if it isn't routing the pimreg traffic properly...

ruckc avatar Oct 30 '18 12:10 ruckc

So, using tcpdump, I've verified the packets are making it in on the pimreg interface (tcpdump -i pimreg), but the kernel isn't routing them anywhere. I'm not running any standard routing daemons; I've only ever used iptables for static routing with FORWARD rules. I guess it's time to find out how to do it the right way.

ruckc avatar Oct 30 '18 19:10 ruckc

Hopefully my last issue: it appears the mc-core kernel may be dropping the packets because the checksums are invalid... I've updated things to recalculate ip->ip_sum, which moved the problem to the UDP checksum. Now to figure out how to recalculate that; there are plenty of stackoverflow examples... I'd just assumed there would be a predefined simple C function lying around somewhere in glibc or the kernel sources...

ruckc avatar Nov 02 '18 03:11 ruckc

The IP header checksum is required, but the UDP checksum is optional in IPv4, so it should be sufficient to set it to zero. However, here's a function we've used internally at work:

/**
 * in_cksum - Checksum routine for Internet Protocol family headers
 * @addr: Pointer to buffer to checksum
 * @len:  Length of buffer
 *
 * Returns:
 * Computed checksum.
 */
unsigned short in_cksum (unsigned short *addr, int len)
{
   register int sum = 0;
   u_short answer = 0;
   register u_short *w = addr;
   register int nleft = len;

   /*
    * Our algorithm is simple, using a 32 bit accumulator (sum), we add
    * sequential 16 bit words to it, and at the end, fold back all the
    * carry bits from the top 16 bits into the lower 16 bits.
    */
   while (nleft > 1)
   {
      sum += *w++;
      nleft -= 2;
   }

   /* mop up an odd byte, if necessary */
   if (nleft == 1)
   {
      *(u_char *) (&answer) = *(u_char *) w;
      sum += answer;
   }

   /* add back carry outs from top 16 bits to low 16 bits */
   sum = (sum >> 16) + (sum & 0xffff);  /* add hi 16 to low 16 */
   sum += (sum >> 16);           /* add carry */
   answer = ~sum;                /* truncate to 16 bits */
   return (answer);
}

troglobit avatar Nov 02 '18 11:11 troglobit

I only noticed it based on tcpdump complaining about the multicast data coming out the pimreg interface.

@troglobit that is basically the same as pimd's inet.c inet_cksum function, and I never knew UDP checksumming was optional, so I'm going to try zeroing it for now, and try calculating it properly longer term.

ruckc avatar Nov 02 '18 12:11 ruckc