pimd on a NAT router
We're trying to use pimd to route multicast into and out of a NAT'd subnet. We are successfully getting the data into the subnet through the NAT with no issues, but on egress the data gets out, yet in the IGMP payload on the public network the source IP is the NAT'd RFC 1918 non-routable address.
Is there any way to manipulate the PIM/IGMP payloads to masquerade the private IP addresses?
I'm not really sure I fully understand your setup. Maybe you could provide a network topology of sender and receiver?
Generally speaking, there are lots of trap doors and booby traps surrounding multicast and NAT. For instance, IGMP is supposed to be link-local only and is not routed. PIM routers peer using PIM messages on a shared LAN, or over a tunnel (e.g. GRE); IGMP is only used at layer 2 on the sender and receiver side.
Alright, I tried to make a diagram. For the sake of completeness I included the firewalls, but for all intents and purposes they are transparent to our multicast/IGMP/PIM traffic.
Essentially the MC data comes from the WAN, through the security layers in the DMZ, gets modified there, then goes back out of the DMZ into the User LAN for consumption.
So the issue is that the MC sender inside the DMZ, behind the IPTables/NAT/PIMD router, has an RFC 1918 address. The MC receivers in the User LAN receive the IGMP membership query, which contains the RFC 1918 private IP address as the source address. Since they don't have a unicast route to that address, they fail.
Ideally, I'd like to convince pimd (or use iptables) to modify the IGMP packets leaving the IPTables/NAT/PIMD router for the larger network, masquerading the RFC 1918 address embedded in the source address field.
Also, it appears the PIM Register message contains the private IP address when being sent to the core router.
So I've created a simulation of the graphic above, with 4 hosts plus the core router and NAT router, both running pimd. In cleanroom testing, on the core router the RFC 1918 address shows up in the "ip mroute" output and the "pimd -r" output.
This is set up by running "sockperf server" on the MC server and "sockperf ping-pong" on the MC receiver workstation. It appears to function properly, but I believe having the RFC 1918 address exposed in the multicast traffic is confusing our more complicated multicast receivers, since they don't have a unicast route to it.
I've never attempted anything like what you're trying to do, so I honestly don't know if I can help you. Sorry!
However, personally I'd start by simplifying the problem.
- Have a separate pimd router in the User LAN to take care of that LAN's IGMP
- Set up static multicast routes + source NAT on the Cisco for the multicast data that's supposed to pass through the DMZ to be modified
- Have the Cisco and the User LAN PIM routers talk PIM
That way the local PIM routers can take care of each of their respective LANs, acting as IGMP queriers etc.
Our biggest complexity is that the multicast groups are externally defined (by another organization), user defined (by non-administrator users), and change on a regular basis, which is why we are trying to go dynamic.
It appears in my cleanroom lab environment that everything actually works; pimd on the NAT router is just leaking the RFC 1918 addresses. I'm currently trying to understand pimd to see if I could patch it to masquerade the payloads on egress.
If I understand correctly, the software in charge of multicast forwarding is the kernel, not pimd.
So tweaking the addresses is probably a job for netfilter/iptables or a similar module (pimd could install iptables rules itself, but that is probably outside its scope?)
Regards
@jp-t my understanding was that pimd is the network service actually receiving and transmitting the PIM messages (ref. pim.c's send_pim()).
I'm researching the various PIM payloads. I know the PIM Register payload contains the source address, and it appears send_pim() sets the address in "sin.sin_addr.s_addr = dst;".
Additionally, the kernel appears to be getting a little confused: with "ip mroute" it shows "Iif=pimreg" for the multicast group coming from inside the NAT, and instead adds the inbound interface to Oif.
Address translation was really never meant to be done inside of each routing application, like pimd. I've been trying to read up on how Cisco handles cases like this, with a DMZ, but every application I've seen so far has a dedicated PIM router for each LAN.
The kernel is not confused. The 'pimreg' interface is an actual interface created for PIM-SM as the register tunnel. It is used between PIM routers to forward multicast streams to each respective Rendez-vous Point in the network. I'd suggest, from your topology, to set up the Cisco as a static RP for your network.
We are setting the Cisco as a static RP, but in our test environments we are using a linux based router w/ pimd as the static RP as our "core" router.
I'm going through the process of getting a repeatable build, and I'm adding a ton of logging to pimd for the various odd things we are seeing.
Ok, so progress (I think).
In send_pim_register, since my pimd router is dual-homed, when it tries to send a PIM Register to the RP it chooses the wrong source IP. It pulls the source from "reg_src = uvifs[vifi].uv_lcl_addr;", which contains the wrong network's IP.
When pimd running on mc-dmz-nat tries to send a PIM Register to the RP (mc-core), the source IP it chooses is 10.0.1.1, not 192.168.10.2. The kernel then returns EPERM (-1) on the sendto syscall. I have no firewall DENY/REJECTs in my iptables configuration at this point, and I've turned on Linux's auditd to trace the syscalls. The green multicast stream is getting to the mc-server, but the red stream isn't making it to the user.
I'm slowly peeling back the onion on this... if you (@troglobit) don't mind, I'd like to keep this issue going/open as I'm learning the process, in case it jogs something that may help us.
@ruckc peel back all the layers of the onion, I don't mind keeping the discussion open here. It would be great if we could use this issue to document caveats around NAT and PIM. Good luck! :)
So, I have the PIM Register packets getting masqueraded. I added a new configuration parameter (can be used multiple times):
private-network 10.0.0.0/24 masquerade 192.168.10.2
This currently modifies the ip struct in pim_proto.c's send_pim_register: if ip_src->s_addr is inside the CIDR subnet, it changes ip->ip_src.s_addr to the provided masquerade IP.
The next layer was modifying the resultant PIM Register Stop from the RP before processing it, as mc-dmz-nat was ignoring the "invalid" Stop message since it didn't have an accurate (S,G) pairing. So I added a very, very poor implementation of a connection tracking table: send_pim_register records an entry in the mapping table, and receive_pim_register_stop modifies the S in the (S,G) pair in the Stop message back to the original source value. I have a TODO currently to improve tracking table management, but currently it just adds a new entry to an array for each new PIM Register (S,G) set.
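A minimal sketch of what such a tracking table could look like (all names are illustrative, not the actual patch; like the patch, it just appends to a fixed-size array with no expiry):

```c
#include <netinet/in.h>

#define MAX_MAP 64

struct sg_map {
    struct in_addr orig_src; /* RFC 1918 source as seen on the DMZ side */
    struct in_addr masq_src; /* public source written into the Register */
    struct in_addr group;
};

static struct sg_map sg_table[MAX_MAP];
static int sg_count;

/* Would be called from send_pim_register() after rewriting the source */
static void sg_record(struct in_addr orig, struct in_addr masq, struct in_addr grp)
{
    int i;

    for (i = 0; i < sg_count; i++)
        if (sg_table[i].masq_src.s_addr == masq.s_addr &&
            sg_table[i].group.s_addr == grp.s_addr)
            return;             /* already tracked */

    if (sg_count < MAX_MAP)
        sg_table[sg_count++] = (struct sg_map){ orig, masq, grp };
}

/* Would be called from receive_pim_register_stop() to restore the S */
static int sg_restore(struct in_addr *src, struct in_addr grp)
{
    int i;

    for (i = 0; i < sg_count; i++)
        if (sg_table[i].masq_src.s_addr == src->s_addr &&
            sg_table[i].group.s_addr == grp.s_addr) {
            *src = sg_table[i].orig_src;
            return 1;
        }

    return 0;
}
```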
Now the packets are getting received by the mc-core router (RP/BSR), but somehow the pimd on the mc-dmz-nat machine is sending a PRUNE to the mc-core (RP/BSR) with the real 10.0.0.0 source IP in it. In trying to trace this down, I can't figure out how these prunes are getting generated inside pimd, especially since grep and cscope are failing me at finding the definition of send_pim_join_prune.
*edit: I changed DR to RP... I've got mc-core configured as an RP/BSR.
So I found send_jp_message, but my C is not very strong and the C voodoo is fairly heavy. Can anyone explain how pimd knows when to send a PRUNE to the DR?
Also, forgot to mention: when mc-core gets the prune, it breaks because it can't find a unicast route back to mc-dmz-nat.
Namely, the debug on mc-core looks like this:
Received PIM JOIN/PRUNE from 192.168.10.2 on eth1
Received PIM JOIN from 192.168.10.2 to group 224.1.1.5 for multicast source 192.168.23.81 on eth1
Received PIM PRUNE from 192.168.10.2 to group 224.1.1.5 for multicast source 10.0.1.44 on eth1
find_route: No (S,G) entry. Return the (*,G) entry for 224.1.1.5
find_route:(S,G) entry not found for source 10.0.1.44 and group 224.1.1.5
find_route: No SG|WC, return NULL
find_route: No (S,G) entry. Return the (*,G) entry for 224.1.1.5
NETLINK: ask path to 10.0.1.44
NETLINK: vif 0, ifindex=2
NETLINK: gateway is 192.168.23.1
For src 10.0.1.44, iff is 0, next hop router is 192.168.23.1: NOT A PIM ROUTER
Which basically tells me that since the PRUNE contains 10.0.1.44 instead of my masqueraded source, the DR can't locate the (S,G) and it gives up doing whatever it needs to do after receiving the PRUNE.
I just can't figure out how that source (10.0.1.44) is making its way into the JOIN/PRUNE messages.
A DR sends a prune towards the RP when there are no more local receivers. I think the code path you're looking for is here https://github.com/troglobit/pimd/blob/master/src/pim_proto.c#L2529
The source 10.0.1.44 is added there.
That looks very promising, logit call first...
So, that is wired up to masquerade that source field in the JP messages, and the JOIN/PRUNE messages are coming across properly.
On to the next layer of the onion. send_pim_null_register was emitting the 10.0.1.44 IP address. Patched that hole; now mc-core isn't seeing any 10.0.1.44 addresses, but it also isn't registering the multicast route for it either. I have the mc-user machine sending an IGMPv3 join to 224.1.1.5, but I believe that since pimd on mc-core isn't showing the multicast route in ip mroute, it isn't forwarding the multicast traffic, even though it is receiving the full PIMREG unicast stream from mc-dmz-nat with no errors being reported on mc-core.
So before I started modifying pimd, mc-core was seeing the multicast route, with an "Iif: pimreg". Now it's not registering the multicast route. I'm guessing this is because the new "source" of the multicast route is 192.168.10.2, which is "local" to mc-core.
Interesting things to note: when it receives the PIM Register (w/ payload), find_route is successful at finding the (S,G) for (192.168.10.2, 224.1.1.5), but it is logging "No output interfaces found for group 224.1.1.5 source 192.168.10.2", and then sends a PIM Register Stop back to mc-dmz-nat.
So, final thoughts for the day. How does the multicast traffic embedded in a PIM Register message get sent back out to multicast consumers that have joined the multicast group?
The PIM register tunnel is only an affair between PIM routers, on the edge towards consumers the kernel forwards multicast based on the IGMP join received on the interface. So you should be able to handle NAT:ing of the multicast data in the kernel.
So it looks then like I have to figure out why, in the mc-core pimd, receive_pim_register doesn't output the packet out the Oif interfaces...
Well, pimd only decapsulates the frames; think of it as a tunnel endpoint. It's up to the kernel to actually forward the multicast to the Oif. Have you checked the rp_filter setting in the kernel?
So, looking into the receive_pim_register method, it appears mc-core is sending a PIM Register Stop message back to mc-dmz-nat because the SPT bit is set, which would be why the traffic isn't getting sent out the Oif.
I re-encountered the issue with the EPERM (-1) result on the sendto syscall in send_pim_register, so I'm now masquerading the reg_src value to the same as the masquerade_ip for the private network. The reg_src value was getting set to 10.0.1.1, on the private side of mc-dmz-nat. I don't think this should be necessary, as Linux is supposed to route & NAT it properly, but making the change got the packets to flow out of mc-dmz-nat without the EPERM errors.
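The reg_src override described here amounts to a one-liner; a hypothetical helper (pick_reg_src is an illustrative name, not pimd code; all addresses and the mask in network byte order) might look like:

```c
#include <stdint.h>

/* If the vif's local address is inside the private network, use the
 * masquerade address as the Register source instead. */
static uint32_t pick_reg_src(uint32_t lcl_addr, uint32_t priv_net,
                             uint32_t priv_mask, uint32_t masq_addr)
{
    if ((lcl_addr & priv_mask) == (priv_net & priv_mask))
        return masq_addr;

    return lcl_addr;
}
```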
So, yes. The PIM Register packets are coming in to mc-core from mc-dmz-nat, and the mrtentry created has flags 0x2001, which is both the MRTF_SG and MRTF_SPT flags, which causes receive_pim_register to send the PIM Register Stop back to mc-dmz-nat.
When I turn on mc-user's subscription to the 224.1.1.5 group, I see the IGMP membership report hit mc-core and it creates the vif for the group, but it never shows up in ip mroute.
I also see mc-dmz-nat send PIM JOIN/PRUNE messages that log JOIN then PRUNE to group 224.1.1.5. I'm checking the order of operations, whether JOIN happens first or PRUNE happens first.
Odd, now today the pimreg tunnel comes up, and the PIM Register packets find an mrtentry with 0x0206 flags, which would be MRTF_WC, MRTF_RP, and MRTF_KERNEL_CACHE. It looks like the packets are picking up a (*,G) route, and ip mroute is showing both the 192.168.10.1 and 192.168.11.1 interfaces as Oif.
How does the PIM Register payload get from receive_pim_register to the kernel for routing to the multicast consumers? I've traced through the entire receive_pim_register, but I'm not seeing it hand off the ip payload to anything.
Should the kernel be receiving the PIM Register messages also, with pimd only configuring the routing of those messages?
So I've got rp_filter=0, ip_forward=1, all ethX/mc_forwarding=1, ethX/forwarding=1.
I have no clue how to debug/troubleshoot the kernel if it isn't routing the pimreg traffic properly...
So, using tcpdump, I've verified the packets are making it in from the pimreg interface (tcpdump -i pimreg), but it isn't routing them anywhere. I'm not running any standard routing daemons; I've only ever used iptables for static routing with FORWARD rules. I guess it's time to find out how to do it the right way.
Hopefully my last issue: it appears the mc-core kernel may be dropping the packets because the checksums are invalid. I've updated things to recalculate ip->ip_sum, which moved the problem to the UDP checksum. Now to figure out how to recalculate that; there are plenty of StackOverflow examples... I'd just have thought there would be a predefined simple C function lying around somewhere in glibc or the kernel sources...
IP header checksum is required, but the UDP header checksum is optional. Should be sufficient to set it to zero. However, here's a function we've used internally at work:
/**
 * in_cksum - Checksum routine for Internet Protocol family headers
 * @addr: Pointer to buffer to checksum
 * @len: Length of buffer
 *
 * Returns:
 * Computed checksum.
 */
unsigned short in_cksum(unsigned short *addr, int len)
{
    register int sum = 0;
    u_short answer = 0;
    register u_short *w = addr;
    register int nleft = len;

    /*
     * Our algorithm is simple: using a 32-bit accumulator (sum), we add
     * sequential 16-bit words to it, and at the end fold back all the
     * carry bits from the top 16 bits into the lower 16 bits.
     */
    while (nleft > 1) {
        sum += *w++;
        nleft -= 2;
    }

    /* mop up an odd byte, if necessary */
    if (nleft == 1) {
        *(u_char *)(&answer) = *(u_char *)w;
        sum += answer;
    }

    /* add back carry outs from top 16 bits to low 16 bits */
    sum = (sum >> 16) + (sum & 0xffff); /* add hi 16 to low 16 */
    sum += (sum >> 16);                 /* add carry */
    answer = ~sum;                      /* truncate to 16 bits */

    return answer;
}
I only noticed it based on tcpdump complaining about the multicast data coming out the pimreg interface.
@troglobit that is basically the same as pimd's inet.c inet_cksum function. I never knew UDP checksumming was optional, so I'm going to try zeroing it for now, and longer term try to calculate it.
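For the "calculate it" option: the UDP checksum covers an IPv4 pseudo-header (source and destination address, protocol, UDP length), so it has to be recomputed whenever the addresses are rewritten. A sketch, with sum16() and udp_cksum() as illustrative names rather than pimd functions, assuming the checksum field inside the datagram has already been zeroed:

```c
#include <netinet/in.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Accumulate 16-bit big-endian words; pad an odd trailing byte with zero */
static uint32_t sum16(const uint8_t *p, size_t len, uint32_t sum)
{
    while (len > 1) {
        sum += (uint32_t)(p[0] << 8 | p[1]);
        p   += 2;
        len -= 2;
    }
    if (len)
        sum += (uint32_t)(p[0] << 8);

    return sum;
}

/* src/dst in network byte order; udp points at the UDP header, len is the
 * full UDP length (header + payload) with the checksum field zeroed.
 * Returns the checksum in network byte order, ready to copy into place. */
static uint16_t udp_cksum(uint32_t src, uint32_t dst, const uint8_t *udp, size_t len)
{
    uint8_t ph[12];
    uint32_t sum;

    /* IPv4 pseudo-header: src, dst, zero, protocol, UDP length */
    memcpy(ph, &src, 4);
    memcpy(ph + 4, &dst, 4);
    ph[8]  = 0;
    ph[9]  = IPPROTO_UDP;
    ph[10] = (len >> 8) & 0xff;
    ph[11] = len & 0xff;

    sum = sum16(ph, sizeof(ph), 0);
    sum = sum16(udp, len, sum);

    /* fold carries and take one's complement */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    sum = ~sum & 0xffff;

    return htons(sum ? sum : 0xffff); /* 0 means "no checksum" in UDP/IPv4 */
}
```

Zeroing the field, as suggested above, is the simpler route; recomputing is only needed if some receivers insist on a valid checksum.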