pdns icon indicating copy to clipboard operation
pdns copied to clipboard

dnsdist: add AF_XDP support for UDP

Open Y7n05h opened this issue 2 years ago • 51 comments

Signed-off-by: Y7n05h [email protected]

Short description

Added partial AF_XDP Socket (aka xsk) support for dnsdist's UDP client

Checklist

I have:

  • [x] read the CONTRIBUTING.md document
  • [x] compiled this code
  • [x] tested this code
  • [ ] included documentation (including possible behaviour changes)
  • [ ] documented the code
  • [ ] added or modified regression test(s)
  • [ ] added or modified unit test(s)

In order to maximize the advantages of xsk, I have tried to avoid the copying of frame to PacketBuffer. But this attempt was unsuccessful, because PacketBuffer is required by so much code that it is very difficult to change the implementation of PacketBuffer.

Y7n05h avatar May 28 '22 09:05 Y7n05h

This pr is currently just used to show the idea of my implementation. Adding xsk support to dnsdist would be quite complicated, I hope anyone can give me some good advice, if possible.

Y7n05h avatar May 28 '22 09:05 Y7n05h

Can someone tell me what cmsghdr is used for? In my understanding, cmsghdr can only be used for unix domain sockets, not for the Internet. But in dnsdist, it appears in multiple scenarios related to tcp and udp. Am I understanding something wrong?

https://github.com/PowerDNS/pdns/blob/5744e93f8726155dc477871521afcfcf998df212/pdns/dnsdist.cc#L1008-L1039

https://github.com/PowerDNS/pdns/blob/5744e93f8726155dc477871521afcfcf998df212/pdns/dnsdist.cc#L1434

Y7n05h avatar Jun 01 '22 15:06 Y7n05h

Can someone tell me what cmsghdr is used for? In my understanding, cmsghdr can only be used for unix domain sockets, not for the Internet. But in dnsdist, it appears in multiple scenarios related to tcp and udp. Am I understanding something wrong?

Yes, sendmsg and recvmsg can be used with tcp and udp. One purpose is to supply an iovec array to send or receive using multiple (non-contiguous) buffers. Another use is to set or get various associated parameters, like the source address used for outgoing UDP requests. That is what happening in this case. See e.g. http://man.openbsd.org/ip.4 for some other uses.

omoerbeek avatar Jun 01 '22 15:06 omoerbeek

Yes, sendmsg and recvmsg can be used with tcp and udp. One purpose is to supply an iovec array to send or receive using multiple (non-contiguous) buffers. Another use is to set or get various associated parameters, like the source address used for outgoing UDP requests. That is what happening in this case. See e.g. http://man.openbsd.org/ip.4 for some other uses.

Thanks for the quick reply. I fully understand being able to use iovec in tcp udp via sendmsg and recvmsg and being able to get source address information with them, but what I'm wondering is whether msghdr.msg_control has any practical meaning? Because they don't appear in the Ethernet payload.

Y7n05h avatar Jun 01 '22 15:06 Y7n05h

what I'm wondering is whether msghdr.msg_control has any practical meaning? Because they don't appear in the Ethernet payload.

We use the header to communicate with the kernel, mostly for the purpose of getting or setting the source address, as you have already found out. So that information doesn't end up in the Ethernet payload directly, but it does influence what the kernel is doing with our datagram.

rgacogne avatar Jun 01 '22 15:06 rgacogne

We use the header to communicate with the kernel, mostly for the purpose of getting or setting the source address, as you have already found out. So that information doesn't end up in the Ethernet payload directly, but it does influence what the kernel is doing with our datagram.

Okay, I understand. Thank you very much.

Y7n05h avatar Jun 01 '22 16:06 Y7n05h

Setting the source address for outgoing traffic does influence what's on the wire, as the IP header contains both source and destination IP address. There are several ways to set the source address used by a socket: not doing it (the kernel will select one based on the destination and routing table), using bind(2) to set a specific one used for all outgoing packets, or on a per-packet basis. That latter method is used here.

omoerbeek avatar Jun 01 '22 16:06 omoerbeek

Sorry, I just committed and pushed something by mistake. I've now got xsk support for UDP mostly done, but all the code has not been tested yet and I will finish testing it in the next few days.

Y7n05h avatar Jun 08 '22 17:06 Y7n05h

One thing I might want to clarify: I used and partially modified some of the code from https://github.com/google/packetdrill/blob/master/gtests/net/packetdrill/packet_checksum.c . https://github.com/PowerDNS/pdns/blob/a8a5943a748e70abb96830b1c3a92ef773adf752/pdns/dnsdistdist/xsk.cc#L492-L614 They are under GPLv2 or later. Dnsdist is GPLv3 or later.So they are compatible.

Y7n05h avatar Jun 08 '22 18:06 Y7n05h

I've now got xsk support for UDP mostly done, but all the code has not been tested yet and I will finish testing it in the next few days.

That's awesome! Let me know when you think this is ready for a review and I'll try to do it quickly. It does not have to be 100% finished at this point, to be clear.

rgacogne avatar Jun 09 '22 07:06 rgacogne

That's awesome! Let me know when you think this is ready for a review and I'll try to do it quickly. It does not have to be 100% finished at this point, to be clear.

This pr has a little bit of work left before it gets reviewed, and I'll finish it before I start tomorrow. But before that, could you please review #11526 first? The changes I made to the BPF program are based on #11526 .

Y7n05h avatar Jun 09 '22 10:06 Y7n05h

I tested the code and I found that it doesn't work as I expected here. I was hoping to get std::shared_ptr<XskSocket>, but I got std::function<std::tuple<DNSName, unsigned short, unsigned short> (DNSName const&, unsigned short, unsigned short, dnsheader*)>

https://github.com/PowerDNS/pdns/blob/9cc6fe80f0ad6fb88bb69fb146fd35f3eedfa6c7/pdns/dnsdist-lua.cc#L109 https://github.com/PowerDNS/pdns/blob/9cc6fe80f0ad6fb88bb69fb146fd35f3eedfa6c7/pdns/dnsdist-lua.cc#L140-L148

I don't understand lua and don't quite understand how to make this work as I expect.

Y7n05h avatar Jun 13 '22 07:06 Y7n05h

Can you show us the Lua code you used to get that result? The LuaWrapper class we use is not always easy to understand, but we should be able to help you with that.

rgacogne avatar Jun 13 '22 08:06 rgacogne

Can you show us the Lua code you used to get that result? The LuaWrapper class we use is not always easy to understand, but we should be able to help you with that.

I have not changed any Lua files. All the files I used for testing (except for the dnsdist config file) I have committed. I just print the type information for (*vars)["xskSocket"] and found them to be inconsistent with what I expected. https://github.com/PowerDNS/pdns/blob/9cc6fe80f0ad6fb88bb69fb146fd35f3eedfa6c7/pdns/dnsdist-lua.cc#L146

My config file is similar to below.

addCapabilitiesToRetain({"CAP_NET_BIND_SERVICE","CAP_BPF","CAP_NET_RAW","CAP_SYS_ADMIN"})
xsk=newXsk({NIC_queue_id=0,frameNums=65536,ifName="wlan0"})
newServer({address="1.1.1.1",MACAddr="aa:aa:aa:aa:aa:aa",xskSocket=xsk})
newServer({address="1.0.0.1",MACAddr="aa:aa:aa:aa:aa:aa",xskSocket=xsk})
addACL("192.168.30.0/24")
setServerPolicy(firstAvailable)

I want to pass a std::shared_ptr<XskSocket> via xskSocket=xsk.

Y7n05h avatar Jun 13 '22 09:06 Y7n05h

My config file is similar to below.

Thanks, that's what I meant since our configuration is written in Lua :) I'll do some tests in a couple hours.

rgacogne avatar Jun 13 '22 09:06 rgacogne

By the way, what is the preferred way to handle TCP packets?

  • One option is to use a userspace network stack, e.g. mTCP, f-stack, something like them. After bypassing the kernel network stack, using a userspace network stack seems to be a popular solution among DPDK users.
  • Or we don't use the userspace network stack and just manually parse TCP packets and selectively support a subset of TCP. (knot uses this method. )

Implementing the full TCP protocol manually is very difficult, and testing the correctness of the implementation may be more difficult than implementing it.

Oh yeah, and the two options above are not just for TCP, the above discussion also applies to ip options and ipv6 Extension headers, which I don't support in the current code.

Y7n05h avatar Jun 13 '22 09:06 Y7n05h

Do you have any rough idea of how much work each option represents? I don't mind adding another dependency for AF_XDP support, provided that the license is compatible with our GPLv2, so if it gives us better TCP support and is easier that's fine by me. On the other hand manually handling a subset of TCP is also fine, and we can add better support via a library later if that makes sense.

rgacogne avatar Jun 13 '22 10:06 rgacogne

Do you have any rough idea of how much work each option represents? I don't mind adding another dependency for AF_XDP support, provided that the license is compatible with our GPLv2, so if it gives us better TCP support and is easier that's fine by me. On the other hand manually handling a subset of TCP is also fine, and we can add better support via a library later if that makes sense.

I'm not quite sure yet. I am trying both scenarios separately. I may be a few days away from getting the answer. But maybe don't worry about the license issue, both mTCP and f-stack use the BSD license. (I remember BSD is compatible with GPLv2. But I'm not very sure about this, because there are too many variants of BSD)

Y7n05h avatar Jun 13 '22 10:06 Y7n05h

My config file is similar to below.

Thanks, that's what I meant since our configuration is written in Lua :) I'll do some tests in a couple hours.

I can reproduce the issue. With the config you provided it happens on newServer(), and we get a DownstreamState::checkfunc_t instead of a std::shared_ptr<XskSocket> which seems to be a LuaWrapper bug (or limitation). I'll keep digging.

rgacogne avatar Jun 13 '22 11:06 rgacogne

So the bad news is that I don't understand why Lua thinks that our custom shared_ptr<XskSocket> is a function (lua_isfunction() return true), and it messes up the type deduction.

The good news is that you can just work around that by defining your type before the function type, so that the type deduction tries it first:

-  typedef LuaAssociativeTable<boost::variant<bool, std::string, LuaArray<std::string>, DownstreamState::checkfunc_t,std::shared_ptr<XskSocket>>> newserver_t;
+  typedef LuaAssociativeTable<boost::variant<bool, std::string, LuaArray<std::string>, std::shared_ptr<XskSocket>, DownstreamState::checkfunc_t>> newserver_t;

and it should work well enough for now.

rgacogne avatar Jun 13 '22 16:06 rgacogne

I have tested dnsdist on my LAN and it works fine so far. Here is the config file I used in my testing :

addCapabilitiesToRetain({"CAP_NET_BIND_SERVICE","CAP_BPF","CAP_NET_RAW","CAP_SYS_ADMIN"})
xsk=newXsk({NIC_queue_id=0,frameNums=65536,ifName="wlan0",xskMapPath="/sys/fs/bpf/dnsdist/xskmap"})
newServer({address="1.1.1.1",MACAddr="be:20:a1:87:a5:ff",xskSocket=xsk,source="192.168.30.211:5301"})
newServer({address="1.0.0.1",MACAddr="be:20:a1:87:a5:ff",xskSocket=xsk,source="192.168.30.211:5302"})
newServer({address="8.8.8.8",MACAddr="be:20:a1:87:a5:ff",xskSocket=xsk,source="192.168.30.211:5303"})
setLocal("192.168.30.211:5300",{xskSocket=xsk})
addACL("192.168.30.211/24")
setServerPolicy(firstAvailable)

I will write documentation after my design is accepted. So I may now need to explain the contents of this config file to make it easier to review my code. In order to use AF_XDP, you must fill in source in newServer, and source cannot be 0.0.0.0. Since it's cumbersome to handle arp requests manually, I didn't implement it. So MACAddr is also a required field. Its value is usually the MAC address of the gateway of the machine where you are running dnsdist.

Don't forget to add all ports used in AF_XDP in xdp.py, such as:

Ports = [5300,5301,5302,5303]

Please remember not to use a port already used in AF_XDP in another application, that will cause that application to not work properly.

Y7n05h avatar Jun 20 '22 12:06 Y7n05h

That is very promising, nice work! I have not looked at the code yet but I'll do a review today or tomorrow. In the meantime I have a couple questions out of curiosity:

Its value is usually the MAC address of the gateway of the machine where you are running dnsdist.

I guess it will be the MAC address of the server if it is on the local network?

Don't forget to add all ports used in AF_XDP in xdp-filter.ebpf.src, such as:

I wonder if we could use an eBPF map to make that more dynamic?

rgacogne avatar Jun 20 '22 13:06 rgacogne

I guess it will be the MAC address of the server if it is on the local network?

Yes.

I wonder if we could use an eBPF map to make that more dynamic?

Yes we could do it, but I'm not sure it's worth it. In my opinion there won't be a lot of ports, and it might be better performance to write them in the code. Of course, if you think it's necessary, I'd love to do it.

Y7n05h avatar Jun 20 '22 13:06 Y7n05h

Yes we could do it, but I'm not sure it's worth it. In my opinion there won't be a lot of ports, and it might be better performance to write them in the code. Of course, if you think it's necessary, I'd love to do it.

Understood! I'm not sure it's worth it either, so let's keep it as it is for now.

rgacogne avatar Jun 20 '22 13:06 rgacogne

Do you have any rough idea of how much work each option represents? I don't mind adding another dependency for AF_XDP support, provided that the license is compatible with our GPLv2, so if it gives us better TCP support and is easier that's fine by me. On the other hand manually handling a subset of TCP is also fine, and we can add better support via a library later if that makes sense.

I'm not quite sure yet. I am trying both scenarios separately. I may be a few days away from getting the answer. But maybe don't worry about the license issue, both mTCP and f-stack use the BSD license. (I remember BSD is compatible with GPLv2. But I'm not very sure about this, because there are too many variants of BSD)

These days, I've tried both scenarios separately. Unfortunately AF_XDP is currently not easy to combine with userspace network stacks like f-stack or mTCP. They support DPDK well, but there may be a lot of work to be done to get f-stack or mTCP to work with AF_XDP. So I prefer to implement a subset of TCP manually. knot is implemented like this. At least I can get some experience and inspiration from knot.

Y7n05h avatar Jun 20 '22 13:06 Y7n05h

These days, I've tried both scenarios separately. Unfortunately AF_XDP is currently not easy to combine with userspace network stacks like f-stack or mTCP. They support DPDK well, but there may be a lot of work to be done to get f-stack or mTCP to work with AF_XDP. So I prefer to implement a subset of TCP manually. knot is implemented like this. At least I can get some experience and inspiration from knot.

That seems like the right thing to do :+1:

rgacogne avatar Jun 20 '22 13:06 rgacogne

I have a nice way to handle manually implemented features that are not supported by TCP. If the peer uses features that are not supported by the manually implemented subset of TCP based on AF_XDP, we can add the peer's ip and port to the BPF Map to prevent AF_XDP BPF programs from redirecting their future packets to userspace. Because the kernel and applications usually take a retry policy for network packets that are not correctly accepted, dnsdist can process these packets correctly through the kernel network stack during retry.

This uses the kernel networking stack as a fallback so that dnsdist always interacts with clients normally.

Y7n05h avatar Jun 23 '22 12:06 Y7n05h

Oh, that sounds very nice! I guess we would need to start the "regular" threads in addition to the AD_XDP ones, right? If so, that's perfectly fine.

rgacogne avatar Jun 23 '22 12:06 rgacogne

Oh, that sounds very nice! I guess we would need to start the "regular" threads in addition to the AD_XDP ones, right? If so, that's perfectly fine.

Yes. I think we just need to implement fallback for the client. Because it is uncertain which features of tcp the client uses, but the server was manually added by the administrator and enabled the AF_XDP feature, so if the server cannot communicate with dnsdist using xsk, it is an error, and fallback should not be used.

Y7n05h avatar Jun 23 '22 12:06 Y7n05h

I see that this is now conflicted because we merged other changes in the meantime. Do not worry about rebasing it, I can do that later so you can focus on this PR.

rgacogne avatar Jun 27 '22 15:06 rgacogne