dnsdist: newBPFFilter does not work on recent Linux kernel
I know this is rather a support question. @omoerbeek would like us to discuss it here later with @rgacogne
- Program: dnsdist 1.7.2
- Issue type: Error report
Short description
After upgrading the OS from SLES15-SP3 to SLES15-SP4 and the Linux kernel from 5.3 to 5.14, the newBPFFilter() function in dnsdist does not work anymore.
If I start dnsdist via systemd, it fails with this error message in the journal: Fatal Lua error: [string "chunk"]:294: Caught exception: Error creating a BPF map of size 1024: Operation not permitted
Environment
- Operating system: SLES15-SP4, Linux 5.14
- Software version: dnsdist 1.7.2
- Software source: self compiled
Steps to reproduce
This is the function call that doesn't work:
bpf = newBPFFilter(1024, 1024, 1024)
Other information
We already discussed this issue on #powerdns. Here is the transcript:
<winfried> Hi, in dnsdist 1.7.0 I have this Statement: bpf = newBPFFilter(1024, 1024, 1024)
<winfried> It does not work anymore in 1.7.2: Caught exception: Error creating a BPF map of size 1024: Operation not permitted
<Habbie> your syntax is correct
<Habbie> the problem lies deeper
<Habbie> probably, indeed, permissions/capabilities
<Habbie> also the formatting of that bit of docs is terrible
<ottom> yes, the notes about net.core.optmem_max and RLIMIT_MEMLOCK and the capabilities should be grouped
<ottom> there is a PR to handle RLIMIT_MEMLOCK automatically, btw
<ottom> winfried: are you running with systemd? in that case, please show your unit file
<winfried> ottom: https://p.6core.net/p/WSBtmRcVAYaBzK47GOYZvlyd
<winfried> ottom: Linux 5.14.21
<ottom> winfried: is this error reported on startup or later when changing config?
<winfried> ottom: this is on startup
<ottom> i think you are at least missing capabilities, but i have no clue yet why this does not show on 1.7.0 for you
<winfried> ottom: because I have a new Linux kernel on that box I guess
<ottom> ah, good to know not only dndist changed
<ottom> *dnsdist
<winfried> ottom: Yes sorry about that! I'm testing a OS upgrade process
<ottom> winfried: below works for me. I do not know enough about systemd if both lines are needed:
<ottom> CapabilityBoundingSet=CAP_NET_BIND_SERVICE CAP_SYS_ADMIN
<ottom> AmbientCapabilities=CAP_NET_BIND_SERVICE CAP_SYS_ADMIN
<ottom> if you also want to change the eBPF config at runtime, you also might need to use https://dnsdist.org/reference/config.html#addCapabilitiesToRetain
<ottom> alternatively, kernel.unprivileged_bpf_disabled can be tweaked. But https://dnsdist.org/advanced/ebpf.html#requirements suggest that is only relevant for kernel >= 5.15
<winfried> ottom: still "Error creating a BPF map of size 1024: Operation not permitted" after changing CapabilityBoundingSet and AmbientCapabilities and systemctl daemon-reload
<winfried> ottom: Do I have to call it like this: addCapabilitiesToRetain("CAP_SYS_ADMIN")
<ottom> winfried: yes, that should work, it is a single string or a list of strings
<ottom> if that does not work, I'm out of ideas
<winfried> ottom: no luck so far
<ottom> whats your value of kernel.unprivileged_bpf_disabled?
<ottom> sysctl kernel.unprivileged_bpf_disabled
<winfried> ottom: 2
<ottom> same as here, but i'm running 5.4.0
<ottom> I'm sure Remi would have a clue, but he's on vacation
<ottom> winfried: it would be nice if you can collect the info from this chat in an issue
<ottom> if you have a chance
<winfried> ottom: Yes of course I can file a GH issue
It looks like the hints in the docs are not enough to get this working on a 5.14 kernel, suggesting this is a docs issue. As mentioned, the docs could use some reorganizing as well.
Coming back to this and re-reading things, I think CAP_BPF instead of CAP_SYS_ADMIN should be used for recent (>= 5.8) kernels. Dunno why I missed that, as the unit file mentions it.
Setting:
CapabilityBoundingSet=CAP_NET_BIND_SERVICE CAP_SYS_ADMIN
AmbientCapabilities=CAP_NET_BIND_SERVICE CAP_SYS_ADMIN
LimitMEMLOCK=infinity
should be enough to make it work (I just tested on a 5.18.16). CAP_BPF is a subset of CAP_SYS_ADMIN, so setting CAP_SYS_ADMIN on a recent kernel should not be an issue.
If it does not work we will need to test on the exact kernel provided by SLES15-SP4, as they might have done something weird in their backports.
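To check which capabilities systemd actually handed to the running process, one option is to read the CapEff mask from /proc/<pid>/status and test individual bits. The sketch below is only an illustration: the helper names has_cap and cap_eff are made up for this example, and the bit numbers are the ones defined in <linux/capability.h>. Note that it only shows what the process holds; a security module such as AppArmor or SELinux can still deny the use of a capability that appears in the mask.

```shell
#!/bin/sh
# Capability bit numbers as defined in <linux/capability.h>.
CAP_NET_BIND_SERVICE=10
CAP_SYS_ADMIN=21
CAP_BPF=39

# has_cap MASK BIT (hypothetical helper)
# Succeeds if bit number BIT is set in the hexadecimal mask MASK.
has_cap() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# cap_eff PID (hypothetical helper)
# Prints the effective capability mask of process PID as hex.
cap_eff() {
    awk '/^CapEff:/ { print $2 }' "/proc/$1/status"
}

# Usage (how you look up the pid is an assumption about your setup):
#   mask=$(cap_eff "$(pidof dnsdist)")
#   has_cap "$mask" "$CAP_BPF" && echo "CAP_BPF is in the effective set"
#   has_cap "$mask" "$CAP_SYS_ADMIN" && echo "CAP_SYS_ADMIN is in the effective set"
```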
CapabilityBoundingSet=CAP_NET_BIND_SERVICE CAP_SYS_ADMIN
AmbientCapabilities=CAP_NET_BIND_SERVICE CAP_SYS_ADMIN
LimitMEMLOCK=infinity
That's exactly what I tried, but it does not work. I also tried CAP_BPF, but I still get this error message:
Caught exception: Error creating a BPF map of size 1024: Operation not permitted
I'll have to set up a SLES 15-SP4 box to test, then. In the meantime, would you be able to confirm that issuing sysctl kernel.unprivileged_bpf_disabled=0 allows dnsdist to start? It can be reverted by either rebooting or issuing sysctl kernel.unprivileged_bpf_disabled=2 afterwards.
would you be able to confirm that issuing sysctl kernel.unprivileged_bpf_disabled=0 allows dnsdist to start?
Yes, I can confirm, with this setting dnsdist starts without errors.
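That dnsdist starts once unprivileged bpf() is allowed suggests the process is being treated as unprivileged, i.e. it does not effectively hold CAP_BPF or CAP_SYS_ADMIN when it creates the map. For reference, a small sketch of how the three values of this sysctl can be read, based on the kernel's sysctl documentation; the function name describe_unpriv_bpf is made up for this example:

```shell
#!/bin/sh
# describe_unpriv_bpf VALUE (hypothetical helper)
# Prints a short explanation of a kernel.unprivileged_bpf_disabled value.
describe_unpriv_bpf() {
    case "$1" in
        0) echo "unprivileged bpf() allowed" ;;
        1) echo "unprivileged bpf() disabled until reboot" ;;
        2) echo "unprivileged bpf() disabled, changeable at runtime" ;;
        *) echo "unknown value: $1" ;;
    esac
}

# Usage:
#   describe_unpriv_bpf "$(sysctl -n kernel.unprivileged_bpf_disabled)"
```

Setting the value to 0 is useful as a diagnostic only; it lets any local user call bpf(), so reverting to 2 afterwards, as suggested above, is the safer state.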
I have not been able to reproduce yet, I'll do more tests tomorrow:
# cat /etc/os-release
NAME="SLES"
VERSION="15-SP4"
VERSION_ID="15.4"
PRETTY_NAME="SUSE Linux Enterprise Server 15 SP4"
ID="sles"
ID_LIKE="suse"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:suse:sles:15:sp4"
DOCUMENTATION_URL="https://documentation.suse.com/"
# uname -a
[...] 5.14.21-150400.24.11-default [...]
# sysctl kernel.unprivileged_bpf_disabled
kernel.unprivileged_bpf_disabled = 2
# cat /usr/local/etc/dnsdist.conf
print("a")
bpf = newBPFFilter(1024, 1024, 1024)
print(bpf)
print('b')
Aug 08 16:46:18 ip-172-31-38-199 systemd[1]: Starting DNS Loadbalancer...
Aug 08 16:46:18 ip-172-31-38-199 dnsdist[12935]: a
Aug 08 16:46:18 ip-172-31-38-199 dnsdist[12935]: userdata 0x55e58387b5a8
Aug 08 16:46:18 ip-172-31-38-199 dnsdist[12935]: b
Aug 08 16:46:18 ip-172-31-38-199 dnsdist[12935]: Configuration '/usr/local/etc/dnsdist.conf' OK!
Aug 08 16:46:18 ip-172-31-38-199 dnsdist[12935]: Configuration '/usr/local/etc/dnsdist.conf' OK!
Aug 08 16:46:19 ip-172-31-38-199 dnsdist[12936]: a
Aug 08 16:46:19 ip-172-31-38-199 dnsdist[12936]: userdata 0x55faeae46348
Aug 08 16:46:19 ip-172-31-38-199 dnsdist[12936]: b
Aug 08 16:46:19 ip-172-31-38-199 dnsdist[12936]: Listening on 127.0.0.1:53
Aug 08 16:46:19 ip-172-31-38-199 dnsdist[12936]: dnsdist 1.7.2 comes with ABSOLUTELY NO WARRANTY. This is free software, and you are welcome to redistribute it according to the terms of the GPL version 2
Aug 08 16:46:19 ip-172-31-38-199 dnsdist[12936]: ACL allowing queries from: 10.0.0.0/8, 100.64.0.0/10, 127.0.0.0/8, 169.254.0.0/16, 172.16.0.0/12, 192.168.0.0/16, ::1/128, fc00::/7, fe80::/10
Aug 08 16:46:19 ip-172-31-38-199 dnsdist[12936]: Console ACL allowing connections from: 127.0.0.0/8, ::1/128
Aug 08 16:46:19 ip-172-31-38-199 dnsdist[12936]: No downstream servers defined: all packets will get dropped
Aug 08 16:46:19 ip-172-31-38-199 systemd[1]: Started DNS Loadbalancer.
Aug 08 16:46:19 ip-172-31-38-199 dnsdist[12936]: Polled security status of version 1.7.2 at startup, no known issues reported: OK
[Unit]
Description=DNS Loadbalancer
Documentation=man:dnsdist(1)
Documentation=https://dnsdist.org
Wants=network-online.target
After=network-online.target
[Service]
ExecStartPre=/usr/local/bin/dnsdist --check-config
# Note: when editing the ExecStart command, keep --supervised and --disable-syslog
ExecStart=/usr/local/bin/dnsdist --supervised --disable-syslog
User=dnsdist
Group=dnsdist
SyslogIdentifier=dnsdist
Type=notify
Restart=on-failure
RestartSec=2
TimeoutStopSec=5
StartLimitInterval=0
# Tuning
TasksMax=8192
LimitNOFILE=16384
# Note: increasing the amount of lockable memory is required to use eBPF support
LimitMEMLOCK=infinity
# Sandboxing
# Note: adding CAP_SYS_ADMIN (or CAP_BPF for Linux >= 5.8) is required to use eBPF support,
# and CAP_NET_RAW to be able to set the source interface to contact a backend
CapabilityBoundingSet=CAP_NET_BIND_SERVICE CAP_BPF
AmbientCapabilities=CAP_NET_BIND_SERVICE CAP_BPF
LockPersonality=true
NoNewPrivileges=true
PrivateDevices=true
PrivateTmp=true
# Setting PrivateUsers=true prevents us from opening our sockets
ProtectClock=true
ProtectControlGroups=true
ProtectHome=true
ProtectHostname=true
ProtectKernelLogs=true
ProtectKernelModules=true
ProtectKernelTunables=true
ProtectSystem=full
RestrictAddressFamilies=AF_UNIX AF_INET AF_INET6
RestrictNamespaces=true
RestrictRealtime=true
RestrictSUIDSGID=true
SystemCallArchitectures=native
SystemCallFilter=~ @clock @debug @module @mount @raw-io @reboot @swap @cpu-emulation @obsolete
[Install]
WantedBy=multi-user.target
It looks like I don't even need LimitMEMLOCK=infinity: adding CAP_BPF to both CapabilityBoundingSet and AmbientCapabilities with systemctl edit --full dnsdist, then issuing a systemctl restart dnsdist, works for me.
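As an alternative to editing the full unit with systemctl edit --full, the same two settings could be placed in a drop-in file so that the packaged unit stays untouched. This is a sketch under assumptions: the function name write_ebpf_dropin and the file name ebpf.conf are made up for this example, and the drop-in directory shown in the usage comment is the standard systemd location for a unit called dnsdist.service.

```shell
#!/bin/sh
# write_ebpf_dropin DIR (hypothetical helper)
# Writes a drop-in granting CAP_BPF in addition to CAP_NET_BIND_SERVICE.
# The file name "ebpf.conf" is an arbitrary choice for this example.
write_ebpf_dropin() {
    dir=$1
    mkdir -p "$dir" || return 1
    cat > "$dir/ebpf.conf" <<'EOF'
[Service]
CapabilityBoundingSet=CAP_NET_BIND_SERVICE CAP_BPF
AmbientCapabilities=CAP_NET_BIND_SERVICE CAP_BPF
EOF
}

# Usage (as root):
#   write_ebpf_dropin /etc/systemd/system/dnsdist.service.d
#   systemctl daemon-reload
#   systemctl restart dnsdist
```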
It turned out that AppArmor is the reason it fails here.
# cat /etc/apparmor.d/usr.sbin.dnsdist
#include <tunables/global>
/usr/sbin/dnsdist {
#include <abstractions/base>
#include <abstractions/nameservice>
capability net_bind_service,
capability setgid,
capability setuid,
network tcp,
network udp,
/etc/dnsdist/** r,
@{PROC}/@{pid}/** r,
# Site-specific additions and overrides. See local/README for details.
#include <local/usr.sbin.dnsdist>
}
# cat /etc/apparmor.d/local/usr.sbin.dnsdist
/etc/dnscrypt/** rw,
/home/cert/dnscrypt/** rw,
But I don't know what is missing here.
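One way to find out what a profile is blocking is to look for AppArmor denials in the kernel log. A minimal sketch, assuming denials are logged with the usual apparmor="DENIED" marker and that the profile or command name appears on the same line; the function name apparmor_denials is made up for this example:

```shell
#!/bin/sh
# apparmor_denials NAME (hypothetical helper)
# Reads log lines on stdin and prints the AppArmor denials mentioning NAME.
apparmor_denials() {
    grep -F 'apparmor="DENIED"' | grep -F -- "$1"
}

# Usage (as root; which log source carries the audit messages depends
# on the system, dmesg is only one possibility):
#   dmesg | apparmor_denials dnsdist
```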
echo "capability bpf," >> /etc/apparmor.d/local/usr.sbin.dnsdist
systemctl restart apparmor.service
did the trick. Now it works with your suggestions.
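Since echo ... >> appends a new line every time it is run, a provisioning script might prefer an idempotent variant. The following is a sketch only; the helper name ensure_line is made up for this example, and the path is the one from the commands above.

```shell
#!/bin/sh
# ensure_line FILE LINE (hypothetical helper)
# Appends LINE to FILE unless an identical line is already present.
ensure_line() {
    file=$1
    line=$2
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Usage (as root):
#   ensure_line /etc/apparmor.d/local/usr.sbin.dnsdist "capability bpf,"
#   systemctl restart apparmor.service
```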
Oh, I did not suspect AppArmor, well done!
Is that AppArmor policy an internal one, or can we submit a patch to it?
It is internal, but it was inspired by an OBS project. I can try to send the maintainer a note about it.
Will the unit file get an update as well?
I'm not sure I want to grant more privileges by default, since most people are not using eBPF, but I might be wrong. In any case I am going to add a mention about AppArmor in the documentation and the unit file!
In any case I am going to add a mention about AppArmor in the documentation and the unit file!
Yes, that should be sufficient.
Thanks a lot for the feedback, Winfried, much appreciated!
Closing since #11839 has been merged.