dnsdist memory leak dynamic backend
- Program: dnsdist
- Issue type: Bug report
Short description
Memory usage increases over time. We use dynamic backend discovery based on k8s DNS. Config:
--EDNS client subnet setting
setECSSourcePrefixV4(32)
setECSSourcePrefixV6(128)
-- regular requests
setLocal("0.0.0.0:53", {reusePort=true})
setLocal("0.0.0.0:53", {reusePort=true})
setLocal("0.0.0.0:53", {reusePort=true})
setLocal("0.0.0.0:53", {reusePort=true})
-- rpz data
addLocal("0.0.0.0:5353")
setACL("0.0.0.0/0")
setServerPolicy(roundrobin)
resolver = require 'dnsdist-resolver'
resolver.servers['bind-worker-a-headless.dnscert.svc.cluster.local'] = {pool='local', useProxyProtocol=true, checkName="blocked.domain"}
resolver.servers['bind-worker-b-headless.dnscert.svc.cluster.local'] = {pool='remote', useProxyProtocol=true, checkName="blocked.domain"}
resolver.servers['bind-master-headless.dnscert.svc.cluster.local'] = {pool='rpz', useProxyProtocol=true, useClientSubnet=true, tcpOnly=true, healthCheckMode="up"}
maintenance = resolver.maintenance
webserver("0.0.0.0:8083")
setWebserverConfig({password="XXX", apiKey="", acl="0.0.0.0/0", statsRequireAuthentication=false})
setAPIWritable(false)
controlSocket("127.0.0.1:5199")
setKey("XXX")
-- disable ANY queries
addAction(QTypeRule(DNSQType.ANY), RCodeAction(DNSRCode.REFUSED))
-- disable caching Notify
addAction(OpcodeRule(DNSOpcode.Notify), SetSkipCacheAction())
-- route to rpz pool
addAction(DSTPortRule(5353), PoolAction("rpz"))
-- route to local pool
addAction(PoolAvailableRule("local"), PoolAction("local"))
-- route to remote pool
addAction(AllRule(), PoolAction("remote"))
Environment
- Kubernetes version: 1.30.11
- Software version: 1.9.8
- Software source: PowerDNS repository (docker image)
Steps to reproduce
Run dnsdist with the specified config in a k8s environment.
Expected behaviour
More or less constant memory consumption while the load is constant.
Actual behaviour
screenshot from k8s grafana pod memory usage
Other information
Nothing
Hi! I have tried to reproduce this behaviour in a non-k8s environment, as it is hard to investigate in k8s. I'm not quite sure I have observed the exact same issue you have, but I found one related to memory not being efficiently released when IP addresses change quickly, and proposed a fix in https://github.com/PowerDNS/pdns/pull/15472. The fix is fairly simple and easy to test (no need to recompile anything), so it would be very much appreciated if you could test it in your environment and let me know whether it makes any difference.
Hi, I modified dnsdist-resolver.lua script as you suggested, no result. Memory is still leaking.
Thanks for testing! I'm a bit out of ideas on how to reproduce the issue, since it does not happen in my environment. To try to narrow it down, can you confirm that it does not happen without dynamic backends?
The backends' addresses change quite rarely: only when k8s changes the number of pods because of the horizontal-pod-autoscaler, or when we manually redeploy the backends. The screenshot I provided was taken while the backend addresses were constant.
Right, but do you see the same issue if you use static IP addresses for backends, instead of using the dnsdist-resolver script? I realize this is probably not very practical, but it would be useful to rule that out.
Note that until the in-memory ring buffers (and the packet caches if you have any, which I don't see in your configuration) reach their maximum size, the memory usage growing might not be a leak. If you don't use the live inspection features, or the dynamic blocks, you can reduce the size of the rings via https://www.dnsdist.org/reference/config.html#setRingBuffersSize
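For illustration, a minimal sketch of that setting in the dnsdist configuration, assuming neither the live inspection features nor dynamic blocks are in use. The value 1000 is an arbitrary example, not a recommendation; the documented default is 10000 entries.

```lua
-- Assumption: live inspection (e.g. grepq(), topQueries()) and dynamic
-- blocks are not used, so a small ring buffer is acceptable.
-- 1000 is an illustrative value only; the documented default is 10000.
setRingBuffersSize(1000)
```

With smaller rings, the point at which memory usage should plateau is reached sooner, which makes it easier to tell normal buffer filling from an actual leak.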
We temporarily changed the config so that there are no dynamic backends, and the leak is gone. Memory consumption is now stable. Current config:
--EDNS client subnet setting
setECSSourcePrefixV4(32)
setECSSourcePrefixV6(128)
-- regular requests
setLocal("0.0.0.0:53", {reusePort=true})
setLocal("0.0.0.0:53", {reusePort=true})
setLocal("0.0.0.0:53", {reusePort=true})
setLocal("0.0.0.0:53", {reusePort=true})
-- zone transfer
addLocal("0.0.0.0:5353")
setACL("0.0.0.0/0")
setServerPolicy(roundrobin)
newServer({name="L01", address="10.0.4.84", pool="local", useProxyProtocol=true, checkName="blocked.domain"})
newServer({name="L02", address="10.0.8.155", pool="local", useProxyProtocol=true, checkName="blocked.domain"})
newServer({name="L03", address="10.0.6.43", pool="local", useProxyProtocol=true, checkName="blocked.domain"})
newServer({name="L04", address="10.0.4.50", pool="local", useProxyProtocol=true, checkName="blocked.domain"})
newServer({name="R01", address="10.0.7.229", pool="remote", useProxyProtocol=true, checkName="blocked.domain"})
newServer({name="R02", address="10.0.3.1", pool="remote", useProxyProtocol=true, checkName="blocked.domain"})
newServer({name="R03", address="10.0.5.210", pool="remote", useProxyProtocol=true, checkName="blocked.domain"})
newServer({name="R04", address="10.0.3.114", pool="remote", useProxyProtocol=true, checkName="blocked.domain"})
newServer({name="RPZ01", healthCheckMode="up", address="10.0.4.203", pool="rpz", useProxyProtocol=true, useClientSubnet=true, tcpOnly=true, healthCheckMode="up"})
newServer({name="RPZ02", healthCheckMode="up", address="10.0.3.199", pool="rpz", useProxyProtocol=true, useClientSubnet=true, tcpOnly=true, healthCheckMode="up"})
webserver("0.0.0.0:8083")
-- disable ANY queries
addAction(QTypeRule(DNSQType.ANY), RCodeAction(DNSRCode.REFUSED))
-- disable caching Notify
addAction(OpcodeRule(DNSOpcode.Notify), SetSkipCacheAction())
-- route to rpz pool
addAction(DSTPortRule(5353), PoolAction("rpz"))
-- route to local pool
addAction(PoolAvailableRule("local"), PoolAction("local"))
-- route to remote pool
addAction(AllRule(), PoolAction("remote"))
Looks like the problem is somewhere in the Lua script.
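One way to check whether the growth is on the Lua side would be to log the size of the Lua heap from the maintenance hook. This is a hypothetical diagnostic sketch, not something that was run in this thread: the wrapper function and the `ticks` counter are made up for illustration, and it assumes `resolver.maintenance` can be called without arguments, as in the config above. It only measures memory managed by the Lua interpreter, so allocations made by dnsdist's C++ code will not show up.

```lua
resolver = require 'dnsdist-resolver'
-- resolver.servers[...] entries as in the config above

-- Hypothetical counter, used only to limit how often we log
local ticks = 0

-- Replaces "maintenance = resolver.maintenance": still runs the resolver's
-- maintenance step, and additionally logs the Lua heap size every 60th call
function maintenance()
  resolver.maintenance()
  ticks = ticks + 1
  if ticks % 60 == 0 then
    -- collectgarbage("count") returns the Lua heap size in kilobytes
    infolog(string.format("Lua heap size: %.1f KB", collectgarbage("count")))
  end
end
```

If the logged value stays flat while the pod's memory keeps growing, the growth is more likely outside the Lua heap.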
Thanks a lot for testing, this is very much appreciated! Any chance you could test against the latest master, as last week we fixed a memory corruption in the getAddressInfo function used by the Lua script: https://github.com/PowerDNS/pdns/pull/15514 ?
You mean 1.9.9 version? Is there any docker image for it?
No, I meant https://hub.docker.com/r/powerdns/dnsdist-master but please note that this is not a stable branch so I would not recommend using it in production.
We can't run the image you advised; we got the following error: "/usr/include/boost/optional/optional.hpp:1212: boost::optional::reference_type boost::optional<std::basic_string
I'm afraid I cannot reproduce that error with the configuration you shared earlier.
The error I mentioned is caused by the presence of the following lines in the config:
controlSocket("127.0.0.1:5199")
setKey("XXX")
I removed them, but then the next error popped up: "/usr/local/bin/dnsdist-resolver: not-found". We looked inside the pod filesystem, and the file was indeed missing. We took the file from an existing running image and mapped it into the dnsdist-master image, but ran into a Python dependency issue:
Can you check if the image has been correctly built?
I don't get it:
- I cannot reproduce the first issue by adding the controlSocket/setKey lines
- the dnsdist-resolver.lua shipped in the Docker image no longer uses the /usr/local/bin/dnsdist-resolver Python script, which has even been removed from the repository, so I don't understand why you are getting the second error
My test is done by:
- starting the docker service: systemctl start docker
- pulling the latest image: docker pull powerdns/dnsdist-master
- writing a configuration file (see below) to /tmp/dnsdist.conf.15446
- running: docker run -v /tmp/dnsdist.conf.15446:/etc/dnsdist/dnsdist.conf -it powerdns/dnsdist-master
The output I get is:
$ docker run -v /tmp/dnsdist.conf.15446:/etc/dnsdist/dnsdist.conf -it powerdns/dnsdist-master
dnsdist 0.0.31383.0.master.g8b832d01d6 comes with ABSOLUTELY NO WARRANTY. This is free software, and you are welcome to redistribute it according to the terms of the GPL version 2
Raised send buffer to 212992 for local address '127.0.0.1:8053'
Raised receive buffer to 212992 for local address '127.0.0.1:8053'
Listening on 127.0.0.1:8053
Raised send buffer to 212992 for local address '127.0.0.1:5353'
Raised receive buffer to 212992 for local address '127.0.0.1:5353'
Listening on 127.0.0.1:5353
ACL allowing queries from: 10.0.0.0/8, 100.64.0.0/10, 127.0.0.0/8, 169.254.0.0/16, 172.16.0.0/12, 192.168.0.0/16, ::1/128, fc00::/7, fe80::/10
Console ACL allowing connections from: 127.0.0.0/8, ::1/128
Adding TCP Client thread
Adding TCP Client thread
Adding TCP Client thread
Adding TCP Client thread
Adding TCP Client thread
Adding TCP Client thread
Adding TCP Client thread
Adding TCP Client thread
Adding TCP Client thread
Adding TCP Client thread
Adding DoH Client thread
No downstream servers defined: all packets will get dropped
Warning, this configuration can use more than 10044 file descriptors, web server and console connections not included, and the current limit is 1024.
You can increase this value by using ulimit.
Accepting control connections on 127.0.0.1:5199
Response code 'Non-Existent domain' received from the secpoll stub resolver 9.9.9.9 for 'dnsdist-0.0.31383.0.master.g8b832d01d6.security-status.secpoll.powerdns.com.'
Error while retrieving the security update for version dnsdist-0.0.31383.0.master.g8b832d01d6: Unable to get a valid Security Status update, domain does not exist
Not validating response for security status update, this is a non-release version.
Added downstream server 8.8.8.8:53
Creating pool remote
Adding server to pool remote
Added downstream server 8.8.4.4:53
Adding server to pool remote
Added downstream server 1.0.0.1:53
Creating pool rpz
Adding server to pool rpz
Added downstream server 1.1.1.1:53
Adding server to pool rpz
Added downstream server 9.9.9.9:53
Creating pool local
Adding server to pool local
Added downstream server 149.112.112.112:53
Adding server to pool local
Marking downstream dns.quad9.net#9.9.9.9 (9.9.9.9:53) as 'up'
Marking downstream dns.quad9.net#149.112.112.112 (149.112.112.112:53) as 'up'
Marking downstream dns.google#8.8.8.8 (8.8.8.8:53) as 'up'
Marking downstream dns.google#8.8.4.4 (8.8.4.4:53) as 'up'
Content of /tmp/dnsdist.conf.15446:
controlSocket("127.0.0.1:5199")
setKey("XXX")
--EDNS client subnet setting
setECSSourcePrefixV4(32)
setECSSourcePrefixV6(128)
-- regular requests
setLocal("127.0.0.1:8053", {reusePort=true})
setLocal("127.0.0.1:8053", {reusePort=true})
setLocal("127.0.0.1:8053", {reusePort=true})
setLocal("127.0.0.1:8053", {reusePort=true})
-- rpz data
addLocal("127.0.0.1:5353")
--setACL("0.0.0.0/0")
setServerPolicy(roundrobin)
resolver = require 'dnsdist-resolver'
resolver.servers['dns.quad9.net'] = {pool='local'}
resolver.servers['dns.google'] = {pool='remote'}
resolver.servers['one.one.one.one'] = {pool='rpz', useClientSubnet=true, tcpOnly=true, healthCheckMode="up"}
maintenance = resolver.maintenance
-- disable ANY queries
addAction(QTypeRule(DNSQType.ANY), RCodeAction(DNSRCode.REFUSED))
-- disable caching Notify
addAction(OpcodeRule(DNSOpcode.Notify), SetSkipCacheAction())
-- route to rpz pool
addAction(DSTPortRule(5353), PoolAction("rpz"))
-- route to local pool
addAction(PoolAvailableRule("local"), PoolAction("local"))
-- route to remote pool
addAction(AllRule(), PoolAction("remote"))
setVerbose(true)
We managed to run the image, and it no longer complains about the Python file (we don't know what the reason was). We found out that the reason why it refused to run was "setAPIWritable(false)" config directive. The error we used to get is:
dnsdist: /usr/include/boost/optional/optional.hpp:1212: boost::optional::reference_type boost::optional<std::basic_string<char>>::get() [T = std::basic_string<char>]: Assertion `this->is_initialized()' failed.
For now it's not important, we can skip this config directive. We'll see how it goes and let you know, thanks.
We found out that the reason why it refused to run was "setAPIWritable(false)" config directive.
Ah, let me see if I can reproduce that, as it would be a bug.
Reproduced, investigating.
https://github.com/PowerDNS/pdns/pull/15539
We see it has finally stopped leaking, memory consumption is stable :)
Which stable version can we expect these changes to be merged into?
Great news, thanks! The mentioned fixes will be backported in 1.9.10, which should be released in a couple weeks.
Wonderful, greatly appreciate your help :)
We deployed dnsdist version 1.9.10, but it looks like the fixes haven't been backported to that version: the memory leak is still there :(
It has (https://github.com/PowerDNS/pdns/pull/15519) so if you are still experiencing a leak it might have been a different problem in the first place, but then I don't see how running the master branch helped. Can you confirm that the leak is happening on 1.9.10 but not on master?
Yes, that's true. We have deployments for these two versions, they run in parallel. Memory usage for dnsdist 1.9.10:
Memory usage for dnsdist/master:
Greetings. Should we open a new issue?
I just re-opened this one. What bothers me is that I still haven't found a way to reproduce this, so I can't really fix it.
We tried to run the latest master build to get rid of the leak, but it looks like some changes broke Lua script processing:
dnsdist 0.0.31611.0.master.gd5b5916db1 comes with ABSOLUTELY NO WARRANTY. This is free software, and you are welcome to redistribute it according to the terms of the GPL version 2
Passing a plain-text password via the 'password' parameter to 'setWebserverConfig()' is not advised, please consider generating a hashed one using 'hashPassword()' instead.
Passing a plain-text API key via the 'apiKey' parameter to 'setWebserverConfig()' is not advised, please consider generating a hashed one using 'hashPassword()' instead.
sh: 1: /usr/local/bin/dnsdist-resolver: not found
Listening on 0.0.0.0:53
Listening on 0.0.0.0:5353
ACL allowing queries from: 0.0.0.0/0
Console ACL allowing connections from: 127.0.0.0/8, ::1/128
No downstream servers defined: all packets will get dropped
Accepting control connections on 127.0.0.1:5199
Webserver launched on 0.0.0.0:8083
Error while retrieving the security update for version dnsdist-0.0.31611.0.master.gd5b5916db1: Unable to get a valid Security Status update, domain does not exist
Not validating response for security status update, this is a non-release version.
Error during execution of maintenance function(s): ./dnsdist-resolver.lua:89: attempt to call local 'resout' (a nil value)
stack traceback:
./dnsdist-resolver.lua:89: in function <./dnsdist-resolver.lua:73>
Error during execution of maintenance function(s): ./dnsdist-resolver.lua:89: attempt to call local 'resout' (a nil value)
stack traceback:
./dnsdist-resolver.lua:89: in function <./dnsdist-resolver.lua:73>
This does not match the lines from the dockerdata/dnsdist-resolver.lua of the current master branch, which version are you using?
It turned out that we had mapped dnsdist-resolver.lua into the container from outside (we had made some changes to the script, as discussed at the beginning of the issue). I removed the mapping and now it starts OK; we will see how it goes...
dnsdist-master image doesn't leak, whereas dnsdist-19:1.9.10 is still leaking :(