dnsdist memory leak dynamic backend

Open bondar-aleksandr opened this issue 8 months ago • 30 comments

  • Program: dnsdist
  • Issue type: Bug report

Short description

Memory usage increases over time. We use dynamic backend discovery based on the k8s DNS config:

--EDNS client subnet setting
setECSSourcePrefixV4(32)
setECSSourcePrefixV6(128)

-- regular requests
setLocal("0.0.0.0:53", {reusePort=true})
setLocal("0.0.0.0:53", {reusePort=true})
setLocal("0.0.0.0:53", {reusePort=true})
setLocal("0.0.0.0:53", {reusePort=true})

-- rpz data
addLocal("0.0.0.0:5353")

setACL("0.0.0.0/0")
setServerPolicy(roundrobin)

resolver = require 'dnsdist-resolver'
resolver.servers['bind-worker-a-headless.dnscert.svc.cluster.local'] = {pool='local', useProxyProtocol=true, checkName="blocked.domain"}
resolver.servers['bind-worker-b-headless.dnscert.svc.cluster.local'] = {pool='remote', useProxyProtocol=true, checkName="blocked.domain"}
resolver.servers['bind-master-headless.dnscert.svc.cluster.local'] = {pool='rpz', useProxyProtocol=true, useClientSubnet=true, tcpOnly=true, healthCheckMode="up"}

maintenance = resolver.maintenance

webserver("0.0.0.0:8083")
setWebserverConfig({password="XXX", apiKey="", acl="0.0.0.0/0", statsRequireAuthentication=false})
setAPIWritable(false)
controlSocket("127.0.0.1:5199")
setKey("XXX") 

-- disable ANY queries
addAction(QTypeRule(DNSQType.ANY), RCodeAction(DNSRCode.REFUSED))

-- disable caching Notify
addAction(OpcodeRule(DNSOpcode.Notify), SetSkipCacheAction())

-- route to rpz pool
addAction(DSTPortRule(5353), PoolAction("rpz"))

-- route to local pool
addAction(PoolAvailableRule("local"), PoolAction("local"))

-- route to remote pool
addAction(AllRule(), PoolAction("remote"))
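
A side note on the listener lines in the config above, offered as a hedged sketch rather than a diagnosis: according to the dnsdist documentation, setLocal() replaces the list of listen addresses, so calling it four times leaves a single listener on port 53. If several reusePort listeners on the same address are intended, the usual pattern is one setLocal() followed by addLocal() calls. Nothing in this report suggests that this is related to the memory growth.

```lua
-- Sketch only: one setLocal() to reset the listener list, then addLocal()
-- for each additional listener sharing the port via SO_REUSEPORT.
setLocal("0.0.0.0:53", {reusePort=true})
addLocal("0.0.0.0:53", {reusePort=true})
addLocal("0.0.0.0:53", {reusePort=true})
addLocal("0.0.0.0:53", {reusePort=true})
```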

Environment

kubernetes version 1.30.11

  • Software version: 1.9.8
  • Software source: PowerDNS repository (docker image)

Steps to reproduce

Run dnsdist with the specified config in a k8s environment.

Expected behaviour

More or less constant memory consumption while the load is constant.

Actual behaviour

screenshot from k8s grafana pod memory usage

Image

Other information

Nothing

bondar-aleksandr avatar Apr 17 '25 12:04 bondar-aleksandr

Hi! I have tried to reproduce this behaviour in a non-k8s environment, as it is hard to investigate in k8s. I'm not quite sure I have observed the exact same issue you have, but I found one related to memory not being efficiently released when IP addresses change quickly, and proposed a fix in https://github.com/PowerDNS/pdns/pull/15472. The fix is fairly simple and easy to test (no need to recompile anything), so it would be very much appreciated if you could test it in your environment and let me know whether it makes any difference.

rgacogne avatar Apr 25 '25 12:04 rgacogne

Hi, I modified the dnsdist-resolver.lua script as you suggested, with no result. Memory is still leaking.

Image

bondar-aleksandr avatar Apr 29 '25 08:04 bondar-aleksandr

Thanks for testing! I'm a bit out of ideas on how to reproduce the issue, since it does not happen in my environment. To try to narrow it down, can you confirm that it does not happen without dynamic backends?

rgacogne avatar Apr 29 '25 09:04 rgacogne

Backend addresses change quite rarely: only when k8s decides to change the number of pods because of the horizontal-pod-autoscaler, or when we manually redeploy the backends. The screenshot I provided was taken while the backend addresses were constant.

bondar-aleksandr avatar Apr 29 '25 11:04 bondar-aleksandr

Right, but do you see the same issue if you use static IP addresses for backends, instead of using the dnsdist-resolver script? I realize this is probably not very practical, but it would be useful to rule that out.

rgacogne avatar Apr 29 '25 11:04 rgacogne

Note that until the in-memory ring buffers (and the packet caches if you have any, which I don't see in your configuration) reach their maximum size, the memory usage growing might not be a leak. If you don't use the live inspection features, or the dynamic blocks, you can reduce the size of the rings via https://www.dnsdist.org/reference/config.html#setRingBuffersSize
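
As a minimal sketch of that setting (the numbers are placeholders rather than recommendations; the first argument is the number of entries and the optional second one the number of shards):

```lua
-- Shrink the in-memory ring buffers; only sensible if live inspection
-- and dynamic blocks are not used. Values are illustrative.
setRingBuffersSize(1000, 1)
```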

rgacogne avatar Apr 29 '25 13:04 rgacogne

We temporarily changed the config so that there are no dynamic backends, and the leak is gone. Memory consumption is now stable. Current config:

    --EDNS client subnet setting
    setECSSourcePrefixV4(32)
    setECSSourcePrefixV6(128)

    -- regular requests
    setLocal("0.0.0.0:53", {reusePort=true})
    setLocal("0.0.0.0:53", {reusePort=true})
    setLocal("0.0.0.0:53", {reusePort=true})
    setLocal("0.0.0.0:53", {reusePort=true})

    -- zone transfer
    addLocal("0.0.0.0:5353")

    setACL("0.0.0.0/0")
    setServerPolicy(roundrobin)


    newServer({name="L01", address="10.0.4.84", pool="local", useProxyProtocol=true, checkName="blocked.domain"})
    newServer({name="L02", address="10.0.8.155", pool="local", useProxyProtocol=true, checkName="blocked.domain"})
    newServer({name="L03", address="10.0.6.43", pool="local", useProxyProtocol=true, checkName="blocked.domain"})
    newServer({name="L04", address="10.0.4.50", pool="local", useProxyProtocol=true, checkName="blocked.domain"})

    newServer({name="R01", address="10.0.7.229", pool="remote", useProxyProtocol=true, checkName="blocked.domain"})
    newServer({name="R02", address="10.0.3.1", pool="remote", useProxyProtocol=true, checkName="blocked.domain"})
    newServer({name="R03", address="10.0.5.210", pool="remote", useProxyProtocol=true, checkName="blocked.domain"})
    newServer({name="R04", address="10.0.3.114", pool="remote", useProxyProtocol=true, checkName="blocked.domain"})

    newServer({name="RPZ01", address="10.0.4.203", pool="rpz", useProxyProtocol=true, useClientSubnet=true, tcpOnly=true, healthCheckMode="up"})
    newServer({name="RPZ02", address="10.0.3.199", pool="rpz", useProxyProtocol=true, useClientSubnet=true, tcpOnly=true, healthCheckMode="up"})


    webserver("0.0.0.0:8083")

    -- disable ANY queries
    addAction(QTypeRule(DNSQType.ANY), RCodeAction(DNSRCode.REFUSED))

    -- disable caching Notify
    addAction(OpcodeRule(DNSOpcode.Notify), SetSkipCacheAction())

    -- route to rpz pool
    addAction(DSTPortRule(5353), PoolAction("rpz"))

    -- route to local pool
    addAction(PoolAvailableRule("local"), PoolAction("local"))

    -- route to remote pool
    addAction(AllRule(), PoolAction("remote"))

Image

Looks like the problem is somewhere in the Lua script.

bondar-aleksandr avatar May 05 '25 14:05 bondar-aleksandr

Thanks a lot for testing, this is very much appreciated! Any chance you could test against the latest master, as last week we fixed a memory corruption in the getAddressInfo function used by the Lua script: https://github.com/PowerDNS/pdns/pull/15514 ?
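
For context, a rough sketch of how getAddressInfo can be used from the configuration, assuming the signature documented for dnsdist 1.9 (a hostname plus a callback that receives the hostname and a list of addresses). The callback name is made up for this example, and this is not the code of dnsdist-resolver.lua:

```lua
-- Hypothetical callback: invoked once the system resolver has answered.
local function onResolved(hostname, ips)
  for _, ip in ipairs(ips) do
    -- each entry is a ComboAddress; toString() gives its textual form
    infolog(hostname .. " resolved to " .. ip:toString())
  end
end

getAddressInfo("dns.quad9.net.", onResolved)
```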

rgacogne avatar May 05 '25 14:05 rgacogne

You mean 1.9.9 version? Is there any docker image for it?

bondar-aleksandr avatar May 05 '25 15:05 bondar-aleksandr

No, I meant https://hub.docker.com/r/powerdns/dnsdist-master but please note that this is not a stable branch so I would not recommend using it in production.

rgacogne avatar May 05 '25 15:05 rgacogne

We can't run the image you advised; we got the following error: "/usr/include/boost/optional/optional.hpp:1212: boost::optional::reference_type boost::optional<std::basic_string<char>>::get() [T = std::basic_string<char>]: Assertion `this->is_initialized()' failed."

bondar-aleksandr avatar May 06 '25 10:05 bondar-aleksandr

I'm afraid I cannot reproduce that error with the configuration you shared earlier.

rgacogne avatar May 06 '25 10:05 rgacogne

The error I mentioned is caused by the presence of the following lines in the config:

controlSocket("127.0.0.1:5199")
setKey("XXX") 

I removed them, but then the next error popped up: "/usr/local/bin/dnsdist-resolver: not found". We looked inside the pod filesystem, and the file was indeed missing. We took the file from an existing running image and mapped it into the dnsdist-master image, but then ran into a Python dependency issue:

Image

Can you check whether the image has been built correctly?

bondar-aleksandr avatar May 07 '25 07:05 bondar-aleksandr

I don't get it:

  • I cannot reproduce the first issue by adding the controlSocket/setKey lines
  • the dnsdist-resolver.lua shipped in the Docker image no longer uses the /usr/local/bin/dnsdist-resolver Python script, which has even been removed from the repository, so I don't understand why you are getting the second error

My test is done by:

  • starting the docker service: systemctl start docker
  • pulling the latest image: docker pull powerdns/dnsdist-master
  • writing a configuration file (see below) to /tmp/dnsdist.conf.15446
  • running: docker run -v /tmp/dnsdist.conf.15446:/etc/dnsdist/dnsdist.conf -it powerdns/dnsdist-master

The output I get is:

$ docker run -v /tmp/dnsdist.conf.15446:/etc/dnsdist/dnsdist.conf -it powerdns/dnsdist-master
dnsdist 0.0.31383.0.master.g8b832d01d6 comes with ABSOLUTELY NO WARRANTY. This is free software, and you are welcome to redistribute it according to the terms of the GPL version 2
Raised send buffer to 212992 for local address '127.0.0.1:8053'
Raised receive buffer to 212992 for local address '127.0.0.1:8053'
Listening on 127.0.0.1:8053
Raised send buffer to 212992 for local address '127.0.0.1:5353'
Raised receive buffer to 212992 for local address '127.0.0.1:5353'
Listening on 127.0.0.1:5353
ACL allowing queries from: 10.0.0.0/8, 100.64.0.0/10, 127.0.0.0/8, 169.254.0.0/16, 172.16.0.0/12, 192.168.0.0/16, ::1/128, fc00::/7, fe80::/10
Console ACL allowing connections from: 127.0.0.0/8, ::1/128
Adding TCP Client thread
Adding TCP Client thread
Adding TCP Client thread
Adding TCP Client thread
Adding TCP Client thread
Adding TCP Client thread
Adding TCP Client thread
Adding TCP Client thread
Adding TCP Client thread
Adding TCP Client thread
Adding DoH Client thread
No downstream servers defined: all packets will get dropped
Warning, this configuration can use more than 10044 file descriptors, web server and console connections not included, and the current limit is 1024.
You can increase this value by using ulimit.
Accepting control connections on 127.0.0.1:5199
Response code 'Non-Existent domain' received from the secpoll stub resolver 9.9.9.9 for 'dnsdist-0.0.31383.0.master.g8b832d01d6.security-status.secpoll.powerdns.com.'
Error while retrieving the security update for version dnsdist-0.0.31383.0.master.g8b832d01d6: Unable to get a valid Security Status update, domain does not exist
Not validating response for security status update, this is a non-release version.
Added downstream server 8.8.8.8:53
Creating pool remote
Adding server to pool remote
Added downstream server 8.8.4.4:53
Adding server to pool remote
Added downstream server 1.0.0.1:53
Creating pool rpz
Adding server to pool rpz
Added downstream server 1.1.1.1:53
Adding server to pool rpz
Added downstream server 9.9.9.9:53
Creating pool local
Adding server to pool local
Added downstream server 149.112.112.112:53
Adding server to pool local
Marking downstream dns.quad9.net#9.9.9.9 (9.9.9.9:53) as 'up'
Marking downstream dns.quad9.net#149.112.112.112 (149.112.112.112:53) as 'up'
Marking downstream dns.google#8.8.8.8 (8.8.8.8:53) as 'up'
Marking downstream dns.google#8.8.4.4 (8.8.4.4:53) as 'up'

Content of /tmp/dnsdist.conf.15446:

controlSocket("127.0.0.1:5199")
setKey("XXX")

--EDNS client subnet setting
setECSSourcePrefixV4(32)
setECSSourcePrefixV6(128)

-- regular requests
setLocal("127.0.0.1:8053", {reusePort=true})
setLocal("127.0.0.1:8053", {reusePort=true})
setLocal("127.0.0.1:8053", {reusePort=true})
setLocal("127.0.0.1:8053", {reusePort=true})

-- rpz data
addLocal("127.0.0.1:5353")

--setACL("0.0.0.0/0")
setServerPolicy(roundrobin)

resolver = require 'dnsdist-resolver'
resolver.servers['dns.quad9.net'] = {pool='local'}
resolver.servers['dns.google'] = {pool='remote'}
resolver.servers['one.one.one.one'] = {pool='rpz', useClientSubnet=true, tcpOnly=true, healthCheckMode="up"}

maintenance = resolver.maintenance

-- disable ANY queries
addAction(QTypeRule(DNSQType.ANY), RCodeAction(DNSRCode.REFUSED))

-- disable caching Notify
addAction(OpcodeRule(DNSOpcode.Notify), SetSkipCacheAction())

-- route to rpz pool
addAction(DSTPortRule(5353), PoolAction("rpz"))

-- route to local pool
addAction(PoolAvailableRule("local"), PoolAction("local"))

-- route to remote pool
addAction(AllRule(), PoolAction("remote"))

setVerbose(true)

rgacogne avatar May 07 '25 09:05 rgacogne

We managed to run the image; it no longer complains about the Python file (we don't know what the reason was). We found out that the reason why it refused to run was "setAPIWritable(false)" config directive. The error we used to get is:

dnsdist: /usr/include/boost/optional/optional.hpp:1212: boost::optional::reference_type boost::optional<std::basic_string<char>>::get() [T = std::basic_string<char>]: Assertion `this->is_initialized()' failed.

For now it's not important, we can skip this config directive. We'll see how it goes and let you know, thanks.

bondar-aleksandr avatar May 07 '25 12:05 bondar-aleksandr

We found out that the reason why it refused to run was "setAPIWritable(false)" config directive.

Ah, let me see if I can reproduce that, as it would be a bug.

rgacogne avatar May 07 '25 12:05 rgacogne

Reproduced, investigating.

rgacogne avatar May 07 '25 13:05 rgacogne

https://github.com/PowerDNS/pdns/pull/15539

rgacogne avatar May 07 '25 13:05 rgacogne

We see it has finally stopped leaking, memory consumption is stable :)

Image

Which stable version can we expect these changes to be merged into?

bondar-aleksandr avatar May 07 '25 15:05 bondar-aleksandr

Great news, thanks! The mentioned fixes will be backported in 1.9.10, which should be released in a couple weeks.

rgacogne avatar May 07 '25 15:05 rgacogne

Wonderful, greatly appreciate your help :)

bondar-aleksandr avatar May 07 '25 15:05 bondar-aleksandr

We deployed dnsdist version 1.9.10. It looks like the fixes haven't been backported to that version; the memory leak is still there :(

bondar-aleksandr avatar May 22 '25 10:05 bondar-aleksandr

It has (https://github.com/PowerDNS/pdns/pull/15519) so if you are still experiencing a leak it might have been a different problem in the first place, but then I don't see how running the master branch helped. Can you confirm that the leak is happening on 1.9.10 but not on master?

rgacogne avatar May 22 '25 10:05 rgacogne

Yes, that's true. We have deployments for these two versions, they run in parallel. Memory usage for dnsdist 1.9.10:

Image

Memory usage for dnsdist/master:

Image

bondar-aleksandr avatar May 22 '25 11:05 bondar-aleksandr

Greetings, should we open a new issue?

bondar-aleksandr avatar May 27 '25 09:05 bondar-aleksandr

I just re-opened this one. What bothers me is that I still haven't found a way to reproduce this, so I can't really fix it.

rgacogne avatar May 27 '25 10:05 rgacogne

We tried to run the latest master build to get rid of the leak, but it looks like some changes broke the Lua script processing:

dnsdist 0.0.31611.0.master.gd5b5916db1 comes with ABSOLUTELY NO WARRANTY. This is free software, and you are welcome to redistribute it according to the terms of the GPL version 2
Passing a plain-text password via the 'password' parameter to 'setWebserverConfig()' is not advised, please consider generating a hashed one using 'hashPassword()' instead.
Passing a plain-text API key via the 'apiKey' parameter to 'setWebserverConfig()' is not advised, please consider generating a hashed one using 'hashPassword()' instead.
sh: 1: /usr/local/bin/dnsdist-resolver: not found
Listening on 0.0.0.0:53
Listening on 0.0.0.0:5353
ACL allowing queries from: 0.0.0.0/0
Console ACL allowing connections from: 127.0.0.0/8, ::1/128
No downstream servers defined: all packets will get dropped
Accepting control connections on 127.0.0.1:5199
Webserver launched on 0.0.0.0:8083
Error while retrieving the security update for version dnsdist-0.0.31611.0.master.gd5b5916db1: Unable to get a valid Security Status update, domain does not exist
Not validating response for security status update, this is a non-release version.
Error during execution of maintenance function(s): ./dnsdist-resolver.lua:89: attempt to call local 'resout' (a nil value)
stack traceback:
	./dnsdist-resolver.lua:89: in function <./dnsdist-resolver.lua:73>

bondar-aleksandr avatar Jun 10 '25 11:06 bondar-aleksandr

Error during execution of maintenance function(s): ./dnsdist-resolver.lua:89: attempt to call local 'resout' (a nil value)
stack traceback:
	./dnsdist-resolver.lua:89: in function <./dnsdist-resolver.lua:73>

This does not match the lines from the dockerdata/dnsdist-resolver.lua of the current master branch, which version are you using?

rgacogne avatar Jun 10 '25 11:06 rgacogne

It turned out that we had mapped dnsdist-resolver.lua into the container from outside (we made some changes to the script, as discussed at the beginning of the issue). I removed the mapping and now it starts OK; we will see how it goes...

bondar-aleksandr avatar Jun 10 '25 12:06 bondar-aleksandr

The dnsdist-master image doesn't leak, whereas dnsdist-19:1.9.10 is still leaking :(

bondar-aleksandr avatar Jun 11 '25 08:06 bondar-aleksandr