
NodeLocalDNS performance regression with coredns 1.8.3

Open prameshj opened this issue 3 years ago • 41 comments

I am seeing a higher number of cache misses for the same test in the newer nodelocaldns version - 1.20.

The test consists of ~3k alpine pods doing DNS lookups with searchpath expansion for roughly 30 mins. The number of cache misses is 4x more with the newer version.

5k hits/sec, 1268 misses/sec = total 6.2k/sec with coredns 1.8.3

6.2k hits/sec, 300 misses/sec = total 6.5k/sec with coredns 1.7.0

I am unable to reproduce this in a single-node cluster though. I ran a simple test that does a "dig kubernetes.svc.cluster.local" lookup every 1s, which will result in NXDOMAIN (the name does not include a namespace). This will cause 12 misses per minute, since NXDOMAIN is cached for 5s. I measured the number of cache misses for 1.8.3 and 1.7.0 and they look somewhat similar.

Both versions show 12 misses per minute for the most part, but every now and then they show 48 or 24 misses per minute. In 1.8.3 the cache misses occur more frequently; in 1.7.0 they appear to be slightly more batched.
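
For anyone repeating the single-node probe, here is a minimal sketch using the miekg/dns library that CoreDNS itself builds on. It assumes nodelocaldns is listening on the commonly used 169.254.20.10:53 link-local address (adjust for your setup) and counts NXDOMAIN answers for the namespace-less name over one minute:

```go
// Single-node probe: query a namespace-less service name once per second
// and count NXDOMAIN answers. With a 5s negative-cache TTL, roughly 12 of
// the 60 queries per minute should be cache misses in nodelocaldns.
package main

import (
	"fmt"
	"time"

	"github.com/miekg/dns"
)

func main() {
	// Assumption: nodelocaldns listens on the link-local address used by
	// most deployments; change this to match your cluster.
	const server = "169.254.20.10:53"

	c := new(dns.Client)
	nxdomain := 0

	for i := 0; i < 60; i++ {
		m := new(dns.Msg)
		m.SetQuestion("kubernetes.svc.cluster.local.", dns.TypeA)

		r, rtt, err := c.Exchange(m, server)
		switch {
		case err != nil:
			fmt.Println("query error:", err)
		case r.Rcode == dns.RcodeNameError: // NXDOMAIN
			nxdomain++
			fmt.Printf("NXDOMAIN after %v\n", rtt)
		}
		time.Sleep(time.Second)
	}
	fmt.Println("NXDOMAIN responses in 1 minute:", nxdomain)
}
```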

prameshj avatar Sep 02 '21 19:09 prameshj

Creating this issue to identify what is causing the extra cache misses and whether any changes in the cache plugin in 1.8.3 could explain it.

/cc @chrisohaver

prameshj avatar Sep 02 '21 19:09 prameshj

Can you provide more details on how to reproduce this?

SuperQ avatar Nov 14 '21 10:11 SuperQ

The issue can be repro'ed by running a large number of DNS client pods that use the image and parameters here and here.

In the automated test, roughly 3k pods were running, and each time the cache misses in nodelocaldns were higher with coreDNS 1.8.3.

prameshj avatar Nov 15 '21 06:11 prameshj

Is the ratio of pod:node higher in the 3k pod test than smaller scale tests? I would think it would be similarly proportioned across all scale tests. Ditto for QPS per nodelocaldns instance. And I presume that the cluster service discovery deployment is CoreDNS, and is scaled proportionally to the number of nodes.

  • Are the nodelocaldns instances using more CPU in the 3k tests?
  • Is the response latency increased compared to smaller tests, for hits and for misses?
  • Have any versions in between 1.7.0 and 1.8.3 been tested? How about > 1.8.3?
  • Is there any information/metrics available for the CoreDNS cluster Service Discovery Deployment during the test? Perhaps the cluster CoreDNS deployment is slower to respond in the 3k test, such that when an entry does expire in nodelocaldns more queries are received before the cache is refreshed.

cache related changes after 1.7.0 through 1.8.3:

1.7.1: https://github.com/coredns/coredns/pull/4085 - handle caching responses for DNSSEC requests
1.8.0: https://github.com/coredns/coredns/pull/4148 - follow-up fix to coredns/coredns#4085
1.8.0: https://github.com/coredns/coredns/pull/4190 - follow-up fix to coredns/coredns#4085
1.8.1: https://github.com/coredns/coredns/pull/4368 - prefetch related (I don't think nodelocaldns configures prefetching)

cache related changes >= 1.8.4:

1.8.5: https://github.com/coredns/coredns/pull/4736 - cache-miss-related follow-up fix to coredns/coredns#4085
1.8.5: https://github.com/coredns/coredns/pull/4781 - prometheus metrics schema change (deprecating the cache-hit metric)

If cache is to blame, then the DNSSEC change in 1.7.1 seems like the most suspicious thing. Post 1.8.3, https://github.com/coredns/coredns/pull/4736 is at least topically related.

If not cache related then possibly forward related.
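
For context on why the DNSSEC change is the prime suspect: as discussed further down in this thread, after coredns/coredns#4085 a cache miss is refreshed upstream with the DO (DNSSEC OK) bit set, presumably so the cached answer can serve both DO and non-DO clients. A rough sketch of what that upstream query looks like, using miekg/dns (illustration only, not CoreDNS's actual cache code):

```go
// Illustration of the query shape being discussed: on a cache miss the
// refresh query carries an EDNS0 OPT record with the DO bit set, which is
// what the upstream (kube-dns/dnsmasq here) ends up seeing.
package main

import (
	"fmt"

	"github.com/miekg/dns"
)

// refreshQuery builds the kind of upstream query a DO-enabled cache miss
// would generate. This is a sketch, not the plugin's real code path.
func refreshQuery(name string, qtype uint16) *dns.Msg {
	m := new(dns.Msg)
	m.SetQuestion(dns.Fqdn(name), qtype)
	m.SetEdns0(4096, true) // advertise a 4k UDP buffer and set DO=1
	return m
}

func main() {
	m := refreshQuery("kubernetes.default.svc.cluster.local", dns.TypeA)
	opt := m.IsEdns0()
	fmt.Println("DO bit on upstream query:", opt != nil && opt.Do())
}
```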

chrisohaver avatar Nov 15 '21 15:11 chrisohaver

Is the ratio of pod:node higher in the 3k pod test than smaller scale tests?

Yes, the ratio is much higher. In the smaller test, I only ran a couple of pods. All pods look up the same names, so I assumed one pod's lookups can't flush cached entries for a different pod and cause more cache misses that way.

I compared 3k client pods + CoreDNS 1.7.0 as NodeLocalDNS against 3k client pods + CoreDNS 1.8.3 as NodeLocalDNS, and a single client pod + CoreDNS 1.7.0 as NodeLocalDNS against a single client pod + CoreDNS 1.8.3 as NodeLocalDNS.

Have any versions in between 1.7.0 and 1.8.3 been tested? How about > 1.8.3?

No, I was meaning to do this, but haven't gotten a chance yet. Let me try 1.8.4 sometime this week and report back.

I suspected the DNSSEC caching change - but if that were the case, it should have shown more cache misses in the single-pod test too.

prameshj avatar Nov 16 '21 01:11 prameshj

Thanks for some great leads! Let me try a smaller scale test with similar pod density and also try 1.8.4 coreDNS and report back.

prameshj avatar Nov 16 '21 01:11 prameshj

Let me try 1.8.4 sometime this week and report back.

I think 1.8.5 would be better to test if you do test again. There was one cache-related fix in 1.8.5, but I didn't see anything cache-related in 1.8.4.

chrisohaver avatar Nov 16 '21 01:11 chrisohaver

I haven't had a chance to try it in 1.8.5 yet. I tried to update the dependency but kept running into errors like:

/go/pkg/mod/k8s.io/[email protected]/logs/config.go:41:22: cannot use nil as type logr.Logger in argument to LogRegistry.Register
/go/pkg/mod/k8s.io/[email protected]/logs/registry.go:56:3: cannot use nil as type logr.Logger in return argument

I can push my WIP PR if that helps.

~~I do have a coredns 1.7.1 image that I can try. Will update here by next week.~~ Update - I did not see the issue with 1.7.1, 1.8.1 or 1.8.2. I see the issue in 1.8.3. According to the release notes, 1.8.2 was not tagged correctly, so the test passing in 1.8.2 can be disregarded.

Some change between 1.8.1 and 1.8.3 is causing this. NodeLocalDNS does not use any of the plugins included in the release notes, except the forward plugin. In forward, only metrics were changed.

The core changes are:

core: Also clear do and size (https://github.com/coredns/coredns/pull/4465)
core: Flag blacklisting not needed anymore (https://github.com/coredns/coredns/pull/4420)
core: Set http request in writer (https://github.com/coredns/coredns/pull/4445)

I can try setting the DO bit and size and see if that changes anything.
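
As a hedged sketch of that experiment (not the actual core/forward code paths), this is the kind of helper I would use to strip the DO bit and reset the advertised UDP size on an outgoing query before it reaches kube-dns, again with miekg/dns:

```go
// Hypothetical helper for the experiment above: clear the DO bit and reset
// the advertised UDP buffer size on a query before it is forwarded, so the
// upstream sees a plain non-DNSSEC request. Sketch only.
package main

import (
	"fmt"

	"github.com/miekg/dns"
)

func stripDO(m *dns.Msg) {
	if opt := m.IsEdns0(); opt != nil && opt.Do() {
		opt.SetDo(false)
		opt.SetUDPSize(512) // back to the plain-DNS default size
	}
}

func main() {
	// Simulate a cache-miss query that arrives with DO set.
	m := new(dns.Msg)
	m.SetQuestion("kubernetes.default.svc.cluster.local.", dns.TypeA)
	m.SetEdns0(4096, true)

	stripDO(m)
	opt := m.IsEdns0()
	fmt.Println("DO after strip:", opt.Do(), "bufsize:", opt.UDPSize())
}
```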

prameshj avatar Dec 03 '21 18:12 prameshj

@chrisohaver, any thoughts on the previous comment? Thanks!

prameshj avatar Dec 20 '21 18:12 prameshj

I did not see the issue with ... 1.8.2. I see the issue in 1.8.3.

That's really odd, because the tags 1.8.2 and 1.8.3 contain the same code. The only difference is that 1.8.3 has a fix to the release script, but that does not affect the code, and is not compiled into coredns in any way.

1.8.2 was not tagged correctly, so the test passing in 1.8.2 can be disregarded.

I don't think that's true. 1.8.2 had a release failure, where the container image build script failed. The 1.8.2 tag is still a "valid tag" and if built, should be identical to 1.8.3.

chrisohaver avatar Dec 20 '21 19:12 chrisohaver

I compared the 1.8.2 and 1.8.3 tags to confirm there were no code changes between those two tags ...

https://github.com/coredns/coredns/compare/v1.8.2...v1.8.3

chrisohaver avatar Dec 20 '21 20:12 chrisohaver

I compared the 1.8.2 and 1.8.3 tags to confirm there were no code changes between those two tags ...

https://github.com/coredns/coredns/compare/v1.8.2...v1.8.3

Yup, I noticed the same too. Now I can't repro it reliably in 1.8.3 either. I will try to configure a CI job to see how often we see it.

prameshj avatar Dec 21 '21 19:12 prameshj

I was able to repro in 1.7.1 through 1.8.3 - the tests pass if the number of kube-dns replicas is increased (so there are more DNS pods to serve the cache misses), but fail with the default autoscaler parameters.

The test passes with 1.7.0. I manually patched in the change https://github.com/coredns/coredns/pull/4736, but that did not help either. I will try after reverting the DNSSEC changes in https://github.com/coredns/coredns/pull/4085 locally.

Not sure how the DNSSEC changes could cause this, though - if I understand correctly, a cache miss sets the DO bit when sending the request upstream. Perhaps dnsmasq (in kube-dns) adds a processing delay as a result of this flag.

prameshj avatar Dec 24 '21 01:12 prameshj

I tried to revert https://github.com/coredns/coredns/pull/4085 on top of coredns 1.7.1 - https://github.com/kubernetes/dns/commit/373ebff7354b7c264c934313f933f6a4cb2aa879

But nodelocaldns cache misses still show the DNSSEC OK bit set (see screenshot).

Cache misses from coredns 1.7.0 do not show the DNSSEC OK bit (see screenshot).

@chrisohaver any other changes needed to disable default DNSSEC? Full list of changes between 1.7.0 and 1.7.1 - https://github.com/coredns/coredns/compare/v1.7.0...v1.7.1

prameshj avatar Jan 25 '22 23:01 prameshj

any other changes needed to disable default DNSSEC?

No, that should be it. The commit 373ebff is titled "Manually revert DNSSEC by default in coreDNS". The "manually" part suggests that maybe it was done line by line, in which case perhaps some part of coredns/coredns#4085 was missed.

chrisohaver avatar Jan 25 '22 23:01 chrisohaver

any other changes needed to disable default DNSSEC?

No, that should be it. The commit 373ebff is titled "Manually revert DNSSEC by default in coreDNS". The "manually" part suggests that maybe it was done line by line, in which case perhaps some part of coredns/coredns#4085 was missed.

I did this with just git revert acf9a0fa19928e605ac8ac3314890c9fef73e16b on my local coredns fork and copied over all the modified files into the k8s/dns/vendor directory.

All files except README and *_test.go were applied. Is there a better way to do this?

prameshj avatar Jan 26 '22 00:01 prameshj

Perhaps try reverting the entire cache plugin directory to 1.7.0, and see if the 4X cache miss rate is still seen in tests. This would narrow down the cause to a cache plugin change, or something else entirely.

Are you judging "able to repro" based on tests passing/failing, or by measuring cache misses? If the former, then results could be flaky/erratic - i.e. could barely pass one time and barely fail another time.

chrisohaver avatar Jan 26 '22 00:01 chrisohaver

Perhaps try reverting the entire cache plugin directory to 1.7.0, and see if the 4X cache miss rate is still seen in tests. This would narrow down the cause to a cache plugin change, or something else entirely.

Good idea. I tried this, and metrics were the only additional change - https://github.com/kubernetes/dns/commit/8ac5a648b7b3f1296997679c752c910afeecae7c. I suspect some change in the forward plugin might be setting DNSSEC OK.

Are you judging "able to repro" based on tests passing/failing, or by measuring cache misses? If the former, then results could be flaky/erratic - i.e. could barely pass one time and barely fail another time.

Based on tests passing/failing. They fail if there are DNS drops, which happen when the large number of cache misses overwhelms kube-dns. But since my test image is still setting the DO bit on cache misses, there is no point running at scale until I fix that.

I think the 4x cache misses might be a cascading effect - the DO bit somehow causes the upstream to slow down, which fails other queries (that could have been successful and cached), so there are more cache misses. That's my theory.

prameshj avatar Jan 26 '22 23:01 prameshj

I don't see any obvious changes in the forward plugin either.

Taking a step back, even if the default DNSSEC change is the trigger, I don't see any workaround other than staying at the older 1.7.0 version. @chrisohaver do we expect the DNSSEC behavior to be configurable on the cache plugin?

Until we can figure out how to handle the increased load, we can stay at this older version. If there is a pressing need to pick up a newer coreDNS version, things can be evaluated then.

cc @bowei @mag-kol @cezarygerard

prameshj avatar Jan 27 '22 00:01 prameshj

I’ll carve out some time this week to dedicate to looking into this in more detail.

@chrisohaver do we expect the DNSSEC behavior to be configurable on the cache plugin?

Highly unlikely. If it really ends up being an actual issue, then DNSSEC cache support would probably be re-implemented differently.

Edit: To clarify, if the presence of a DO bit proves to be an issue, CoreDNS would probably re-implement cache support for DNSSEC replies in a different way (e.g. not setting DO, and caching responses to DO/non-DO requests separately). It's unlikely that CoreDNS would make it an option to support caching DO request responses in two different ways.
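
For illustration, the "cache DO and non-DO responses separately" alternative would boil down to including the DO bit in the cache key, something like the hypothetical sketch below (not CoreDNS's actual key function):

```go
// Hypothetical cache key that keeps DO and non-DO entries apart instead of
// upgrading every miss to a DO query. Sketch only; names and layout are
// made up for illustration.
package main

import (
	"fmt"
	"hash/fnv"
	"strings"

	"github.com/miekg/dns"
)

func cacheKey(qname string, qtype uint16, do bool) uint64 {
	h := fnv.New64()
	h.Write([]byte(strings.ToLower(qname)))
	h.Write([]byte{byte(qtype >> 8), byte(qtype)})
	if do {
		h.Write([]byte{1}) // DO requests get their own cache entry
	} else {
		h.Write([]byte{0})
	}
	return h.Sum64()
}

func main() {
	name := "kubernetes.default.svc.cluster.local."
	fmt.Println("non-DO key:", cacheKey(name, dns.TypeA, false))
	fmt.Println("DO key:    ", cacheKey(name, dns.TypeA, true))
}
```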

chrisohaver avatar Jan 27 '22 01:01 chrisohaver

I think the 4x cache misses might be a cascading effect - the DO bit somehow causes the upstream to slow down, which fails other queries (that could have been successful and cached), so there are more cache misses. That's my theory.

@prameshj, Can you run an isolated test that compares the kubedns+dnsmasq latency of a single query with DO bit set, and DO bit not set? There should be a significant difference in latency if this is true. It should be on the order of 4X longer, since it's the latency that most significantly drives the number of cache misses, right?
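
A minimal sketch of such a comparison, again with miekg/dns; KUBE_DNS_IP is a hypothetical placeholder for the kube-dns service ClusterIP:

```go
// Send the same question twice, once without and once with the DO bit, and
// print the round-trip time for each, to see whether DO alone slows down
// kube-dns/dnsmasq. Sketch only.
package main

import (
	"fmt"
	"os"

	"github.com/miekg/dns"
)

func timeQuery(server, name string, do bool) {
	m := new(dns.Msg)
	m.SetQuestion(dns.Fqdn(name), dns.TypeA)
	if do {
		m.SetEdns0(4096, true)
	}
	c := new(dns.Client)
	_, rtt, err := c.Exchange(m, server+":53")
	if err != nil {
		fmt.Printf("do=%v error: %v\n", do, err)
		return
	}
	fmt.Printf("do=%v rtt=%v\n", do, rtt)
}

func main() {
	server := os.Getenv("KUBE_DNS_IP") // hypothetical: kube-dns ClusterIP
	timeQuery(server, "kubernetes.default.svc.cluster.local", false)
	timeQuery(server, "kubernetes.default.svc.cluster.local", true)
}
```

Running the same pair of queries against the external upstream as well would help separate a dnsmasq-specific effect from general upstream latency.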

chrisohaver avatar Jan 27 '22 19:01 chrisohaver

@prameshj, are all queries in the test going to kube-dns, or is a portion going to an upstream dns server?

chrisohaver avatar Jan 27 '22 20:01 chrisohaver

@prameshj, how many unique names are queried in the test? I've been assuming it's a very small number.

chrisohaver avatar Jan 27 '22 20:01 chrisohaver

I think the 4x cache misses might be a cascading effect - the DO bit somehow causes the upstream to slow down, which fails other queries (that could have been successful and cached), so there are more cache misses. That's my theory.

@prameshj, Can you run an isolated test that compares the kubedns+dnsmasq latency of a single query with DO bit set, and DO bit not set? There should be a significant difference in latency if this is true. It should be on the order of 4X longer, since it's the latency that most significantly drives the number of cache misses, right?

Good idea, I can try this.

are all queries in the test going to kube-dns, or is a portion going to an upstream dns server?

only kube-dns. The forwarded latency is high when upstream is kube-dns.

how many unique names are queried in the test? I've been assuming it's a very small number.

It is a small number - https://github.com/kubernetes/perf-tests/blob/release-1.23/clusterloader2/testing/load/configmap.yaml#L16-L40

prameshj avatar Jan 27 '22 21:01 prameshj

only kube-dns. The forwarded latency is high when upstream is kube-dns.

There are some names in there that would normally be sent to the upstream name servers, e.g. google.com. How do the test instances of node-cache handle those queries? Do they drop them?

Other than google.com, do any of the other names return NXDOMAIN? e.g. kubernetes.cluster.local is one.

chrisohaver avatar Jan 27 '22 21:01 chrisohaver

OK, assuming that metrics-server is a valid service in the same namespace as the client pods (which I am assuming is default), it looks like the list would expand to 52 unique names, based on search path expansion. That's still quite small. Only 2 queries would be sent to the upstream DNS: google.com for A/AAAA.

I'm also assuming there are no local domains in /etc/resolv.conf in the clusterfirst search path. If there are, there would be 21 additional unique queries per local domain, and those would all be sent to the upstream DNS.

Each query, with expected number of search path expansions:

1    kubernetes A
1    kubernetes AAAA
4    kubernetes.cluster.local A
4    kubernetes.cluster.local AAAA
4    kubernetes.svc.cluster.local A
4    kubernetes.svc.cluster.local AAAA
4    kubernetes.default.svc.cluster.local A
4    kubernetes.default.svc.cluster.local AAAA
4    google.com.svc.cluster.local A
4    google.com.svc.cluster.local AAAA
4    google.com.default.svc.cluster.local A
4    google.com.default.svc.cluster.local AAAA
4    google.com.cluster.local A
4    google.com.cluster.local AAAA
4    google.com A
4    google.com AAAA
4    1-2-3-4.default.pod.cluster.local A
1    metrics-server A
1    metrics-server AAAA
4    metrics-server.cluster.local A
4    metrics-server.cluster.local AAAA
4    metrics-server.svc.cluster.local A
4    metrics-server.svc.cluster.local AAAA
4    metrics-server.kube-system.svc.cluster.local A
4    metrics-server.kube-system.svc.cluster.local AAAA
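
The per-name multiplication above comes from cluster-first search path expansion. A rough sketch of the mechanics, assuming the usual resolv.conf of ndots:5 with the three default search domains (the exact counts in the table also depend on the fixed query list the test uses):

```go
// Simplified search path expansion for a cluster-first resolv.conf.
// Real resolvers also fall back to the search list for names with >= ndots
// dots; that fallback is omitted here to keep the sketch short.
package main

import (
	"fmt"
	"strings"
)

var searchList = []string{"default.svc.cluster.local", "svc.cluster.local", "cluster.local"}

const ndots = 5

func expand(name string) []string {
	if strings.HasSuffix(name, ".") || strings.Count(name, ".") >= ndots {
		return []string{name} // "absolute enough": queried as-is
	}
	out := make([]string, 0, len(searchList)+1)
	for _, d := range searchList {
		out = append(out, name+"."+d)
	}
	return append(out, name) // the literal name is tried last
}

func main() {
	for _, n := range []string{"kubernetes", "google.com", "metrics-server"} {
		e := expand(n)
		fmt.Printf("%s -> %d queries per record type: %v\n", n, len(e), e)
	}
}
```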

chrisohaver avatar Jan 27 '22 22:01 chrisohaver

OK, assuming that metrics-server is a valid service in the same namespace as the client pods (which I am assuming is default)

Actually, per the other queries, I see metrics-server is probably in kube-system, so those 2 would also be sent upstream like the google.com ones (and they'd also spawn 6 additional queries to kube-dns).

And if the client pods are not in default, the 2 kubernetes queries would also be sent upstream (and 6 additional queries to kube-dns).

chrisohaver avatar Jan 27 '22 22:01 chrisohaver

There are some names in there that would normally be sent to the upstream name servers, e.g. google.com. How do the test instances of node-cache handle those queries? Do they drop them?

Sorry, I misspoke. There is a valid upstream server that responds to google.com and other queries not ending with "cluster.local". The test uses the dnsperf tool, which DOES NOT perform search path expansion. That's why the query list simulates an expanded lookup.

The increased latency comes from the upstream kube-dns service (when I group forwarding latency by the "to" label): kube-dns shows frequent 8s latency, while the (external) upstream shows under 1s.


prameshj avatar Jan 28 '22 02:01 prameshj

Sorry, I misspoke. There is a valid upstream server that responds to google.com and other queries not ending with "cluster.local". The increased latency comes from the upstream kube-dns service [...]: kube-dns shows frequent 8s latency, while the (external) upstream shows under 1s.

Thanks. What is serving the upstream queries? Do you see increased latency for responses from upstream in 1.7.1+ vs 1.7.0?

chrisohaver avatar Jan 28 '22 15:01 chrisohaver

@prameshj, do you know what the configuration of dnsmasq is in a kube-dns deployment (i.e. flags and contents of the config file)? The dnsmasq tool does have DNSSEC support, so we should verify how that is configured to see if there could be any interaction there.

Is it possible to run these tests with CoreDNS as the cluster service discovery instead of kube-dns for comparison?

chrisohaver avatar Jan 28 '22 16:01 chrisohaver