
Unbound Serve expired; cache hit rate reducing with time

Open sirizake opened this issue 1 year ago • 7 comments

Hi, I have installed Unbound 1.20.0 on a FreeBSD 14 server. It was working fine until the server lost connectivity to the upstream internet provider. Before that, the average cache hit rate on the server was 99.0%, with only 1% recursive replies. Part of my unbound.conf file is shown below:

server:
    prefetch: yes
    serve-expired: yes
    # serve-expired-ttl: 0
    # serve-expired-ttl-reset: no

After the loss of connectivity, the average cache hit rate has dropped to 14%, while recursive queries are showing 86% (connectivity is still not restored). My expectation is that the caching server should continue to serve expired records and keep the cache hit rate high, because serve-expired-ttl is left at its default (meaning it should continue serving cached content until the upstream is restored). My observation is the opposite.

Is there anything I am missing? How can I ensure that the caching server will continue serving cached data for several days after upstream connectivity is lost?

Regards,
Isaac

sirizake avatar Aug 01 '24 10:08 sirizake

If this is measured with the cachemiss or cachehits counters: it turns out that the code counts serve-expired refresh attempts as a cachemiss. The cachemiss counter is increased when a query comes in, gets an expired answer, and recursion then takes place to refresh the data item. The counter for the number of expired answers is incremented as well. So perhaps the reading is due to the statistics counters, not the server's response behaviour itself.

wcawijngaards avatar Aug 01 '24 10:08 wcawijngaards

[ Edited the issue post to put the config in a code block, that shows the commented out entries with #. ]

wcawijngaards avatar Aug 01 '24 10:08 wcawijngaards

@wcawijngaards Thanks. But is there a way to measure the server response behaviour itself in this circumstance?

sirizake avatar Aug 01 '24 10:08 sirizake

That would need a test to see whether an expired answer is available at that time. That a recursion is started to refresh it is not so much the problem, I would think. If the cache size were too small, expired answers from the cache could also fail.

wcawijngaards avatar Aug 01 '24 15:08 wcawijngaards
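To illustrate the cache-size point in the reply above: expired answers can only be served while the records are still held in the cache, so the two main cache limits are the settings to check. A minimal sketch, assuming the sizes below (they are illustrative values, not taken from this thread, and should be adapted to the memory available on the server):

```
server:
    # Assumption: 128m and 256m are placeholder sizes for illustration only.
    # msg-cache-size bounds the message cache, rrset-cache-size the RRset cache.
    msg-cache-size: 128m
    # The Unbound documentation suggests an RRset cache of roughly twice
    # the message cache size.
    rrset-cache-size: 256m
```

If these limits are too low for the query volume, older entries may be evicted before they can be served as expired answers.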

I see that the total entries cached and the total cache memory values have also remained the same over time.

sirizake avatar Aug 06 '24 21:08 sirizake

Hi, adding support for this, as per conversations on the mailing lists. We have experienced a very similar issue when the upstream forwarder misbehaves (responding with negative responses, often due to community block-list churn or similar), and also when the internet connection has simply failed.

From our experience it appears that the forwarder/recurse/prefetch logic paths are able to poison cache records with negative responses.

We are looking for configuration options that make Unbound never cache negative responses. We are aware you can limit the TTL of negatively cached entries, which provides some of this. However, poisoning the cache works against the serve-stale and RFC 8767 goodness.

Will try and post some explicit examples to reproduce when I find time. Thanks again for all the great work :)

andylemin avatar Aug 07 '24 22:08 andylemin

Unbound generates negative responses from cache when aggressive-nsec and harden-below-nxdomain are in use. Both can be turned off. harden-below-nxdomain regularly turns up as a problem with locally used private domains, where upper labels of the name receive an NXDOMAIN answer from the forwarder or recursor. Not sure if that is the issue, but perhaps worth looking at. Another thing that complicates recursor logic and could emit negatives is query name minimisation, qname-minimisation, which creates queries for upper labels. It can also be turned off, by setting it to no.

wcawijngaards avatar Aug 08 '24 09:08 wcawijngaards
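
For reference, the options named in the last reply could be switched off along these lines while testing whether they are the source of the negative answers. This is a minimal sketch, not a recommended production configuration: the first three settings only restate the option names from the reply, and the negative-TTL cap is an assumed, illustrative value for the TTL limit mentioned earlier in the thread.

```
server:
    # Do not synthesise negative answers from cached NSEC records.
    aggressive-nsec: no
    # Do not treat names below a cached NXDOMAIN as nonexistent.
    harden-below-nxdomain: no
    # Do not send minimised queries for upper labels.
    qname-minimisation: no
    # Assumption: 60 seconds is an arbitrary example cap on how long
    # negative answers stay cached; pick a value that suits the deployment.
    cache-max-negative-ttl: 60
```

Turning these off one at a time makes it easier to see which option, if any, changes the behaviour. They relax hardening and privacy features, so it may be worth re-enabling whichever ones turn out not to be involved.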