
DNS resolution in Helm-generated NGINX config does not work as expected

Open · Packetslave opened this issue 2 years ago · 1 comment

The generated nginx.conf file from the Mimir Helm chart has sections such as this:

resolver 1.2.3.4;

location /distributor {
    proxy_pass      http://my-distributor.namespace.service.cluster.local:8080$request_uri;
}

While this works, it seems to have a sneaky bug that can (and did) easily cause an outage:

According to this article, the above config resolves the distributor's IPs only when NGINX starts or reloads, even if the A records have a TTL of 5 seconds. This is obviously not ideal: if a pod moves off a host, its IP is likely to change, and NGINX keeps sending traffic to the old address. It's also very unexpected behavior unless you're already familiar with how NGINX resolves names.

According to the article, the above stanza does work as expected in NGINX Plus (DNS records in proxy_pass statements are resolved according to their TTL). It does not work in open-source NGINX unless something has changed very recently.

The fix is to generate the above section like so:

resolver 1.2.3.4;

location /distributor {
    set $BACKEND my-distributor.namespace.service.cluster.local;
    proxy_pass      http://$BACKEND:8080$request_uri;
}

Using a variable in proxy_pass causes NGINX to resolve the name at request time through the configured resolver, and to re-resolve the backend IPs when the TTL expires.
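A variation on the fix above, offered only as a sketch: the `resolver` directive accepts an optional `valid=` parameter that overrides the TTL NGINX uses for cached answers, which may help if the records carry a longer TTL than you want to tolerate. The `5s` value and the lowercase `$backend` variable name are illustrative assumptions, not something the chart is known to generate; the resolver address and hostname are the same placeholders used in the snippets above.

```nginx
# Sketch only: address, hostname, and valid= value are placeholders/assumptions.
resolver 1.2.3.4 valid=5s;   # valid= overrides the record TTL in NGINX's DNS cache

location /distributor {
    # Holding the hostname in a variable makes NGINX resolve it at request time
    # through the resolver above, instead of once at start or reload.
    set $backend my-distributor.namespace.service.cluster.local;
    proxy_pass      http://$backend:8080$request_uri;
}
```

Names looked up through `resolver` are not expanded with the search domains from /etc/resolv.conf, so the variable should hold a fully qualified name, as it does here.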

Packetslave · Dec 20 '22 22:12

Hi @Packetslave, thanks for pointing this out. Do you want to put up a PR for the fix? We would be happy to review it. If not, someone on the Mimir team can address it.

aldernero · Jan 09 '23 15:01