
nginx-ingress pod crashloops if it cannot re-download the GeoIP2 database


NGINX Ingress controller version: 1.1.0

Kubernetes version: v1.21.7

Environment:

  • Cloud provider or hardware configuration: BareMetal
  • OS (e.g. from /etc/os-release): Ubuntu
  • Kernel (e.g. uname -a): 5.4.0-91-generic
  • How was the ingress-nginx-controller installed: Helm chart ingress-nginx-4.0.13 (controller 1.1.0)

What happened: If, for any reason, nginx-ingress-controller fails to download the GeoIP2 database from https://download.maxmind.com/app/geoip_download... during pod start, the pod will crashloop even if a previously downloaded database is already present in /etc/nginx/geoip on a persistent volume.

What you expected to happen: It should fail to start only if it cannot download the GeoIP2 database AND it cannot find a previously downloaded database in /etc/nginx/geoip. As it stands, if there is any problem with the MaxMind website, the ingress controller will refuse to start.

How to reproduce it:

  1. Deploy nginx-ingress to use a persistent volume mounted at /etc/nginx/geoip/ and also set the maxmindLicenseKey.
  2. After the first successful run, the GeoIP2 databases will be in /etc/nginx/geoip/.
  3. Cut access to download.maxmind.com using a firewall to simulate a failure on their website
  4. Delete the nginx ingress pod to force it to be recreated. It will never start until it can connect to download.maxmind.com.

/kind bug

jsalatiel avatar Dec 20 '21 12:12 jsalatiel

@jsalatiel: This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Dec 20 '21 12:12 k8s-ci-robot

/remove-kind bug
/kind support

Please add more information. It would help to elaborate on why it's a bug in the ingress-nginx controller if the download of the controller image, or the download of a dependency like the GeoIP db, is broken due to a firewall.

longwuyuan avatar Dec 22 '21 11:12 longwuyuan

The container image is downloaded and run, but the logs show the GeoIP db download failing, and thus the container will never reach the ready state.

jsalatiel avatar Dec 24 '21 00:12 jsalatiel

If your firewall is blocking the download, it's obviously not a bug in the controller. Can you elaborate on what you want the controller to do if your firewall is blocking the download of the GeoIP db?

longwuyuan avatar Dec 24 '21 05:12 longwuyuan

Hi @longwuyuan, sorry, I probably did not express myself the right way. I used the firewall example just to show how the bug could be reproduced. I have no firewall blocking that. The real problem is: in the last couple of months, three times the download.maxmind.com URL was down while I was doing some maintenance on my cluster, and because of that the nginx controller was not able to start EVEN though it had already downloaded the databases on a previous run and they were saved on a persistent volume. If a previous database exists AND nginx cannot update that database on container start, that should not be fatal IMHO.
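
For illustration, the requested fallback could look roughly like the sketch below. This is a hypothetical sketch of the idea only, not the controller's actual code; the geoIPPath constant, the edition names, and the downloadDatabases helper are made-up placeholders.

// Hypothetical sketch of the requested startup fallback (not ingress-nginx code).
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
)

// Assumed location of the databases on the persistent volume, as in this issue.
const geoIPPath = "/etc/nginx/geoip"

// hasCachedDatabases reports whether at least one .mmdb file from a previous
// successful run is already present on disk.
func hasCachedDatabases(editions []string) bool {
	for _, e := range editions {
		if _, err := os.Stat(filepath.Join(geoIPPath, e+".mmdb")); err == nil {
			return true
		}
	}
	return false
}

// downloadDatabases stands in for the real MaxMind download logic.
func downloadDatabases(editions []string) error {
	// ... fetch from download.maxmind.com and unpack into geoIPPath ...
	return fmt.Errorf("download not implemented in this sketch")
}

func main() {
	editions := []string{"GeoLite2-City", "GeoLite2-ASN"} // example edition IDs

	if err := downloadDatabases(editions); err != nil {
		if hasCachedDatabases(editions) {
			// Requested behavior: warn and keep going with the cached databases.
			log.Printf("GeoIP2 download failed (%v); continuing with cached databases in %s", err, geoIPPath)
		} else {
			// Nothing downloaded and nothing on disk: still fatal.
			log.Fatalf("GeoIP2 download failed and no cached database found: %v", err)
		}
	}
	// ... continue with normal controller startup ...
}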

jsalatiel avatar Dec 24 '21 13:12 jsalatiel

That will require some research into how MaxMind database use & updates work in the controller. I think your expectation could be valid, but I don't know for sure. Would you be able to submit a PR?

Thanks,
Long Wu Yuan


longwuyuan avatar Dec 24 '21 17:12 longwuyuan

@theunrealgeek, have you worked with maxmind geoip db internals before?

longwuyuan avatar Dec 25 '21 04:12 longwuyuan

have you worked with maxmind geoip db internals before

Not really with the internals of Maxmind db itself.

But I think the problem here is somewhere else, probably in the controller code that runs before nginx starts up, where the initial download happens. I haven't looked at that part of the code yet.

theunrealgeek avatar Dec 25 '21 04:12 theunrealgeek

Updating the GeoIP db seems to be in the critical path and I suppose that is absolutely correct because the alternative is to bring up the controller with a potentially outdated GeoIP db.

Then it's like running the controller with an updated GeoIP db when possible and ALSO with an outdated GeoIP db as a fallback, which basically breaks all the features that depend on GeoIP.

Personally I don't think it's a bug at all; just like a healthy network is a requirement, updating the GeoIP db on each init is a hard requirement, and a failure there should stop the controller from running.

But all above are just opinions and I could be completely wrong as I have no exposure to GeoIP.

Thanks,
Long Wu Yuan


longwuyuan avatar Dec 25 '21 05:12 longwuyuan

I don't think running with a potentially outdated GeoIP db is really a problem. If that db is in the persistent volume, it means it is the one that was actively in use before the nginx pod was restarted/killed. Also, since there is a database there, any nginx rules relying on GeoIP will work anyway. And this is much better than being unable to recover your cluster after a node failure (without disabling the GeoIP download) if MaxMind (an external site) is offline. Of course, this is also just my opinion. I can't code, so I can't submit a PR.

jsalatiel avatar Dec 25 '21 12:12 jsalatiel

There are multiple aspects here.

(1) Running the controller only after a fresh download of the GeoIP-DB, hence ensuring the latest IP list.

(2) Attempting an update of the GeoIP-DB.

(3) Running the pod even on a failed attempt, with the assumption that the result of the previous successful update is good enough to continue.

(4) Existence or absence of a copy of the latest updated GeoIP-DB in some accessible location.

I too cannot code, so I can't attempt a PR here. @theunrealgeek will help out with thoughts on the code.

I am not in favor of changing code when it's not broken. So far there is no data showing a GeoIP-DB related broken controller for multiple users of the GeoIP-DB, because no other user has reported it.

Thanks,
Long Wu Yuan


longwuyuan avatar Dec 25 '21 13:12 longwuyuan

Running into this same issue when running locally for dev after upgrading from 0.45.0 to 1.1.1.

This issue started in release 0.49.0 - https://github.com/kubernetes/ingress-nginx/releases/tag/controller-v0.49.0. PR https://github.com/kubernetes/ingress-nginx/pull/7242 looks like it may be the cause.

0.45.0 - Locally, we define a dummy maxmind license. You can see that although nginx did log an error, it didn't crash, and it continued to boot without problems.

flags.go:310] "downloading maxmind GeoIP2 databases"
flags.go:312] "unexpected error downloading GeoIP2 database" err="HTTP status 401 Unauthorized"
store.go:886] The GeoIP2 feature is enabled but the databases are missing. Disabling

1.1.1 - Same setup and configuration, just an updated version. The pod crashes and keeps looping. If I remove the --maxmind-license-key flag, it starts without issue.

I0125 20:17:01.611725       8 flags.go:344] "downloading maxmind GeoIP2 databases"
E0125 20:17:01.789163       8 flags.go:346] "unexpected error downloading GeoIP2 database" err="HTTP status 401 Unauthorized"
F0125 20:17:01.789425       8 main.go:67] HTTP status 401 Unauthorized
goroutine 1 [running]:
k8s.io/klog/v2.stacks(0x1)
    k8s.io/klog/[email protected]/klog.go:1026 +0x8a
k8s.io/klog/v2.(*loggingT).output(0x28a9ac0, 0x3, {0x0, 0x0}, 0xc0004138f0, 0x1, {0x1f86a9b, 0xc000062400}, 0xc00001ba70, 0x0)
    k8s.io/klog/[email protected]/klog.go:975 +0x63d
k8s.io/klog/v2.(*loggingT).printDepth(0x1, 0x1, {0x0, 0x0}, {0x0, 0x0}, 0x445131, {0xc00001ba70, 0x1, 0x1})
    k8s.io/klog/[email protected]/klog.go:735 +0x1ba
k8s.io/klog/v2.(*loggingT).print(...)
    k8s.io/klog/[email protected]/klog.go:717
k8s.io/klog/v2.Fatal(...)
    k8s.io/klog/[email protected]/klog.go:1494
main.main()
    k8s.io/ingress-nginx/cmd/nginx/main.go:67 +0x1d3

goroutine 6 [chan receive]:
k8s.io/klog/v2.(*loggingT).flushDaemon(0x0)
    k8s.io/klog/[email protected]/klog.go:1169 +0x6a
created by k8s.io/klog/v2.init.0
    k8s.io/klog/[email protected]/klog.go:420 +0xfb

goroutine 15 [IO wait]:
internal/poll.runtime_pollWait(0x7f2f773eb530, 0x72)
    runtime/netpoll.go:234 +0x89
internal/poll.(*pollDesc).wait(0xc00015eb00, 0xc0000de480, 0x0)
    internal/poll/fd_poll_runtime.go:84 +0x32
internal/poll.(*pollDesc).waitRead(...)
    internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc00015eb00, {0xc0000de480, 0x1a52, 0x1a52})
    internal/poll/fd_unix.go:167 +0x25a
net.(*netFD).Read(0xc00015eb00, {0xc0000de480, 0xc0000de57f, 0x1a})
    net/fd_posix.go:56 +0x29
net.(*conn).Read(0xc000418028, {0xc0000de480, 0x417186, 0xc0004cd7f8})
    net/net.go:183 +0x45
crypto/tls.(*atLeastReader).Read(0xc0002b20c0, {0xc0000de480, 0x0, 0x409dcd})
    crypto/tls/conn.go:777 +0x3d
bytes.(*Buffer).ReadFrom(0xc00003c5f8, {0x1b70980, 0xc0002b20c0})
    bytes/buffer.go:204 +0x98
crypto/tls.(*Conn).readFromUntil(0xc00003c380, {0x1b73000, 0xc000418028}, 0x1958)
    crypto/tls/conn.go:799 +0xe5
crypto/tls.(*Conn).readRecordOrCCS(0xc00003c380, 0x0)
    crypto/tls/conn.go:606 +0x112
crypto/tls.(*Conn).readRecord(...)
    crypto/tls/conn.go:574
crypto/tls.(*Conn).Read(0xc00003c380, {0xc0000fe000, 0x1000, 0xc000402b60})
    crypto/tls/conn.go:1277 +0x16f
bufio.(*Reader).Read(0xc000396e40, {0xc0004c88f8, 0x9, 0xc0003fede0})
    bufio/bufio.go:227 +0x1b4
io.ReadAtLeast({0x1b707e0, 0xc000396e40}, {0xc0004c88f8, 0x9, 0x9}, 0x9)
    io/io.go:328 +0x9a
io.ReadFull(...)
    io/io.go:347
net/http.http2readFrameHeader({0xc0004c88f8, 0x9, 0xc000094db0}, {0x1b707e0, 0xc000396e40})
    net/http/h2_bundle.go:1555 +0x6e
net/http.(*http2Framer).ReadFrame(0xc0004c88c0)
    net/http/h2_bundle.go:1813 +0x95
net/http.(*http2clientConnReadLoop).run(0xc0004cdf98)
    net/http/h2_bundle.go:8608 +0x130
net/http.(*http2ClientConn).readLoop(0xc000492480)
    net/http/h2_bundle.go:8531 +0x6f
created by net/http.(*http2Transport).newClientConn
    net/http/h2_bundle.go:7325 +0xb85

NeckBeardPrince avatar Jan 25 '22 20:01 NeckBeardPrince

The current behavior is to fatal when Maxmind license is defined and the database cannot be downloaded. There are various reasons this is desirable, such as compliance requirements that certain countries be blocked, etc.

One way to work around a Maxmind outage is to run a Maxmind mirror or caching proxy for your organization. This is recommended as a best practice by Maxmind to minimize load on their infrastructure. It works well for us.

Another solution could be some way for you to declare that download failures are allowed in your systems, or allowed only when a file is already present. Maybe a new config option could be added to allow you to define this policy.

kd7lxl avatar Jan 26 '22 06:01 kd7lxl

The current behavior is to fatal when Maxmind license is defined and the database cannot be downloaded. There are various reasons this is desirable, such as compliance requirements that certain countries be blocked, etc.

I don't believe that. When the database could not be downloaded in 0.45.0, nginx would disable Maxmind without a fatal error.

The "The GeoIP2 feature is enabled but the databases are missing. Disabling" message alone does not support your claim. As I pointed out above, in 0.49.0 a new feature (https://github.com/kubernetes/ingress-nginx/pull/7242) to retry Maxmind downloads was added. That feature would output a message saying it was retrying the download: maxmind.go:114] "download failed on attempt 1". In the error I posted above, that doesn't even show up.

My post https://github.com/kubernetes/ingress-nginx/issues/8059#issuecomment-1021573464 does not show a controlled exit of the process, but a full crash and stacktrace. Why would a stacktrace be the desired behavior?

NeckBeardPrince avatar Jan 26 '22 09:01 NeckBeardPrince

The current behavior is to fatal when Maxmind license is defined and the database cannot be downloaded. There are various reasons this is desirable, such as compliance requirements that certain countries be blocked, etc.

I don't believe that. When the database could not be downloaded in 0.45.0, nginx would disable Maxmind without a fatal error.

The "The GeoIP2 feature is enabled but the databases are missing. Disabling" message alone does not support your claim. As I pointed out above, in 0.49.0 a new feature (#7242) to retry Maxmind downloads was added. That feature would output a message saying it was retrying the download: maxmind.go:114] "download failed on attempt 1". In the error I posted above, that doesn't even show up.

There was no retry attempt in your test case because 401 Unauthorized is an unrecoverable error. 4xx errors mean the client has done something wrong and needs to fix it. In this case, the license key is wrong. No amount of retrying will fix that, so it doesn't retry this case. This is best practice when writing API clients. The desired and implemented behavior for this case in ingress-nginx versions 1.x is to print an error and exit. Yes, this behavior changed as part of the v0.x to v1.x release.
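
For illustration, the retry pattern described here, where transient failures are retried but 4xx client errors abort immediately, could look roughly like this generic sketch. It is not the controller's actual maxmind.go code; the function names and parameters are made up.

// Generic sketch of "retry only recoverable errors" (not ingress-nginx code).
package main

import (
	"fmt"
	"net/http"
	"time"
)

// download performs a single attempt and reports the HTTP status it received.
func download(url string) (int, error) {
	resp, err := http.Get(url)
	if err != nil {
		return 0, err // network error: worth retrying
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return resp.StatusCode, fmt.Errorf("HTTP status %s", resp.Status)
	}
	return resp.StatusCode, nil
}

// downloadWithRetries retries transient failures but gives up immediately on
// 4xx responses such as 401 Unauthorized, since retrying cannot fix a bad
// license key.
func downloadWithRetries(url string, retries int, delay time.Duration) error {
	var lastErr error
	for attempt := 1; attempt <= retries; attempt++ {
		status, err := download(url)
		if err == nil {
			return nil
		}
		lastErr = err
		if status >= 400 && status < 500 {
			return err // client error: unrecoverable, do not retry
		}
		fmt.Printf("download failed on attempt %d: %v\n", attempt, err)
		time.Sleep(delay)
	}
	return lastErr
}

func main() {
	err := downloadWithRetries("https://download.maxmind.com/app/geoip_download", 3, 2*time.Second)
	if err != nil {
		fmt.Println("giving up:", err)
	}
}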

My post #8059 (comment) does not show a controlled exit of the process, but a full crash and stacktrace. Why would a stacktrace be the desired behavior?

It does indeed make a controlled call to Fatal(). Your stacktrace indicates this happened on line 67 of main.go, which is easy to read:

https://github.com/kubernetes/ingress-nginx/blob/2aa34202c1ae42fc689cc6980817aedc80b75229/cmd/nginx/main.go#L66-L67

Stacktraces are desirable so that you can trace the problem back in the code. Sometimes reading the relevant line of code can help you understand how to change your config to a valid one.
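
As a minimal, standalone illustration of why a "controlled" exit still prints goroutine stacks (this is not ingress-nginx code): klog's Fatal logs the message at fatal severity, dumps the stacks of all running goroutines, and then exits, which is exactly the shape of the crash log above.

// Tiny demo of klog.Fatal producing a message plus a goroutine stack dump.
package main

import (
	"errors"

	"k8s.io/klog/v2"
)

func main() {
	err := errors.New("HTTP status 401 Unauthorized")
	// Fatal logs at severity F, writes the stacks of all goroutines,
	// and then exits the process.
	klog.Fatal(err)
}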

What did you think about my idea to allow configuring a Maxmind DB failure policy?

kd7lxl avatar Jan 26 '22 16:01 kd7lxl

"Another solution could be some way for you to declare that download failures are allowed in your systems, or allowed only when a file is already present. Maybe a new config option could be added to allow you to define this policy."

For me the best solution would be this one.

jsalatiel avatar Jan 27 '22 11:01 jsalatiel

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Apr 27 '22 12:04 k8s-triage-robot

/remove-lifecycle stale

NeckBeardPrince avatar Apr 27 '22 13:04 NeckBeardPrince

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jul 26 '22 13:07 k8s-triage-robot

/remove-lifecycle stale

jsalatiel avatar Jul 27 '22 00:07 jsalatiel

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Oct 25 '22 00:10 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Nov 24 '22 01:11 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Dec 24 '22 01:12 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Dec 24 '22 01:12 k8s-ci-robot