
Errors from NGINX after unrelated application deployment

Open ryan-dyer-sp opened this issue 2 months ago • 11 comments

Describe the bug First off, I've been using the NGINX ingress controller for years and have never seen anything like this. We've been using NGF for a few months now and have started seeing very odd behavior: one of our apps gets deployed and, coincidentally, NGF starts throwing errors that never go away until the NGF controller pod is restarted. However, once that happens, NGF sometimes seems to redeploy all the gateway pods, which results in a very brief outage for all of our services, even though we run multiple pods per gateway.

To Reproduce Steps to reproduce the behavior: I have not been able to come up with a reproduction scenario. We deploy applications multiple times a day without issue, but on occasion things break.

Expected behavior Application deployments don't put NGF into a bad state.

Your environment

  • Version of the NGINX Gateway Fabric: 2.1.4 (2d999117b228356b432ccb1e07bd0ceca7b9f2ef)

  • Version of Kubernetes: 1.32.9-eks

  • Kubernetes platform (e.g. Minikube or GCP): EKS

  • Details on how you expose the NGINX Gateway Fabric Pod (e.g. Service of type LoadBalancer or port-forward): The gateways are exposed as AWS NLBs (LoadBalancer)

  • Logs: (screenshot attached) The "error waiting for connection" errors tend to occur every 30-40s once the configuration error is seen. LMK if you need full logs from around the time it occurs.

ryan-dyer-sp avatar Oct 13 '25 21:10 ryan-dyer-sp

Hi @ryan-dyer-sp! Welcome to the project! 🎉

Thanks for opening this issue! Be sure to check out our Contributing Guidelines and the Issue Lifecycle while you wait for someone on the team to take a look at this.

nginx-bot[bot] avatar Oct 13 '25 21:10 nginx-bot[bot]

Hey @ryan-dyer-sp, we would greatly appreciate the full logs from the time it occurs, to help debug this issue faster. Meanwhile, we will try to reproduce this scenario and keep you updated.
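If it helps, something along these lines should capture the control plane logs around the incident. This is only a sketch: the namespace and label selector are assumptions based on a default install, so adjust them to match yours.

# Grab recent control plane logs (namespace/label are assumptions; adjust to your install)
kubectl logs -n nginx-gateway -l app.kubernetes.io/name=nginx-gateway \
  --all-containers --timestamps --since=1h --tail=-1 > ngf-controller.log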

salonichf5 avatar Oct 13 '25 21:10 salonichf5

Not a lot here, but I hope it helps:

: "37 lines displayed"
Total bytes processed: "65.9  kB"
Common labels: {"app_kubernetes_io_name":"nginx-gateway","cluster":"prod","container":"nginx-gateway","instance":"nginx-gateway-nginx-gateway","job":"integrations/nginx","k8s_cluster_name":"prod","namespace":"nginx-gateway","service_name":"nginx-gateway"}


1760386839222	2025-10-13T20:20:39.222Z	{"level":"info","ts":"2025-10-13T20:20:39Z","logger":"provisioner","msg":"Creating/Updating nginx resources","namespace":"nginx-gateway","name":"external-domain-nginx"}
1760386839214	2025-10-13T20:20:39.214Z	{"level":"info","ts":"2025-10-13T20:20:39Z","logger":"provisioner","msg":"Creating/Updating nginx resources","namespace":"nginx-gateway","name":"external-nginx"}
1760386839209	2025-10-13T20:20:39.209Z	{"level":"info","ts":"2025-10-13T20:20:39Z","logger":"provisioner","msg":"Creating/Updating nginx resources","namespace":"nginx-gateway","name":"external-http01-certs-nginx"}
1760386824859	2025-10-13T20:20:24.859Z	{"level":"error","ts":"2025-10-13T20:20:24Z","logger":"nginxUpdater.commandService","msg":"error waiting for connection","error":"timed out waiting for agent to register nginx","stacktrace":"github.com/nginx/nginx-gateway-fabric/v2/internal/controller/nginx/agent.(*commandService).Subscribe\n\t/home/runner/work/nginx-gateway-fabric/nginx-gateway-fabric/internal/controller/nginx/agent/command.go:138\ngithub.com/nginx/agent/v3/api/grpc/mpi/v1._CommandService_Subscribe_Handler\n\tpkg/mod/github.com/nginx/agent/[email protected]/api/grpc/mpi/v1/command_grpc.pb.go:233\ngithub.com/nginx/nginx-gateway-fabric/v2/internal/controller/nginx/agent/grpc/interceptor.(*ContextSetter).Stream.ContextSetter.Stream.func1\n\t/home/runner/work/nginx-gateway-fabric/nginx-gateway-fabric/internal/controller/nginx/agent/grpc/interceptor/interceptor.go:65\ngoogle.golang.org/grpc.(*Server).processStreamingRPC\n\tpkg/mod/google.golang.org/[email protected]/server.go:1728\ngoogle.golang.org/grpc.(*Server).handleStream\n\tpkg/mod/google.golang.org/[email protected]/server.go:1845\ngoogle.golang.org/grpc.(*Server).serveStreams.func2.1\n\tpkg/mod/google.golang.org/[email protected]/server.go:1061"}
1760386824187	2025-10-13T20:20:24.187Z	{"level":"info","ts":"2025-10-13T20:20:24Z","logger":"provisioner","msg":"Creating/Updating nginx resources","namespace":"nginx-gateway","name":"internal-domain-nginx"}
1760386824186	2025-10-13T20:20:24.186Z	{"level":"info","ts":"2025-10-13T20:20:24Z","logger":"provisioner","msg":"Creating/Updating nginx resources","namespace":"nginx-gateway","name":"internal-nginx"}
1760386824186	2025-10-13T20:20:24.186Z	{"level":"info","ts":"2025-10-13T20:20:24Z","logger":"provisioner","msg":"Creating/Updating nginx resources","namespace":"nginx-gateway","name":"external-domain-nginx"}
1760386809159	2025-10-13T20:20:09.159Z	{"level":"info","ts":"2025-10-13T20:20:09Z","logger":"provisioner","msg":"Creating/Updating nginx resources","namespace":"nginx-gateway","name":"external-nginx"}
1760386809159	2025-10-13T20:20:09.159Z	{"level":"info","ts":"2025-10-13T20:20:09Z","logger":"provisioner","msg":"Creating/Updating nginx resources","namespace":"nginx-gateway","name":"internal-domain-nginx"}
1760386809155	2025-10-13T20:20:09.155Z	{"level":"info","ts":"2025-10-13T20:20:09Z","logger":"provisioner","msg":"Creating/Updating nginx resources","namespace":"nginx-gateway","name":"internal-nginx"}
1760386794131	2025-10-13T20:19:54.131Z	{"level":"info","ts":"2025-10-13T20:19:54Z","logger":"provisioner","msg":"Creating/Updating nginx resources","namespace":"nginx-gateway","name":"external-domain-nginx"}
1760386794024	2025-10-13T20:19:54.024Z	{"level":"error","ts":"2025-10-13T20:19:54Z","logger":"nginxUpdater.commandService","msg":"error setting initial configuration","error":"rpc error: code = Canceled desc = context canceled","stacktrace":"github.com/nginx/nginx-gateway-fabric/v2/internal/controller/nginx/agent.(*commandService).setInitialConfig\n\t/home/runner/work/nginx-gateway-fabric/nginx-gateway-fabric/internal/controller/nginx/agent/command.go:307\ngithub.com/nginx/nginx-gateway-fabric/v2/internal/controller/nginx/agent.(*commandService).Subscribe\n\t/home/runner/work/nginx-gateway-fabric/nginx-gateway-fabric/internal/controller/nginx/agent/command.go:149\ngithub.com/nginx/agent/v3/api/grpc/mpi/v1._CommandService_Subscribe_Handler\n\tpkg/mod/github.com/nginx/agent/[email protected]/api/grpc/mpi/v1/command_grpc.pb.go:233\ngithub.com/nginx/nginx-gateway-fabric/v2/internal/controller/nginx/agent/grpc/interceptor.(*ContextSetter).Stream.ContextSetter.Stream.func1\n\t/home/runner/work/nginx-gateway-fabric/nginx-gateway-fabric/internal/controller/nginx/agent/grpc/interceptor/interceptor.go:65\ngoogle.golang.org/grpc.(*Server).processStreamingRPC\n\tpkg/mod/google.golang.org/[email protected]/server.go:1728\ngoogle.golang.org/grpc.(*Server).handleStream\n\tpkg/mod/google.golang.org/[email protected]/server.go:1845\ngoogle.golang.org/grpc.(*Server).serveStreams.func2.1\n\tpkg/mod/google.golang.org/[email protected]/server.go:1061"}
1760386794022	2025-10-13T20:19:54.022Z	{"level":"info","ts":"2025-10-13T20:19:54Z","logger":"nginxUpdater.commandService","msg":"Sending initial configuration to agent","pod":"internal-nginx-5c89557bbf-rpjfx","configVersion":"uT/cyUjEdzvC6XVdaLh5phVAOSM2xESKOEsWwedFtng="}
1760386794014	2025-10-13T20:19:54.014Z	{"level":"info","ts":"2025-10-13T20:19:54Z","logger":"nginxUpdater.commandService","msg":"Successfully connected to nginx agent internal-nginx-5c89557bbf-rpjfx"}
1760386792998	2025-10-13T20:19:52.998Z	{"level":"info","ts":"2025-10-13T20:19:52Z","logger":"nginxUpdater.commandService","msg":"Creating connection for nginx pod: internal-nginx-5c89557bbf-rpjfx"}
1760386779093	2025-10-13T20:19:39.093Z	{"level":"info","ts":"2025-10-13T20:19:39Z","logger":"provisioner","msg":"Creating/Updating nginx resources","namespace":"nginx-gateway","name":"external-http01-certs-nginx"}
1760386779089	2025-10-13T20:19:39.089Z	{"level":"info","ts":"2025-10-13T20:19:39Z","logger":"provisioner","msg":"Creating/Updating nginx resources","namespace":"nginx-gateway","name":"external-nginx"}
1760386779088	2025-10-13T20:19:39.088Z	{"level":"info","ts":"2025-10-13T20:19:39Z","logger":"provisioner","msg":"Creating/Updating nginx resources","namespace":"nginx-gateway","name":"external-domain-nginx"}
1760386779081	2025-10-13T20:19:39.081Z	{"level":"info","ts":"2025-10-13T20:19:39Z","logger":"provisioner","msg":"Creating/Updating nginx resources","namespace":"nginx-gateway","name":"internal-nginx"}
1760386764060	2025-10-13T20:19:24.060Z	{"level":"info","ts":"2025-10-13T20:19:24Z","logger":"provisioner","msg":"Creating/Updating nginx resources","namespace":"nginx-gateway","name":"external-nginx"}
1760386749021	2025-10-13T20:19:09.021Z	{"level":"info","ts":"2025-10-13T20:19:09Z","logger":"provisioner","msg":"Creating/Updating nginx resources","namespace":"nginx-gateway","name":"external-nginx"}
1760386733983	2025-10-13T20:18:53.983Z	{"level":"info","ts":"2025-10-13T20:18:53Z","logger":"provisioner","msg":"Creating/Updating nginx resources","namespace":"nginx-gateway","name":"external-nginx"}
1760386733983	2025-10-13T20:18:53.983Z	{"level":"info","ts":"2025-10-13T20:18:53Z","logger":"provisioner","msg":"Creating/Updating nginx resources","namespace":"nginx-gateway","name":"external-http01-certs-nginx"}
1760386733975	2025-10-13T20:18:53.975Z	{"level":"info","ts":"2025-10-13T20:18:53Z","logger":"provisioner","msg":"Creating/Updating nginx resources","namespace":"nginx-gateway","name":"external-domain-nginx"}
1760386733975	2025-10-13T20:18:53.975Z	{"level":"info","ts":"2025-10-13T20:18:53Z","logger":"provisioner","msg":"Creating/Updating nginx resources","namespace":"nginx-gateway","name":"internal-domain-nginx"}
1760386718908	2025-10-13T20:18:38.908Z	{"level":"info","ts":"2025-10-13T20:18:38Z","logger":"provisioner","msg":"Creating/Updating nginx resources","namespace":"nginx-gateway","name":"external-nginx"}
1760386703843	2025-10-13T20:18:23.843Z	{"level":"info","ts":"2025-10-13T20:18:23Z","logger":"provisioner","msg":"Creating/Updating nginx resources","namespace":"nginx-gateway","name":"external-domain-nginx"}
1760386703833	2025-10-13T20:18:23.833Z	{"level":"info","ts":"2025-10-13T20:18:23Z","logger":"provisioner","msg":"Creating/Updating nginx resources","namespace":"nginx-gateway","name":"external-http01-certs-nginx"}
1760386703820	2025-10-13T20:18:23.820Z	{"level":"info","ts":"2025-10-13T20:18:23Z","logger":"provisioner","msg":"Creating/Updating nginx resources","namespace":"nginx-gateway","name":"internal-domain-nginx"}
1760386703815	2025-10-13T20:18:23.815Z	{"level":"info","ts":"2025-10-13T20:18:23Z","logger":"provisioner","msg":"Creating/Updating nginx resources","namespace":"nginx-gateway","name":"external-nginx"}
1760386688780	2025-10-13T20:18:08.780Z	{"level":"info","ts":"2025-10-13T20:18:08Z","logger":"provisioner","msg":"Creating/Updating nginx resources","namespace":"nginx-gateway","name":"external-domain-nginx"}
1760386688758	2025-10-13T20:18:08.758Z	{"level":"info","ts":"2025-10-13T20:18:08Z","logger":"provisioner","msg":"Creating/Updating nginx resources","namespace":"nginx-gateway","name":"external-http01-certs-nginx"}
1760386688757	2025-10-13T20:18:08.757Z	{"level":"info","ts":"2025-10-13T20:18:08Z","logger":"provisioner","msg":"Creating/Updating nginx resources","namespace":"nginx-gateway","name":"internal-nginx"}
1760386688741	2025-10-13T20:18:08.741Z	{"level":"info","ts":"2025-10-13T20:18:08Z","logger":"provisioner","msg":"Creating/Updating nginx resources","namespace":"nginx-gateway","name":"external-nginx"}
1760386673684	2025-10-13T20:17:53.684Z	{"level":"info","ts":"2025-10-13T20:17:53Z","logger":"provisioner","msg":"Creating/Updating nginx resources","namespace":"nginx-gateway","name":"external-http01-certs-nginx"}
1760386673681	2025-10-13T20:17:53.681Z	{"level":"info","ts":"2025-10-13T20:17:53Z","logger":"provisioner","msg":"Creating/Updating nginx resources","namespace":"nginx-gateway","name":"internal-nginx"}
1760386673678	2025-10-13T20:17:53.678Z	{"level":"info","ts":"2025-10-13T20:17:53Z","logger":"provisioner","msg":"Creating/Updating nginx resources","namespace":"nginx-gateway","name":"external-nginx"}

ryan-dyer-sp avatar Oct 13 '25 21:10 ryan-dyer-sp

Could someone please enlighten me as to why a restart of a controller pod (even when running more than one) results in all proxy/data pods getting cycled? This seems fragile.

ryan-dyer-sp avatar Oct 13 '25 21:10 ryan-dyer-sp

Would it be possible to share some of the agent logs from when you encounter this issue? It’s quite difficult for us to diagnose the problem based solely on the current logs.

You can enable agent-level logging while running the deployments — having those logs will give us much better visibility into what’s happening during the connection to the agent and should help us pinpoint the root cause.

Here’s how you can enable it:

k edit nginxproxies.gateway.nginx.org -n nginx-gateway nginx-gateway-proxy-config

spec:
  service:
    externalTrafficPolicy: Local
    type: ClusterIP
  logging:
    agentLevel: debug  # agent logs
    errorLevel: debug  # nginx logs
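Once that change is applied, the agent logs show up alongside the nginx logs in the data plane pods. A rough sketch for tailing them (the namespace, deployment name, and container name here are assumptions based on your setup, so substitute your own):

# Tail agent + nginx logs from one of the data plane deployments after enabling debug
kubectl logs -n nginx-gateway deploy/internal-nginx -c nginx --timestamps --since=30m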

@ryan-dyer-sp

Could someone please enlighten me as to why a restart of a controller pod (even when running more than one) results in all proxy/data pods getting cycled? This seems fragile.

I believe this behavior is intentional — it’s designed to ensure that the control plane and data plane remain in sync when a controller restarts. This reconciliation step was introduced as part of the architecture changes we made in NGF to keep the system state consistent.

That said, we understand that this can feel fragile, and we’re actively discussing how to address it and what priority it should have. Hopefully, we’ll have more clarity and updates to share soon.

In the meantime, if you’re able to provide some agent logs from when this happens, that would be extremely helpful for our investigation.

salonichf5 avatar Oct 15 '25 16:10 salonichf5

@salonichf5 Thanks for the explanation of why everything gets restarted. I have enabled the debug logs and will continue to monitor our clusters for when this happens again. This does not happen often, so it may take a week or more before we encounter the issue again, but I will update the issue with logs once we do.

Thanks!

ryan-dyer-sp avatar Oct 15 '25 18:10 ryan-dyer-sp

In general, a restart of the controller should not result in a restart of the data plane pods. This should only occur if the controller has patched the data plane Deployment with a change that causes it to be restarted (a change in the Deployment spec). So something must be happening where the controller restarts, the new controller ends up with a different Deployment config for the data plane than the previous controller had, and it submits the patch.

This most commonly occurs during a version upgrade where the new controller has a new version of the data plane Deployment.
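If it helps to verify, one rough way to check whether the data plane Deployment spec actually changed across a controller restart is to look at its generation and rollout history (the deployment name and namespace below are just examples, use one of yours):

# Each spec change bumps metadata.generation and adds a rollout revision
kubectl -n nginx-gateway get deploy external-nginx -o jsonpath='{.metadata.generation}{"\n"}'
kubectl -n nginx-gateway rollout history deploy/external-nginx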

sjberman avatar Oct 15 '25 19:10 sjberman

@salonichf5 We hit the issue again this morning, but the nginx-gateway logs don't have any additional information beyond what I originally posted (no debug-type logs). Can you confirm this is the correct configuration? This is in the nginxproxy/nginx-gateway-proxy-config:

spec:
  ipFamily: dual
  kubernetes:
    deployment:
      container:
        image:
          pullPolicy: IfNotPresent
          repository: ghcr.io/nginx/nginx-gateway-fabric/nginx
          tag: 2.1.4
      replicas: 1
    service:
      externalTrafficPolicy: Local
      type: LoadBalancer
  logging:
    agentLevel: debug
    errorLevel: debug

The errors are still coming, so if there is some sort of logging I can enable at this point, perhaps I can capture whatever logs would be useful.

@sjberman This seems to happen fairly regularly for us. This morning, while looking at the config above, I noticed I had manually set the gateway replicas to 2 (k scale --replicas=2 deploy/nginx-gateway), but the proxy config was defined for 1, so I updated that value alone. Shortly thereafter, 2 of the 5 gateways we run in the cluster had their pods restarted. The nginx-gateway controller pods didn't restart, but half of our data planes did. We seem to encounter this scenario a lot, where our data planes get restarted for no obvious reason. Should I open a different issue to try to track this down further, or is it expected to be this common?

Thanks!

ryan-dyer-sp avatar Oct 17 '25 13:10 ryan-dyer-sp

Also found the following on one of our gateways

internal-domain-nginx-966784758-smx7t nginx 2025/10/17 13:30:14 [info] 70955#70955: *224213 client closed connection while SSL handshaking, client: 10.3.166.232, server: 0.0.0.0:443
internal-domain-nginx-966784758-smx7t nginx time=2025-10-17T13:30:15.545Z level=ERROR msg="Failed to receive message from subscribe stream" error="rpc error: code = Unknown desc = timed out waiting for agent to register nginx" correlation_id=9c8d2cdf-ab59-11f0-bcc4-cae952481999 server_type=command
internal-domain-nginx-966784758-smx7t nginx 2025/10/17 13:30:17 [info] 70955#70955: *224214 client closed connection while waiting for request, client: 10.3.166.232, server: 0.0.0.0:80
internal-domain-nginx-966784758-smx7t nginx 2025/10/17 13:30:17 [info] 70955#70955: *224215 client closed connection while SSL handshaking, client: 10.3.166.232, server: 0.0.0.0:443

The IP in question is from the NLB sitting in front.

ryan-dyer-sp avatar Oct 17 '25 13:10 ryan-dyer-sp

I think I misunderstood where the debug logs were getting enabled. In the pod posted above, I see these errors right before the controller starts emitting its errors:

time=2025-10-17T13:03:02.668Z level=ERROR msg="Failed to receive message from subscribe stream" error="rpc error: code = ResourceExhausted desc = grpc: received message larger than max (1510614070 vs. 4194304)" correlation_id=9c8d2cdf-ab59-11f0-bcc4-cae952481999 server_type=command

Also, perhaps interestingly, this pod does not appear to have any debug logs, while other gateways did seem to get debug logging enabled.

ryan-dyer-sp avatar Oct 17 '25 15:10 ryan-dyer-sp

@ryan-dyer-sp Sorry you're encountering these issues, and so sorry for the delay in replying!

This morning, while looking at the config above, I noticed I had manually set the gateway replicas to 2 (k scale --replicas=2 deploy/nginx-gateway), but the proxy config was defined for 1.

As you have correctly identified, the control plane and data planes scale independently from one another. Once you modify the NginxProxy, the control planes should not restart, but the data plane deployments will update to reflect the new configuration.

time=2025-10-17T13:03:02.668Z level=ERROR msg="Failed to receive message from subscribe stream" error="rpc error: code = ResourceExhausted desc = grpc: received message larger than max (1510614070 vs. 4194304)" correlation_id=9c8d2cdf-ab59-11f0-bcc4-cae952481999 server_type=command

This is very helpful, thank you. I'm looking into it in more detail now.

ciarams87 avatar Oct 24 '25 14:10 ciarams87

I believe we have finally reproduced the issue again. We have the same error messages in the gateway controller, though in this case I do not see any deployment that coincides with the timing, nor anything like a node scale-up/down action. Just errors in the logs all of a sudden.

{"level":"error","ts":"2025-11-18T13:02:20Z","logger":"nginxUpdater.commandService","msg":"error waiting for connection","error":"timed out waiting for agent to register nginx","stacktrace":"github.com/nginx/nginx-gateway-fabric/v2/internal/controller/nginx/agent.(*commandService).Subscribe\n\t/opt/actions-runner/_work/nginx-gateway-fabric/nginx-gateway-fabric/internal/controller/nginx/agent/command.go:138\ngithub.com/nginx/agent/v3/api/grpc/mpi/v1._CommandService_Subscribe_Handler\n\t/opt/actions-runner/_work/nginx-gateway-fabric/nginx-gateway-fabric/.gocache/github.com/nginx/agent/[email protected]/api/grpc/mpi/v1/command_grpc.pb.go:233\ngithub.com/nginx/nginx-gateway-fabric/v2/internal/controller/nginx/agent/grpc/interceptor.(*ContextSetter).Stream.ContextSetter.Stream.func1\n\t/opt/actions-runner/_work/nginx-gateway-fabric/nginx-gateway-fabric/internal/controller/nginx/agent/grpc/interceptor/interceptor.go:65\ngoogle.golang.org/grpc.(*Server).processStreamingRPC\n\t/opt/actions-runner/_work/nginx-gateway-fabric/nginx-gateway-fabric/.gocache/google.golang.org/[email protected]/server.go:1729\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/opt/actions-runner/_work/nginx-gateway-fabric/nginx-gateway-fabric/.gocache/google.golang.org/[email protected]/server.go:1846\ngoogle.golang.org/grpc.(*Server).serveStreams.func2.1\n\t/opt/actions-runner/_work/nginx-gateway-fabric/nginx-gateway-fabric/.gocache/google.golang.org/[email protected]/server.go:1061"}

Also, when I restarted this gateway pod (which got rid of the error message), every other nginx gateway pod restarted. This seems odd, given that nothing changed which I think should trigger this, based on the info provided.

These are the logs (first 1k) from around the time when the issue started happening. https://gist.github.com/ryan-dyer-sp/b792a5d887255b4d1b4bcd105f406d38

LMK if you need more or better filtering (we keep the logs in Grafana, so I can hopefully prune them down to just what is useful).

ryan-dyer-sp avatar Nov 18 '25 13:11 ryan-dyer-sp

Also, when I restarted this gateway pod (which got rid of the error message), every other nginx gateway pod restarted.

I'm assuming you restarted the control plane, and then when the new one came up, the nginx data plane pods all restarted. This is definitely an issue we need to root cause (maybe a separate ticket makes more sense for this).

Regarding the logs, do you have the logs from the nginx pod at the same time as the log you provided? The gist contains logs up to 13:00, but the error seems to have occurred at 13:02. I'm hoping to cross-reference the logs at the same time between the two deployments. Having more logs from the control plane around that time would be helpful as well.
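For cross-referencing, something like the following should line the two up. This is only a sketch: swap in the actual pod names and adjust the window to cover roughly 12:55-13:10 UTC on Nov 18.

# Pull both sets of logs for the same window so timestamps can be compared directly
kubectl -n nginx-gateway logs <control-plane-pod> --timestamps \
  --since-time=2025-11-18T12:55:00Z > controller.log
kubectl -n nginx-gateway logs <nginx-data-plane-pod> -c nginx --timestamps \
  --since-time=2025-11-18T12:55:00Z > nginx.log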

sjberman avatar Nov 18 '25 20:11 sjberman