nginx-gateway-fabric
NGINX Data Plane intermittently reports "no live upstreams" despite pods being healthy
Description: We are experiencing intermittent errors in NGINX Data Plane logs, reporting no live upstreams while connecting to upstream, even though all pods are healthy and running.
Details:
- The error appears suddenly for all connections routed through NGINX.
- Pods are not new; the most recent pod has been running for 12 hours.
- CPU and memory usage of both the pods and NGINX are well below requests/limits.
- Port-forwarding to both the service and individual pods works fine, indicating the pods are indeed reachable.
- No recent deployments or configuration changes occurred.
The issue is transient but affects all traffic to the NGINX service when it occurs.
Example log (IP addresses removed for privacy):
[error] no live upstreams while connecting to upstream, client: <removed>, server: ~^, request: "GET /apis/primetime/api/v1/ads/adrequest/criteria/...", upstream: "http://prd-apps_primetime_4000/api/v1/ads/adrequest/criteria/..."
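To narrow down which upstream groups are affected and how often, error lines like the one above can be tallied per upstream. This is a minimal sketch that assumes the log lines follow the same shape as the example; the function name `count_no_live_upstreams` and the regular expression are illustrative and not part of NGINX Gateway Fabric.

```python
import re
from collections import Counter

# Captures the upstream group name (the host part of the upstream URL)
# from a "no live upstreams" error line shaped like the example above.
PATTERN = re.compile(
    r'no live upstreams while connecting to upstream'
    r'.*?upstream: "\w+://(?P<group>[^/"]+)'
)


def count_no_live_upstreams(lines):
    """Return a Counter mapping upstream group name to error count."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[match.group("group")] += 1
    return counts


# The example line from this report, used as sample input.
sample = [
    '[error] no live upstreams while connecting to upstream, '
    'client: <removed>, server: ~^, '
    'request: "GET /apis/primetime/api/v1/ads/adrequest/criteria/...", '
    'upstream: "http://prd-apps_primetime_4000/api/v1/ads/adrequest/criteria/..."',
    '[notice] some unrelated line',
]

if __name__ == "__main__":
    for group, n in count_no_live_upstreams(sample).items():
        print(group, n)
```

Feeding it the data plane pod logs would show whether the error hits a single upstream group or every group at once.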
Observed behavior:
NGINX behaves as if no pods are available, even when they are healthy.
The problem either resolves on its own or disappears after restarting the NGINX Data Plane Deployment.
Expected behavior:
NGINX should consistently route requests to the available pods without reporting no live upstreams.
Questions / Investigation:
Could this be related to NGINX upstream health checks?
Could there be an ephemeral network or connection tracking issue?
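For context on the first question: the NGINX upstream module applies passive failure accounting through the `max_fails` and `fail_timeout` server parameters (defaults: 1 and 10 seconds). If every server in a group reaches its failure threshold within that window, NGINX has no peer left to pick and logs "no live upstreams". The sketch below is a simplified model of that accounting, not NGINX's actual implementation; the `Peer` class and `live_peers` function are hypothetical names, and whether this mechanism is what triggers the error in our setup is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Peer:
    """Simplified model of one upstream server's passive failure state."""
    name: str
    max_fails: int = 1          # NGINX default for max_fails
    fail_timeout: float = 10.0  # NGINX default for fail_timeout, in seconds
    fails: int = 0
    last_failure: float = 0.0

    def record_failure(self, now: float) -> None:
        # Count a failed attempt and remember when it happened.
        self.fails += 1
        self.last_failure = now

    def available(self, now: float) -> bool:
        # Below the threshold the peer is always eligible.
        if self.fails < self.max_fails:
            return True
        # At or above the threshold, the peer becomes eligible again
        # only once fail_timeout has elapsed since the last failure.
        return now - self.last_failure > self.fail_timeout


def live_peers(peers, now):
    """Names of peers that could currently be selected."""
    return [p.name for p in peers if p.available(now)]


if __name__ == "__main__":
    pods = [Peer("pod-a"), Peer("pod-b")]
    # A brief network blip at t=0 makes one request to each pod fail.
    for pod in pods:
        pod.record_failure(now=0.0)
    print(live_peers(pods, now=1.0))   # inside the fail_timeout window
    print(live_peers(pods, now=11.0))  # after the window has elapsed
```

In this model a single short-lived connectivity problem is enough to mark every pod unavailable for the length of `fail_timeout`, which would match the symptom of all traffic failing at once and then recovering without intervention.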