504 - upstream connect error or disconnect/reset before headers. reset reason: connection timeout
Using Envoy Gateway 1.0.1 on our development cluster. All HTTPRoutes work except one route for a React frontend. The client gets a 504 error with the body `upstream connect error or disconnect/reset before headers. reset reason: connection timeout`, which does not look like an error our app would output.
What we have already tested:
- deployment is running
- k8s port-forward of one of the containers -> works
- k8s port-forward of the service -> works (commands sketched below)
- testing with different ports -> did not help
- testing the HTTPRoute pointing to a different service -> works
- checking NGINX logs -> we don't see any logs from the NGINX container when this error happens
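For reference, the port-forward checks were roughly the following (a sketch; names match the sanitized manifests below):

# forward straight to one pod of the deployment, bypassing the gateway
kubectl -n dev-stage port-forward deploy/webapp 8080:80
# in a second shell
curl -v http://localhost:8080/

# same check against the Service (forwards to one backing pod)
kubectl -n dev-stage port-forward svc/webapp 8081:80
curl -v http://localhost:8081/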
Here's our setup.
Please note I have replaced IPs, domains, and namespaces with something I can share publicly. If you find a naming error, it is likely a typo introduced while replacing our internal names.
curl
# curl -v webapp.mydomain.mytld
* Host webapp.mydomain.mytld:80 was resolved.
* IPv6: (none)
* IPv4: xxx.xxx.xxx.xxx, yyy.yyy.yyy.yyy
* Trying xxx.xxx.xxx.xxx:80...
* Connected to webapp.mydomain.mytld (xxx.xxx.xxx.xxx) port 80
> GET / HTTP/1.1
> Host: webapp.mydomain.mytld
> User-Agent: curl/8.7.1
> Accept: */*
>
* Request completely sent off
< HTTP/1.1 504 Gateway Timeout
< content-length: 24
< content-type: text/plain
< date: Mon, 26 Aug 2024 11:09:27 GMT
<
* Connection #0 to host webapp.mydomain.mytld left intact
upstream request timeout%
Error message from the Envoy Gateway logs
{
  "start_time": "2024-08-23T09:27:27.939Z",
  "method": "GET",
  "x-envoy-origin-path": "/",
  "protocol": "HTTP/2",
  "response_code": "503",
  "response_flags": "UF",
  "response_code_details": "upstream_reset_before_response_started{connection_timeout}",
  "connection_termination_details": "-",
  "upstream_transport_failure_reason": "-",
  "bytes_received": "0",
  "bytes_sent": "91",
  "duration": "9998",
  "x-envoy-upstream-service-time": "-",
  "x-forwarded-for": "111.222.333.444",
  "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
  "x-request-id": "ed5937ea-6a79-4639-98c7-a84f4471b94c",
  ":authority": "webapp.mydomain.mytld",
  "upstream_host": "1.2.3.4:80",
  "upstream_cluster": "httproute/dev-stage/webapp/rule/0",
  "upstream_local_address": "-",
  "downstream_local_address": "11.22.33.44:10443",
  "downstream_remote_address": "111.222.333.444:44107",
  "requested_server_name": "webapp.mydomain.mytld",
  "route_name": "httproute/dev-stage/webapp/rule/0/match/0/webapp_mydomain_mytld"
}
Webapp Manifests
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp
  namespace: dev-stage
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 2 # the old replica must be kept running until the new replica is fully operational
      maxSurge: 1 # 1 old and 1 new replica can be active at the same time during deployments
  selector:
    matchLabels:
      app: webapp
  template:
    metadata:
      labels:
        app: webapp
    spec:
      affinity:
        podAffinity:
          # prefer to schedule related pods on the same host
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 1
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values: ["api", "webapp", "swagger-ui"]
                topologyKey: kubernetes.io/hostname
        podAntiAffinity:
          # require that replicas are *not* scheduled on a *host* where this app is already running
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values: ["webapp"]
              topologyKey: kubernetes.io/hostname
      terminationGracePeriodSeconds: 10
      containers:
        - image: <myregistry>/<myimage>:<tag>
          name: webapp
          ports:
            - containerPort: 80
              name: webapp
---
kind: Service
apiVersion: v1
metadata:
  name: webapp
  namespace: dev-stage
spec:
  selector:
    app: webapp
  ports:
    # public
    - name: webapp
      protocol: TCP
      port: 80
      targetPort: 80
---
apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: webapp
  namespace: dev-stage
spec:
  parentRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: envoy-gw
      namespace: gwapi-system
  hostnames:
    - "webapp.mydomain.mytld"
  rules:
    - backendRefs:
        - name: webapp
          kind: Service
          namespace: dev-stage
          port: 80
          weight: 1
      matches:
        - path:
            type: PathPrefix
            value: /
Gateway Manifests
---
apiVersion: gateway.networking.k8s.io/v1beta1
kind: GatewayClass
metadata:
  name: envoy-gc
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
  parametersRef:
    group: gateway.envoyproxy.io
    kind: EnvoyProxy
    name: custom-proxy-config
    namespace: gwapi-system
---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyProxy
metadata:
  name: custom-proxy-config
  namespace: gwapi-system
spec:
  provider:
    type: Kubernetes
    kubernetes:
      envoyDeployment:
        replicas: 3
      envoyService:
        annotations:
          service.beta.kubernetes.io/aws-load-balancer-type: external
          service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: instance
          service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
---
apiVersion: gateway.networking.k8s.io/v1beta1
kind: Gateway
metadata:
  name: envoy-gw
  namespace: gwapi-system
spec:
  gatewayClassName: envoy-gc
  listeners:
    - allowedRoutes:
        namespaces:
          from: All
      hostname: '*.mydomain.mytld'
      name: http
      port: 80
      protocol: HTTP
    - allowedRoutes:
        namespaces:
          from: All
      hostname: '*.mydomain.mytld'
      name: https
      port: 443
      protocol: HTTPS
      tls:
        certificateRefs:
          - group: ""
            kind: Secret
            name: envoy-gw-tls-cert
        mode: Terminate
@amalic
- is the error consistently seen, or only sometimes after a duration?
- can you try `v1.1.0` instead, does anything change with that helm chart?
- the access logs show `"protocol": "HTTP/2"`, but you are not using `GRPCRoute` nor are you setting any `applicationProtocol` field on the Service, so it's weird that Envoy is trying to connect to the upstream over HTTP/2
> is the error consistently seen, or only sometimes after a duration?

yes

> can you try `v1.1.0` instead, does anything change with that helm chart?

not at the moment

> the access logs show `"protocol": "HTTP/2"`, but you are not using `GRPCRoute` nor are you setting any `applicationProtocol` field on the Service, so it's weird that Envoy is trying to connect to the upstream over HTTP/2

It's very strange.
The Dockerfile is based on an nginx:alpine image. I even tried increasing timeouts, forcing HTTP/1 through a ClientTrafficPolicy, and adding 5 retries on any 5xx error through a BackendTrafficPolicy. Still the same result. And as I already said, when I port-forward the service or pod I get the expected response.
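The retry policy was along these lines (a sketch; retry support and the `targetRef` vs. `targetRefs` naming differ between Envoy Gateway releases, so check your version's API reference):

apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: webapp-retries
  namespace: dev-stage
spec:
  targetRef:
    group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: webapp
  retry:
    numRetries: 5
    retryOn:
      # retry whenever the upstream answers with any 5xx status
      triggers:
        - "5xx"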
nginx default.conf
server {
    listen 80;
    server_name _;
    #...
}
@amalic the issue is that

kind: HTTPRoute
metadata:
  name: webapp

is in the default ns and your backend is in dev-stage, and there isn't any ReferenceGrant to allow linking route and backend. Can you either add a ReferenceGrant or move the route into the backend ns?
The status field on the resource should be surfacing this.
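For reference, a ReferenceGrant for that situation would look roughly like this (a sketch, assuming the route really lived in default):

apiVersion: gateway.networking.k8s.io/v1beta1
kind: ReferenceGrant
metadata:
  name: allow-webapp-routes
  # must live in the namespace that owns the referenced backend
  namespace: dev-stage
spec:
  from:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      namespace: default
  to:
    - group: ""
      kind: Service
      name: webapp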
@arkodg Thanks for pointing that out. I actually copied the manifest from the YAML file, which is applied with kubectl using the specific namespace. I double-checked that it is in the correct namespace on the cluster, and fixed it in the samples I provided.
@arkodg Thanks to your HTTP/2 comment I expanded my research and came across this on the Istio Traffic Management Problems page:
> Envoy requires HTTP/1.1 or HTTP/2 traffic for upstream services. For example, when using NGINX for serving traffic behind Envoy, you will need to set the proxy_http_version directive in your NGINX configuration to be "1.1", since the NGINX default is 1.0.
https://istio.io/latest/docs/ops/common-problems/network-issues/#envoy-wont-connect-to-my-http10-service
What do you think?
@arkodg When I run `nginx -T` in a shell within the container, I get the following output, which means the server is responding via HTTP/1.1. I can confirm this when doing a curl against the port-forwarded service and pods. I will try updating to the latest Envoy version to see if it fixes the error.
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful
# configuration file /etc/nginx/nginx.conf:
user nginx;
worker_processes auto;
error_log /var/log/nginx/error.log notice;
pid /var/run/nginx.pid;
events {
    worker_connections 1024;
}
http {
    include /etc/nginx/mime.types;
    default_type application/octet-stream;
    log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                    '$status $body_bytes_sent "$http_referer" '
                    '"$http_user_agent" "$http_x_forwarded_for"';
    access_log /var/log/nginx/access.log main;
    sendfile on;
    #tcp_nopush on;
    keepalive_timeout 65;
    #gzip on;
    include /etc/nginx/conf.d/*.conf;
}
# configuration file /etc/nginx/mime.types:
types {
    text/html html htm shtml;
    text/css css;
    text/xml xml;
    image/gif gif;
    image/jpeg jpeg jpg;
    application/javascript js;
    application/atom+xml atom;
    application/rss+xml rss;
    text/mathml mml;
    text/plain txt;
    text/vnd.sun.j2me.app-descriptor jad;
    text/vnd.wap.wml wml;
    text/x-component htc;
    image/avif avif;
    image/png png;
    image/svg+xml svg svgz;
    image/tiff tif tiff;
    image/vnd.wap.wbmp wbmp;
    image/webp webp;
    image/x-icon ico;
    image/x-jng jng;
    image/x-ms-bmp bmp;
    font/woff woff;
    font/woff2 woff2;
    application/java-archive jar war ear;
    application/json json;
    application/mac-binhex40 hqx;
    application/msword doc;
    application/pdf pdf;
    application/postscript ps eps ai;
    application/rtf rtf;
    application/vnd.apple.mpegurl m3u8;
    application/vnd.google-earth.kml+xml kml;
    application/vnd.google-earth.kmz kmz;
    application/vnd.ms-excel xls;
    application/vnd.ms-fontobject eot;
    application/vnd.ms-powerpoint ppt;
    application/vnd.oasis.opendocument.graphics odg;
    application/vnd.oasis.opendocument.presentation odp;
    application/vnd.oasis.opendocument.spreadsheet ods;
    application/vnd.oasis.opendocument.text odt;
    application/vnd.openxmlformats-officedocument.presentationml.presentation pptx;
    application/vnd.openxmlformats-officedocument.spreadsheetml.sheet xlsx;
    application/vnd.openxmlformats-officedocument.wordprocessingml.document docx;
    application/vnd.wap.wmlc wmlc;
    application/wasm wasm;
    application/x-7z-compressed 7z;
    application/x-cocoa cco;
    application/x-java-archive-diff jardiff;
    application/x-java-jnlp-file jnlp;
    application/x-makeself run;
    application/x-perl pl pm;
    application/x-pilot prc pdb;
    application/x-rar-compressed rar;
    application/x-redhat-package-manager rpm;
    application/x-sea sea;
    application/x-shockwave-flash swf;
    application/x-stuffit sit;
    application/x-tcl tcl tk;
    application/x-x509-ca-cert der pem crt;
    application/x-xpinstall xpi;
    application/xhtml+xml xhtml;
    application/xspf+xml xspf;
    application/zip zip;
    application/octet-stream bin exe dll;
    application/octet-stream deb;
    application/octet-stream dmg;
    application/octet-stream iso img;
    application/octet-stream msi msp msm;
    audio/midi mid midi kar;
    audio/mpeg mp3;
    audio/ogg ogg;
    audio/x-m4a m4a;
    audio/x-realaudio ra;
    video/3gpp 3gpp 3gp;
    video/mp2t ts;
    video/mp4 mp4;
    video/mpeg mpeg mpg;
    video/quicktime mov;
    video/webm webm;
    video/x-flv flv;
    video/x-m4v m4v;
    video/x-mng mng;
    video/x-ms-asf asx asf;
    video/x-ms-wmv wmv;
    video/x-msvideo avi;
}
# configuration file /etc/nginx/conf.d/default.conf:
server {
    listen 80;
    server_name _;
    location / {
        port_in_redirect off;
        alias /etc/nginx/html/;
        proxy_http_version 1.1;
        try_files $uri $uri/ //index.html;
        # don't cache anything by default
        add_header Cache-Control "no-store, no-cache, must-revalidate";
    }
    location //static {
        port_in_redirect off;
        alias /etc/nginx/html/static;
        proxy_http_version 1.1;
        expires 1y;
        # cache create-react-app generated files because they all have a hash in the name and are therefore automatically invalidated after a change
        add_header Cache-Control "public";
    }
}
Strangest thing: I did another nginx test deployment, and I accidentally got a response when trying another reload. I found out that reloading multiple times eventually leads to a successful response. Thanks to the nginxdemos/hello image I could see that the successful response was always coming from the same container. After scaling the deployment up and down, I found that the container delivering a successful response was always running on the same node.
After adding a nodeAffinity to the deployment template spec, I was able to get a response from all replicas.
Update: The nginx container is not reachable anymore. When I deploy it on all nodes, it now sometimes works on some other random node.
Here's the deployment I used:
---
apiVersion: v1
kind: Namespace
metadata:
  name: nginx
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  namespace: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: mylabel
                    operator: In
                    values:
                      - myvalue
      containers:
        - name: nginx
          image: nginxdemos/hello:latest
          ports:
            - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: nginx-service
  namespace: nginx
spec:
  selector:
    app: nginx
  ports:
    - name: http
      port: 80
      targetPort: 80
  type: ClusterIP
---
apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: nginx-test
  namespace: nginx
spec:
  parentRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: envoy-gw
      namespace: gwapi-system
  hostnames:
    - "ngx-test.mydomain.mytld"
  rules:
    - backendRefs:
        - name: nginx-service
          kind: Service
          namespace: nginx
          port: 80
          weight: 1
      timeouts:
        backendRequest: 0s
        request: 0s
      matches:
        - path:
            type: PathPrefix
            value: /
closing this one since it looks like it was related to the backend and was resolved
Update: My previous solution was not correct and did not fix the problem.
Turns out that since I am running the Karpenter autoscaler, I had to make sure the Envoy proxy pods are running on Karpenter nodes by adding a node affinity to the pod spec of the custom-proxy-config.
This is what ended up working for me:
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyProxy
metadata:
  name: custom-proxy-config
  namespace: gwapi-system
spec:
  logging:
    level:
      default: warn
  provider:
    kubernetes:
      envoyDeployment:
        pod:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                  - matchExpressions:
                      - key: autoscaler
                        operator: In
                        values:
                          - karpenter
        replicas: 3
      envoyService:
        annotations:
          service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: instance
          service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
          service.beta.kubernetes.io/aws-load-balancer-type: external
        externalTrafficPolicy: Cluster
        type: LoadBalancer
    type: Kubernetes
I think this is a workaround for my issue. Once I find the root cause, I will update this issue.
@amalic I am seeing similar issues when exposing the ArgoCD UI using an HTTPRoute. Were you able to find the root cause of this?
If I recall correctly, the error message was misleading. I was actually running out of assignable IP addresses on the EKS nodes. AWS by default limits the maximum number of IP addresses per node, depending on the instance type.
I had to enable IP prefix mode on the cluster.
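For reference, enabling prefix mode on the AWS VPC CNI boils down to setting an environment variable on the aws-node DaemonSet (per the AWS docs; verify for your CNI version, and note that existing nodes may need to be recycled before it takes effect):

# tell the VPC CNI to assign /28 prefixes to nodes instead of individual
# secondary IPs, which raises the per-node pod IP limit
kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true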
> If I recall correctly, the error message was misleading. I was actually running out of assignable IP addresses on the EKS nodes. AWS by default limits the maximum number of IP addresses per node, depending on the instance type.
> I had to enable IP prefix mode on the cluster.
Thank you. After I restarted the argocd-server pods, this seems to be resolved. I am not running into any IP limit issues, but yes, the error was definitely misleading.