
Envoy will send requests to a closed connection

Open · kevin-shelaga opened this issue 3 years ago

Gloo Edge Version

1.11.x (latest stable)

Kubernetes Version

1.22.x

Describe the bug

With an HTTP/1 upstream, Envoy will ACK a FIN from the upstream but still send a request on that same connection shortly after.

(screenshot: packet capture showing Envoy ACKing the FIN and then sending a request on the same connection)

Possible regression of: https://github.com/envoyproxy/envoy/issues/6190

Customer Details:

https://solo-io.slack.com/archives/G01MKQQ0Q3W/p1655824554856459 https://solo-io.slack.com/archives/G01MKQQ0Q3W/p1655824122208049

Steps to reproduce the bug

  1. Install Gloo Edge

  2. Deploy the test app

apiVersion: apps/v1
kind: Deployment
metadata:
  name: expressapp
  namespace: gloo-system
  labels:
    app: expressapp
spec:
  replicas: 1
  selector:
    matchLabels:
      app: expressapp
  template:
    metadata:
      labels:
        app: expressapp
    spec:
      containers:
      - name: expressapp
        image: kevinshelaga/expressapp:latest
        ports:
        - containerPort: 3000
---
apiVersion: v1
kind: Service
metadata:
  name: expressapp-service
  namespace: gloo-system
spec:
  selector:
    app: expressapp
  ports:
    - protocol: TCP
      port: 3000
      targetPort: 3000
  type: ClusterIP
  3. Create the VirtualService
apiVersion: gateway.solo.io/v1
kind: VirtualService
metadata:
  name: default-http
  namespace: gloo-system
spec:
  virtualHost:
    domains:
      - "*"  
    routes:
      - matchers:
          - prefix: /
        routeAction:
          single:
            upstream:
              name: "gloo-system-expressapp-service-3000"
              namespace: gloo-system
  4. Run the k6 test
import http from "k6/http";
import { sleep } from "k6";
import { check } from "k6";

export const options = {
  vus: 100,
  duration: "60s",
};

export default function () {
  const res = http.get(
    "http://YOUR-GWP-IP:80"
  );
  sleep(5);
  check(res, {
    "is status 503": (r) => r.status === 503,
  });

  if (res.status == 503) {
    console.error(res.body);
  }
}

Results (this does not happen when going directly to the service):

ERRO[0010] upstream connect error or disconnect/reset before headers. reset reason: connection termination  source=console
ERRO[0030] upstream connect error or disconnect/reset before headers. reset reason: connection termination  source=console
ERRO[0055] upstream connect error or disconnect/reset before headers. reset reason: connection termination  source=console
ERRO[0055] upstream connect error or disconnect/reset before headers. reset reason: connection termination  source=console

running (1m00.7s), 000/100 VUs, 1200 complete and 0 interrupted iterations
default ✓ [======================================] 100 VUs  1m0s

     ✗ is status 503
      ↳  0% — ✓ 4 / ✗ 1196
{
  "responseFlags": "UC",
  "remoteAddress": "192.168.59.115:9988",
  "modifiedPath": "/",
  "xForwardedFor": null,
  "upstreamName": "gloo-system-expressapp-service-3000_gloo-system",
  "systemTime": "2022-06-27T14:20:13.903Z",
  "site": null,
  "httpMethod": "GET",
  "apiKey": null,
  "userName": null,
  "path": "/",
  "clientDuration": 1,
  "filterState": null,
  "responseCode": 503,
  "targetDuration": null,
  "protocol": "HTTP/1.1",
  "requestId": "98efe19f-baba-450a-9080-87c739736474"
}

Expected Behavior

Envoy should open a new connection before sending a request

Additional Context

Customer issue with express apps

kevin-shelaga avatar Jun 27 '22 17:06 kevin-shelaga

Zendesk ticket #788 has been linked to this issue.

soloio-bot avatar Jul 12 '22 14:07 soloio-bot

It appears this is an upstream bug that has existed for quite some time. All versions I tested from 1.24 back to 1.17 exhibited the errant behavior. The following is the reproduction I performed based on Kevin's steps above.

I have used the following yaml to create a testing environment which has demonstrated that this issue occurs under the following conditions:

  1. when using Edge and vanilla Envoy, but not nginx or direct requests
  2. when using an ExpressJS upstream, but not a Go upstream (see kevinshelaga/expressapp or bohanon/echo-server-express (source) for the Express applications used and bohanon/echo-server (source) for the Go application used)
  3. when using the testing tool k6 (but not hey)

Point 3 is probably irrelevant, and possibly point 2 as well, but I found it interesting that the issue was only occurring with Express apps.

The testing yaml below can be applied to a fresh cluster with Gloo Edge installed to set up the test upstreams, a virtual service for Edge to route to them, and basic nginx/envoy proxies to route to them as well. The k6 tool can be installed via brew install k6; the k6 documentation covers alternate installation methods.

6647-proxies.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: echo-server-express
  namespace: gloo-system
  labels:
    app: echo-server-express
spec:
  replicas: 1
  selector:
    matchLabels:
      app: echo-server-express
  template:
    metadata:
      labels:
        app: echo-server-express
    spec:
      containers:
      - name: echo-server-express
        image: bohanon/echo-server-express:latest
        ports:
        - containerPort: 3000
---
apiVersion: v1
kind: Service
metadata:
  name: echo-server-express-service
  namespace: gloo-system
spec:
  selector:
    app: echo-server-express
  ports:
    - protocol: TCP
      port: 3000
      targetPort: 3000
  type: ClusterIP
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: echo-server
  namespace: gloo-system
  labels:
    app: echo-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: echo-server
  template:
    metadata:
      labels:
        app: echo-server
    spec:
      containers:
      - name: echo-server
        image: bohanon/echo-server:latest
        ports:
        - containerPort: 9999
---
apiVersion: v1
kind: Service
metadata:
  name: echo-server-service
  namespace: gloo-system
spec:
  selector:
    app: echo-server
  ports:
    - protocol: TCP
      port: 9999
      targetPort: 9999
  type: ClusterIP
---
apiVersion: gateway.solo.io/v1
kind: VirtualService
metadata:
  name: default-http
  namespace: gloo-system
spec:
  virtualHost:
    domains:
      - "*"
    routes:
      - matchers:
          - prefix: /express
        routeAction:
          single:
            upstream:
              name: "gloo-system-echo-server-express-service-3000"
              namespace: gloo-system
        options:
          prefixRewrite: '/'
      - matchers:
          - prefix: /go
        routeAction:
          single:
            upstream:
              name: "gloo-system-echo-server-service-9999"
              namespace: gloo-system
        options:
          prefixRewrite: '/'
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-conf
  namespace: gloo-system
data:
  nginx.conf: |
    user nginx;
    worker_processes  3;
    error_log  /var/log/nginx/error.log;
    events {
      worker_connections  10240;
    }
    http {
      log_format  main
              'remote_addr:$remote_addr\t'
              'time_local:$time_local\t'
              'method:$request_method\t'
              'uri:$request_uri\t'
              'host:$host\t'
              'status:$status\t'
              'bytes_sent:$body_bytes_sent\t'
              'referer:$http_referer\t'
              'useragent:$http_user_agent\t'
              'forwardedfor:$http_x_forwarded_for\t'
              'request_time:$request_time';
      access_log	/var/log/nginx/access.log main;
      include /etc/nginx/virtualhost/virtualhost.conf;
    }
  virtualhost.conf: |
    server {
      resolver 10.96.0.10 valid=10s;
      listen 8081 default_server;
      root /usr/local/app;
      access_log /var/log/nginx/app.access_log main;
      error_log /var/log/nginx/app.error_log;
      location /go {
        set $gosvc "echo-server-service.gloo-system.svc.cluster.local:9999";
        rewrite /go / break;
        proxy_pass http://$gosvc;
        proxy_http_version 1.1;
      }
      location /express {
        set $expresssvc "echo-server-express-service.gloo-system.svc.cluster.local:3000";
        rewrite /express / break;
        proxy_pass http://$expresssvc;
        proxy_http_version 1.1;
      }
    }
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  namespace: gloo-system
  labels:
    app: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 8081
        volumeMounts:
        - mountPath: /etc/nginx
          readOnly: true
          name: nginx-conf
      volumes:
      - name: nginx-conf
        configMap:
          name: nginx-conf
          items:
          - key: nginx.conf
            path: nginx.conf
          - key: virtualhost.conf
            path: virtualhost/virtualhost.conf
---
apiVersion: v1
kind: Service
metadata:
  name: nginx-service
  namespace: gloo-system
spec:
  selector:
    app: nginx
  ports:
    - protocol: TCP
      port: 8081
      targetPort: 8081
  type: ClusterIP
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: raw-envoy-config
  namespace: gloo-system
data:
  envoy.yaml: |
    static_resources:
      listeners:
        - address:
            socket_address:
              address: 0.0.0.0
              port_value: 19000
          filter_chains:
            - filters:
                - name: envoy.filters.network.http_connection_manager
                  typed_config:
                    "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                    codec_type: auto
                    stat_prefix: ingress_http
                    route_config:
                      name: local_route
                      virtual_hosts:
                        - name: backend
                          domains:
                            - "*"
                          routes:
                            - match:
                                prefix: "/"
                              route:
                                cluster: service
                    http_filters:
                      - name: envoy.filters.http.router
                        typed_config:
                          "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
        - address:
            socket_address:
              address: 0.0.0.0
              port_value: 8082
          filter_chains:
            - filters:
                - name: envoy.filters.network.http_connection_manager
                  typed_config:
                    "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                    codec_type: auto
                    stat_prefix: ingress_http
                    route_config:
                      name: local_route
                      virtual_hosts:
                        - name: backend
                          domains:
                            - "*"
                          routes:
                            - match:
                                prefix: "/go"
                              route:
                                cluster: echo-server
                                prefix_rewrite: "/"
                            - match:
                                prefix: "/express"
                              route:
                                cluster: echo-server-express
                                prefix_rewrite: "/"
                    http_filters:
                      - name: envoy.filters.http.router
                        typed_config:
                          "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
      clusters:
        - name: service
          connect_timeout: 0.25s
          type: strict_dns
          lb_policy: round_robin
          load_assignment:
            cluster_name: service
            endpoints:
              - lb_endpoints:
                  - endpoint:
                      address:
                        socket_address:
                          address: 127.0.0.1
                          port_value: 19001
        - name: echo-server-express
          connect_timeout: 0.25s
          type: strict_dns
          lb_policy: round_robin
          load_assignment:
            cluster_name: echo-server-express
            endpoints:
              - lb_endpoints:
                  - endpoint:
                      address:
                        socket_address:
                          address: echo-server-express-service.gloo-system.svc.cluster.local
                          port_value: 3000
        - name: echo-server
          connect_timeout: 0.25s
          type: strict_dns
          lb_policy: round_robin
          load_assignment:
            cluster_name: echo-server
            endpoints:
              - lb_endpoints:
                  - endpoint:
                      address:
                        socket_address:
                          address: echo-server-service.gloo-system.svc.cluster.local
                          port_value: 9999
    admin:
      access_log_path: "/dev/null"
      address:
        socket_address:
          address: 0.0.0.0
          port_value: 19001
---
kind: Deployment
apiVersion: apps/v1
metadata:
  name: vanilla-envoy
  namespace: gloo-system
  labels:
    app: vanilla-envoy
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vanilla-envoy
  template:
    metadata:
      labels:
        app: vanilla-envoy
    spec:
      containers:
        - name: envoy
          image: envoyproxy/envoy:v1.23.2
          ports:
            - containerPort: 8082
          securityContext:
            runAsUser: 1111
          volumeMounts:
          - readOnly: true
            mountPath: /config
            name: proxy-config
          args:
          - "envoy"
          - "--config-path"
          - "/config/envoy.yaml"
          - "--log-level"
          - "debug"
      volumes:
        - name: proxy-config
          configMap:
            name: raw-envoy-config
---
apiVersion: v1
kind: Service
metadata:
  name: vanilla-envoy
  namespace: gloo-system
spec:
  selector:
    app: vanilla-envoy
  ports:
    - protocol: TCP
      port: 8082
      targetPort: 8082
  type: ClusterIP

Ports need to be forwarded with (in separate shells)

kubectl port-forward -n gloo-system deploy/vanilla-envoy 8082:8082
kubectl port-forward -n gloo-system deploy/nginx 8081:8081
kubectl port-forward -n gloo-system deploy/gateway-proxy 8080:8080
kubectl port-forward -n gloo-system deploy/echo-server-express 3000:3000
kubectl port-forward -n gloo-system deploy/echo-server 9999:9999

The following are k6 scripts that can be run to count how many times a 503 is received over 1200 requests (100 virtual users each making a request every 5 seconds for 60 seconds).

k6 run path/to/test/script.js
k6 scripts
6647-k6-express-direct.js
import http from "k6/http";
import { sleep } from "k6";
import { check } from "k6";

export const options = {
  vus: 100,
  duration: "60s",
};

export default function () {
  const res = http.get(
    "http://localhost:3000"
  );
  sleep(5);
  check(res, {
    "is status 503": (r) => r.status === 503,
  });

  if (res.status == 503) {
    console.error(res.body);
  }
}
6647-k6-express-nginx.js
import http from "k6/http";
import { sleep } from "k6";
import { check } from "k6";

export const options = {
  vus: 100,
  duration: "60s",
};

export default function () {
  const res = http.get(
    "http://localhost:8081/express"
  );
  sleep(5);
  check(res, {
    "is status 503": (r) => r.status === 503,
  });

  if (res.status == 503) {
    console.error(res.body);
  }
}
6647-k6-express-vanilla.js
import http from "k6/http";
import { sleep } from "k6";
import { check } from "k6";

export const options = {
  vus: 100,
  duration: "60s",
};

export default function () {
  const res = http.get(
    "http://localhost:8082/express"
  );
  sleep(5);
  check(res, {
    "is status 503": (r) => r.status === 503,
  });

  if (res.status == 503) {
    console.error(res.body);
  }
}
6647-k6-express-edge.js
import http from "k6/http";
import { sleep } from "k6";
import { check } from "k6";

export const options = {
  vus: 100,
  duration: "60s",
};

export default function () {
  const res = http.get(
    "http://localhost:8080/express"
  );
  sleep(5);
  check(res, {
    "is status 503": (r) => r.status === 503,
  });

  if (res.status == 503) {
    console.error(res.body);
  }
}
6647-k6-go-direct.js
import http from "k6/http";
import { sleep } from "k6";
import { check } from "k6";

export const options = {
  vus: 100,
  duration: "60s",
};

export default function () {
  const res = http.get(
    "http://localhost:9999"
  );
  sleep(5);
  check(res, {
    "is status 503": (r) => r.status === 503,
  });

  if (res.status === 503) {
    console.error(res.body);
  }
}
6647-k6-go-edge.js
import http from "k6/http";
import { sleep } from "k6";
import { check } from "k6";

export const options = {
  vus: 100,
  duration: "60s",
};

export default function () {
  const res = http.get(
    "http://localhost:8080/go"
  );
  sleep(5);
  check(res, {
    "is status 503": (r) => r.status === 503,
  });

  if (res.status === 503) {
    console.error(res.body);
  }
}

jbohanon avatar Oct 19 '22 16:10 jbohanon

@yuval-k Is there anything in here that rings any bells? @nfuden mentioned you had experience with similar bugs in upstream.

jbohanon avatar Oct 19 '22 17:10 jbohanon

@kevin-shelaga I have created upstream bug-report issue envoyproxy/envoy#23625. Please let us know how to prioritize the further investigation into the upstream bug.

jbohanon avatar Oct 21 '22 16:10 jbohanon

In October, we presented a resolution to this issue whereby we injected a Connection: close header into requests to the upstream, in hopes that this would resolve the intermittent 503s. The customer is concerned about the performance implications of never reusing connections. As we have learned from investigation, this is not a bug in Envoy; rather, it is caused by improperly behaving upstream servers which fail to send a Connection: close header in their response to Envoy even when they terminate the connection immediately afterward. Because the server failed to inform Envoy that it would be closing the connection, Envoy allows that connection to be reassigned. The request to which the connection was reassigned then sees the connection reset when the server finally sends the FIN.

In short, the workaround proposed in October and the one I propose below are not workarounds for a bug in Envoy; rather, they are workarounds for a poorly behaving upstream server. Envoy will not reassign a connection if the server responds with the Connection: close header.
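As a point of comparison, the same "never reuse a connection" effect (with the same performance tradeoff the customer is concerned about) can also be expressed on the Upstream via Envoy's per-connection request cap. A sketch, assuming Gloo Edge's connectionConfig field on the Upstream generated for the reproduction app above:

```yaml
apiVersion: gloo.solo.io/v1
kind: Upstream
metadata:
  name: gloo-system-expressapp-service-3000
  namespace: gloo-system
spec:
  kube:
    serviceName: expressapp-service
    serviceNamespace: gloo-system
    servicePort: 3000
  connectionConfig:
    # Cap each upstream connection at one request, so Envoy opens a fresh
    # connection per request and never sends on a connection the server
    # may already have begun closing.
    maxRequestsPerConnection: 1
```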

It is possible to mitigate this by configuring retries in Gloo Edge with a retryOn value of reset.
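A sketch of that mitigation as route options, reusing the VirtualService from the reproduction above (numRetries and perTryTimeout values are illustrative):

```yaml
apiVersion: gateway.solo.io/v1
kind: VirtualService
metadata:
  name: default-http
  namespace: gloo-system
spec:
  virtualHost:
    domains:
      - "*"
    routes:
      - matchers:
          - prefix: /
        routeAction:
          single:
            upstream:
              name: "gloo-system-expressapp-service-3000"
              namespace: gloo-system
        options:
          retries:
            # Retry requests whose upstream connection was reset mid-flight
            # (the UC/503 symptom above) on a fresh connection.
            retryOn: reset
            numRetries: 3
            perTryTimeout: 5s
```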

Here is a link to a comment about the race condition when reusing connections with a poorly behaving upstream

jbohanon avatar May 15 '23 17:05 jbohanon

I have communicated to the customers that this is an issue with the underlying frameworks and that, unfortunately, there is nothing Envoy can do to fix it. The prescribed workaround of applying the Connection: close header is the best approach to resolve any 503s.

kevin-shelaga avatar May 22 '23 13:05 kevin-shelaga