
Websocket Timeout

Open ccravens opened this issue 3 years ago • 10 comments

Websocket connections seem to be timing out every 16s:

[screenshot: websocket connections closing and reconnecting after ~16s]

Expected behavior is that the websocket connection maintains Pending status indefinitely and does not time out:

[screenshot: websocket connection remaining open indefinitely]

This issue seems to have started after upgrading to Contour 1.15.0.

  • Contour version: 1.15.0
  • Kubernetes version (kubectl version): 1.20.4
  • Kubernetes installer & version: kubeadm
  • Cloud provider or hardware configuration: AWS
  • OS (e.g. from /etc/os-release): CentOS 8

ccravens avatar May 19 '21 03:05 ccravens

May be related to this: https://github.com/envoyproxy/envoy/issues/16129

ccravens avatar May 19 '21 03:05 ccravens

Hmm, interestingly, I can't reproduce when using the following HTTPProxy:

apiVersion: projectcontour.io/v1
kind: HTTPProxy
metadata:
  name: wstest
  namespace: wstest
  annotations:
    kubernetes.io/ingress.class: contour
spec:
  virtualhost:
    fqdn: wstest.youngnick.dev
    tls:
      secretName: wstestcert
  routes:
    - conditions:
      - prefix: /
      enableWebsockets: true
      services:
        - name: wstest
          port: 8010

This was with Contour 1.15.0, and Envoy 1.18.3, as per our example YAML.

If this is related to envoyproxy/envoy#16129, this would make sense, as it needs to be on a sub-path. I'll try that next.

youngnick avatar May 19 '21 04:05 youngnick

Okay, I changed the HTTPProxy as follows:

apiVersion: projectcontour.io/v1
kind: HTTPProxy
metadata:
  name: wstest
  namespace: wstest
  annotations:
    kubernetes.io/ingress.class: contour
spec:
  virtualhost:
    fqdn: wstest.youngnick.dev
    tls:
      secretName: wstestcert
  routes:
    - conditions:
      - prefix: /
      enableWebsockets: true
      services:
        - name: echoserver
          port: 80
    - conditions:
      - prefix: /websocket
      enableWebsockets: true
      services:
        - name: wstest
          port: 8010

And I still can't get this to reproduce. I also tried with the TLS config removed, same deal.

@ccravens, could you try your client against this setup? I'll leave wss://wstest.youngnick.dev/websocket running overnight my time so you should have a chance to have a go. I think it's very unlikely to be a client or server issue, but testing against this setup would rule that out. Then we can try to figure out what's different between your HTTPProxy and mine. Any chance you could post it here (or message it to me on Slack or something)?

youngnick avatar May 19 '21 04:05 youngnick

Thanks @youngnick! What I'm seeing on my end is two services, each using its own HTTPProxy, on two different pods in two different namespaces, both experiencing the exact behavior described in this issue. The service names are 1) app and 2) IDE. Please find the specifics of each configuration below.

Screenshot of the two services both reconnecting every 16s (ws is app, services is IDE): [screenshot]

Service 1 - App

apiVersion: projectcontour.io/v1
kind: HTTPProxy
metadata:
  name: app
  namespace: app
spec:
  routes:
  - conditions:
    - prefix: /
    services:
    - name: app
      port: 80
  - conditions:
    - prefix: /api
    services:
    - name: app
      port: 3000
  - conditions:
    - prefix: /ws
    enableWebsockets: true
    services:
    - name: app
      port: 3000
  virtualhost:
    fqdn: app.example.com
    tls:
      secretName: app-certificate
---
apiVersion: v1
kind: Service
metadata:
  name: app
  namespace: app
spec:
  ports:
  - name: frontend
    port: 80
    protocol: TCP
    targetPort: 8080
  - name: api
    port: 3000
    protocol: TCP
    targetPort: 3000
  selector:
    app: app
  type: ClusterIP

Service 2 - IDE

apiVersion: projectcontour.io/v1
kind: HTTPProxy
metadata:
  name: ide
  namespace: ide
spec:
  routes:
  - conditions:
    - prefix: /
    enableWebsockets: true
    services:
    - name: ide
      port: 8080
  virtualhost:
    fqdn: ide.example.com
    tls:
      secretName: ide-wildcard-certificate
---
apiVersion: v1
kind: Service
metadata:
  name: ide
  namespace: ide
spec:
  ports:
  - name: http
    port: 8080
    protocol: TCP
    targetPort: 8080
  selector:
    app: ide
  sessionAffinity: None
  type: ClusterIP

ccravens avatar May 19 '21 15:05 ccravens

in office hours we were able to find this Envoy commit: https://github.com/envoyproxy/envoy/pull/15585

if there is no max_stream_duration set, Envoy falls back to the route's response timeout for the upstream, which defaults to 15s

we had @ccravens deploy Envoy 1.17 and saw the reconnects go away. With Envoy 1.18, we were able to change the response timeout field of the TimeoutPolicy (https://projectcontour.io/docs/v1.15.1/config/api/#projectcontour.io/v1.TimeoutPolicy) to something larger than 15s and see the new value reflected in the reconnect interval (setting it to infinity made the reconnects go away entirely)
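For reference, here's a minimal sketch (hand-written against Envoy's v3 API, not Contour-generated config) of where this field lives:

```yaml
# Hedged sketch: the HttpConnectionManager field discussed above.
# After envoyproxy/envoy#15585, when max_stream_duration is unset,
# the per-route response timeout (default 15s) bounds the stream
# instead, which is what terminates these websocket connections.
typed_config:
  "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
  stat_prefix: ingress_http
  common_http_protocol_options:
    max_stream_duration: 3600s  # illustrative; unset means no HCM-level limit
```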

so action items and things to discuss:

  • max_stream_duration is an HttpConnectionManager.CommonHttpProtocolOptions field (so it is not unique to a route); do we allow this to be configured?
  • Do we instead set a default request timeout when we know we have a websocket route?
  • Do we just make users configure their request timeout explicitly when using websockets (the case after Envoy 1.18, because of the change above)?

sunjayBhatia avatar May 20 '21 18:05 sunjayBhatia

we're still a little confused about why @youngnick was not able to reproduce the issue; we may need to check for differences between environments

sunjayBhatia avatar May 20 '21 18:05 sunjayBhatia

Just came back to this one:

  • I think we can allow configuration of max_stream_duration, and we should note its usefulness for websockets in both its own docstring and the enableWebsockets field's docstring.
  • Setting a default request timeout when we have a websocket route sounds reasonable, but I am not sure what it should be. @ccravens, do you have any thoughts here?
  • This is where we're at right now; I hope we can do better somehow. I think that at least having a note in the websocket field docs that it's important to tune some timeouts would be helpful.

youngnick avatar Nov 25 '21 23:11 youngnick

Just came back to this one:

  • I think we can allow configuration of max_stream_duration, and we should note its usefulness for websockets in both its own docstring and the enableWebsockets field's docstring.

Looks like max_stream_duration is useful not only for websockets, but for gRPC streams too. I'm trying to set up an HTTPProxy for a gRPC streaming service, and I notice that my service receives a stream reset after approximately 15 seconds. Is there currently a way to tweak max_stream_duration for a particular service/route in HTTPProxy?

echupriyanov avatar Dec 27 '21 14:12 echupriyanov

@echupriyanov I believe that, per https://github.com/projectcontour/contour/issues/3692#issuecomment-845351552, as a workaround right now you can set the route's response timeout to something larger than 15s; that will effectively be the stream timeout for that particular route. So something like:

apiVersion: projectcontour.io/v1
kind: HTTPProxy
metadata:
  name: foo
  namespace: default
spec:
  routes:
  - conditions:
    - prefix: /
    services:
    - name: foo
      port: 8080
    timeoutPolicy:
      response: 1h
  ...

Note that using response: infinity would completely disable the timeout, though you may want something large but non-infinite in practice.

Let us know if that works for you.
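If anyone hits this with Ingress rather than HTTPProxy, Contour's response-timeout annotation should have the same effect (a sketch; the names and values here are illustrative, not from this thread):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: foo
  namespace: default
  annotations:
    # Analogous to HTTPProxy's timeoutPolicy.response
    projectcontour.io/response-timeout: 1h
spec:
  rules:
  - host: foo.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: foo
            port:
              number: 8080
```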

skriss avatar Jan 03 '22 16:01 skriss

Great, thanks @skriss! Yes, setting the response timeout to a large value does the trick.

echupriyanov avatar Jan 05 '22 12:01 echupriyanov

I don't think this is being investigated any longer; removing the label and closing since there hasn't been recent activity.

sunjayBhatia avatar Nov 03 '23 15:11 sunjayBhatia