caddy icon indicating copy to clipboard operation
caddy copied to clipboard

Don't close active websocket connections on config reload

Open kevinlang opened this issue 1 year ago • 7 comments

Split off from #3143

Currently Caddy closes all active websocket connections on config reload. This impacts the ability to do graceful blue-green deployments where all new traffic is directed to the new version of the application while the existing application's websocket connections are slowly drained at the application level.

Ideally, Caddy would not close active websocket connections associated with old configs/upstreams, instead keeping them around. Optionally, there can be a new config value that is an upper-bound on how long they should be kept around before being forced-close to address any sort of memory leak concerns people may have. (see https://github.com/caddyserver/caddy/issues/3143#issuecomment-1478298087)

This behavior of keeping active websockets around for old/updated upstream's matches NGINX's behavior for config reloads.

kevinlang avatar Mar 30 '23 14:03 kevinlang

New here. I understand this is not a good-first-issue. Im interested in learning this codebase. Can I pick this?

v4rn avatar Apr 11 '23 20:04 v4rn

Sure, you can give it a shot if you'd like. I started work on this a while back but didn't finish it. https://github.com/caddyserver/caddy/compare/ws-close-timeouts The config layer is done, but the actual usage of the timeouts is incomplete/wrong.

francislavoie avatar Apr 11 '23 21:04 francislavoie

Hello. I was also interested in having this done so I decided to give it a try and I think I have figured the closing logic out. For the moment the result is at https://github.com/mmm444/caddy/tree/ws-close-timeouts .

Shall I file a PR? And if yes, against what branch? master or ws-close-timeouts?

mmm444 avatar Jun 08 '23 10:06 mmm444

Interesting! Yeah I never opened a PR (because I didn't think the code quality was quite there yet) so feel free. Against master 👍

What kind of tests did you perform to validate it?

francislavoie avatar Jun 08 '23 15:06 francislavoie

That's cool, we'd be happy to review it -- feel free to open a PR. Thanks for working on this!

mholt avatar Jun 08 '23 16:06 mholt

Thanks. Will submit PR in a short time.

What kind of tests did you perform to validate it?

Basically manual tests only. I had a caddy with reverse proxy (obviously) and one upstream server that accepted websockets. I used a browser to initiate the websocket connection. And this is what I did:

  • set only StreamTimeout and verified that the connection got closed after the given timeout
  • set only StreamTimeout, closed the WS from the browser and verified that the timeout timer got canceled
  • set only StreamCloseDelay, triggered config reload and verified that the stream got closed after the delay
  • set only StreamCloseDelay, triggered config reload, manually closed the WS in browser and verified that the delay timer got destroyed just after the websocket closed
  • set both StreamTimeout and StreamCloseDelay, triggered config reload in the time when StreamTimeout happened before StreamCloseDelay and verified that timer associated to StreamCloseDelay got destroyed and that the connection closed
  • set both StreamTimeout and StreamCloseDelay, triggered config reload in the time when StreamTimeout happened after StreamCloseDelay and verified that timer associated to StreamTimeout got destroyed and that the connection closed

mmm444 avatar Jun 08 '23 19:06 mmm444

@mmm444 Awesome, thanks for the PR! I'm hoping to circle back to this very soon. :) Would be great to get it in the beta for 2.7.

mholt avatar Jun 13 '23 22:06 mholt

I think this is taken care of now, in 2.7.

mholt avatar May 10 '24 19:05 mholt

@mholt Hi there, we're on version v2.7.6 h1:w0NymbG2m9PcvKWsrXO6EEkY9Ru4FJK8uQbYcev1p3A= and we're actually still seeing our websockets die on caddy reload 😅

Not sure if there's any known edge case scenarios where that would occur, but it's pretty overt in our frontend Client (we'll perform some generation and it will abruptly die the moment I execute the reload).

chongzluong avatar May 28 '24 22:05 chongzluong

@chongzluong You need to specify stream_timeout and/or stream_delay: https://caddyserver.com/docs/modules/http.handlers.reverse_proxy#stream_timeout

See also https://github.com/caddyserver/caddy/pull/5567

If you have further questions please feel free to ask on our forums!

mholt avatar May 29 '24 17:05 mholt

@mholt Hi, we updated caddy to 2.8 and can't manage to make this new feature work. We added the stream_close_delay flag in our config. But once we add a new handle using the API, we notice that all of websocket clients get disconnected for a short time. This is systematic : we made a script that triggers the API call and we clearly notice the systematic disconnect on all websocket client's logs. Thanks for your help

{
  "srv0": {
    "automatic_https": {
      "disable": true
    },
    "errors": {
      "@id": "srv0-error",
      "routes": [
        {
          "handle": [
            {
              "handler": "rewrite",
              "uri": "{http.matchers.file.relative}"
            },
            {
              "handler": "file_server",
              "root": "/var/www/html"
            }
          ],
          "match": [
            {
              "file": {
                "root": "/var/www/html",
                "try_files": [
                  "{http.request.uri.path}",
                  "index.html",
                  "=404"
                ]
              }
            }
          ]
        }
      ]
    },
    "listen": [
      ":443"
    ],
    "routes": [
      {
        "handle": [
          {
            "handler": "subroute",
            "routes": [
              {
                "handle": [
                  {
                    "body": "port 443",
                    "handler": "static_response"
                  }
                ]
              }
            ]
          }
        ],
        "match": [
          {
            "host": [
              "our-domain.com"
            ]
          }
        ],
        "terminal": true
      },
      {
        "@id": "srv0-u982998-p443",
        "handle": [
          {
            "handler": "reverse_proxy",
            "stream_close_delay": "60s",
            "transport": {
              "protocol": "http",
              "tls": {
                "server_name": "u982998.our-domain.com"
              }
            },
            "upstreams": [
              {
                "dial": "our-dial.com:30190"
              }
            ]
          },
          {
            "handler": "reverse_proxy",
            "stream_close_delay": "60s",
            "transport": {
              "protocol": "http",
              "tls": {
                "server_name": "*.u982998.our-domain.com"
              }
            },
            "upstreams": [
              {
                "dial": "our-dial.com:30190"
              }
            ]
          }
        ],
# etc ...

daweedm avatar Jun 17 '24 13:06 daweedm

@daweedm Can you open a new issue so we can help troubleshoot? Please reduce it to the minimum reproducer so we can easily verify and fix.

Here's a template to help out when you post a new issue:


This template will ask for some information you've already provided; that's OK, just fill it out the best you can. :+1: I've also included some helpful tips below the template. Feel free to let me know if you have any questions!

Thank you again for your report, we look forward to resolving it!

Template

## 1. Environment

### 1a. Operating system and version

```
paste here
```


### 1b. Caddy version (run `caddy version` or paste commit SHA)

This should be the latest version of Caddy:

```
paste here
```


## 2. Description

### 2a. What happens (briefly explain what is wrong)




### 2b. Why it's a bug (if it's not obvious)




### 2c. Log output

```
paste terminal output or logs here
```



### 2d. Workaround(s)




### 2e. Relevant links




## 3. Tutorial (minimal steps to reproduce the bug)




Instructions -- please heed otherwise we cannot help you (help us help you!)

  1. Environment: Please fill out your OS and Caddy versions, even if you don't think they are relevant. (They are always relevant.) If you built Caddy from source, provide the commit SHA and specify your exact Go version.

  2. Description: Describe at a high level what the bug is. What happens? Why is it a bug? Not all bugs are obvious, so convince readers that it's actually a bug.

    • 2c) Log output: Paste terminal output and/or complete logs in a code block. DO NOT REDACT INFORMATION except for credentials. Please enable debug and access logs.
    • 2d) Workaround: What are you doing to work around the problem in the meantime? This can help others who encounter the same problem, until we implement a fix.
    • 2e) Relevant links: Please link to any related issues, pull requests, docs, and/or discussion. This can add crucial context to your report.
  3. Tutorial: What are the minimum required specific steps someone needs to take in order to experience the same bug? Your goal here is to make sure that anyone else can have the same experience with the bug as you do. You are writing a tutorial, so make sure to carry it out yourself before posting it. Please:

    • Start with an empty config. Add only the lines/parameters that are absolutely required to reproduce the bug.
    • Do not run Caddy inside containers.
    • Run Caddy manually in your terminal; do not use systemd or other init systems.
    • If making HTTP requests, avoid web browsers. Use a simpler HTTP client instead, like curl.
    • Do not redact any information from your config (except credentials). Domain names are public knowledge and often necessary for quick resolution of an issue!
    • Note that ignoring this advice may result in delays, or even in your issue being closed. 😞 Only actionable issues are kept open, and if there is not enough information or clarity to reproduce the bug, then the report is not actionable.

Example of a tutorial:

Create a config file:
{ ... }

Open terminal and run Caddy:

$ caddy ...

Make an HTTP request:

$ curl ...

Notice that the result is ___ but it should be ___.

mholt avatar Jun 18 '24 23:06 mholt