
Caddy v2.8.4/v2.10.0 Randomly Enters 502 State Across All Domains, Possibly SSE-Related

Open syedrizwansrs opened this issue 7 months ago • 8 comments

We are experiencing a problem where Caddy replies with 502s: the server appears to transition into a locked state where any request sent to any hosted domain, even on separate IP addresses with separate backends, returns a 502, despite the backend servers being online.

This problem does not directly correlate with high load or with how long the server has been running. However, one correlating factor is the recent introduction of SSE: at most around 50 SSE connections would be active on one of the hosted domains.
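
For illustration, the SSE-serving domain sits behind an ordinary reverse_proxy site block; a minimal sketch of that kind of setup (the domain and upstream below are placeholders, not our actual config):

```
# Hypothetical, simplified example -- not the attached production config.
sse.example.com {
	reverse_proxy 10.0.0.10:8080 {
		# Disable response buffering so event-stream data reaches the
		# client immediately (Caddy also does this automatically when it
		# detects a text/event-stream response).
		flush_interval -1
	}
}
```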

We are using version v2.8.4 h1:q3pe0wpBj1OcHFZ3n/1nl4V4bxBrYoSoab7rL9BMYNk= and also tried version v2.10.0 h1:fonubSaQKF1YANl8TXqGcn4IbIRUDdfAkpcsfI/vX5U=, both exhibit the same problem.

The config we have is as follows (IP's and domains obfuscated):

caddy-sample-config.txt

For now we have written a script that checks the logs for a large number of 502s and restarts the server, but we can also catch this scenario and withdraw the BGP route for this server instead. So if there is any specific debugging you would like us to run when it happens, please let us know.

Thanks

syedrizwansrs avatar May 22 '25 11:05 syedrizwansrs

Thanks for opening an issue! We'll look into this.

It's not immediately clear to me what is going on, so I'll need your help to understand it better.

Ideally, we need to be able to reproduce the bug in the most minimal way possible using the latest version of Caddy. This allows us to write regression tests to verify the fix is working. If we can't reproduce it, then you'll have to test our changes for us until it's fixed -- and then we can't add test cases, either.

I've attached a template below that will help make this easier and faster! This will require some effort on your part -- please understand that we will be dedicating time to fix the bug you are reporting if you can just help us understand it and reproduce it easily.

This template will ask for some information you've already provided; that's OK, just fill it out the best you can. :+1: I've also included some helpful tips below the template. Feel free to let me know if you have any questions!

Thank you again for your report, we look forward to resolving it!

Template

## 1. Environment

### 1a. Operating system and version

```
paste here
```


### 1b. Caddy version (run `caddy version` or paste commit SHA)

This should be the latest version of Caddy:

```
paste here
```


## 2. Description

### 2a. What happens (briefly explain what is wrong)




### 2b. Why it's a bug (if it's not obvious)




### 2c. Log output

```
paste terminal output or logs here
```



### 2d. Workaround(s)




### 2e. Relevant links




## 3. Tutorial (minimal steps to reproduce the bug)




Instructions -- please heed otherwise we cannot help you (help us help you!)

  1. Environment: Please fill out your OS and Caddy versions, even if you don't think they are relevant. (They are always relevant.) If you built Caddy from source, provide the commit SHA and specify your exact Go version.

  2. Description: Describe at a high level what the bug is. What happens? Why is it a bug? Not all bugs are obvious, so convince readers that it's actually a bug.

    • 2c) Log output: Paste terminal output and/or complete logs in a code block. DO NOT REDACT INFORMATION except for credentials. Please enable debug and access logs.
    • 2d) Workaround: What are you doing to work around the problem in the meantime? This can help others who encounter the same problem, until we implement a fix.
    • 2e) Relevant links: Please link to any related issues, pull requests, docs, and/or discussion. This can add crucial context to your report.
  3. Tutorial: What are the minimum required specific steps someone needs to take in order to experience the same bug? Your goal here is to make sure that anyone else can have the same experience with the bug as you do. You are writing a tutorial, so make sure to carry it out yourself before posting it. Please:

    • Start with an empty config. Add only the lines/parameters that are absolutely required to reproduce the bug.
    • Do not run Caddy inside containers.
    • Run Caddy manually in your terminal; do not use systemd or other init systems.
    • If making HTTP requests, avoid web browsers. Use a simpler HTTP client instead, like curl.
    • Do not redact any information from your config (except credentials). Domain names are public knowledge and often necessary for quick resolution of an issue!
    • Note that ignoring this advice may result in delays, or even in your issue being closed. 😞 Only actionable issues are kept open, and if there is not enough information or clarity to reproduce the bug, then the report is not actionable.

Example of a tutorial:

Create a config file:
{ ... }

Open terminal and run Caddy:

$ caddy ...

Make an HTTP request:

$ curl ...

Notice that the result is ___ but it should be ___.

mholt avatar Jun 10 '25 20:06 mholt

1. Environment

1a. Operating system and version

Ubuntu 22.04.4 LTS (GNU/Linux 5.15.0-134-generic x86_64)

1b. Caddy version (run caddy version or paste commit SHA)

This should be the latest version of Caddy:

v2.10.0 h1:fonubSaQKF1YANl8TXqGcn4IbIRUDdfAkpcsfI/vX5U=

2. Description

2a. What happens (briefly explain what is wrong)

Occasionally, the server appears to go into an "error state"; when this happens, all domains (including those on different IP addresses, with different reverse_proxy backends) return only 502 errors. Annoyingly, we have not been able to correlate these incidents with anything else that happens. It is NOT time-dependent, as we have observed it happen only minutes after restarting the server. It seemed to happen more frequently after we implemented SSE on one of our domains.

2b. Why it's a bug (if it's not obvious)

During the incident:

  1. There is no notable failure on the backends; they remain online as seen from other nodes (Caddy servers running on a different anycast address serving a different zone), and our monitoring system does not report them going offline. However, we can't be certain that a single request somewhere fails with an unexpected response code.
  2. More importantly, even if point 1 resulted in a localized temporary outage on one domain, it should not affect the whole server and all the domains hosted on it.

2c. Log output

The logs have been emailed to you directly.

We’ve reviewed the relevant log entries and are sharing a filtered excerpt that includes TLS activity and reverse proxy warnings such as "aborting with incomplete response". These were extracted from the access logs and timestamp-formatted for easier reading. Apart from these entries, we did not observe anything else that directly correlates with the issue. If helpful, we can also provide raw access logs for the same period. Please let us know if that would be useful, or if there are any other diagnostic steps you’d recommend we take the next time this occurs.
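
For the next occurrence we can also raise verbosity first; as I understand it, debug-level logging is turned on with the global `debug` option, roughly:

```
{
	# Global options block: enable debug-level logging so the reverse
	# proxy reports more detail about upstream failures.
	debug
}
```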

2d. Workaround(s)

To mitigate the issue, we currently monitor the number of 502 Bad Gateway responses being returned by the server. When the failure rate crosses a certain threshold, we automatically trigger a restart of the Caddy service. This restores normal functionality temporarily. However, this is only a stopgap measure and does not address the root cause.
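
For illustration, a rough sketch of that kind of watchdog (the log path, window, threshold, and match pattern are placeholders rather than our exact script):

```
#!/bin/sh
# Hypothetical watchdog sketch: restart Caddy if recent access-log
# entries contain too many 502 responses. Path, window and threshold
# are placeholders.
LOG=/var/log/caddy/access.log
THRESHOLD=100

count=$(tail -n 5000 "$LOG" | grep -c '"status":502')
if [ "$count" -ge "$THRESHOLD" ]; then
    systemctl restart caddy
fi
```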

2e. Relevant links

None.

3. Tutorial (minimal steps to reproduce the bug)

Unfortunately, the issue cannot be reproduced on demand. It occurs intermittently without any clear pattern or trigger. We have observed that temporarily removing the server from the Anycast network and bringing it back online after a short interval often resolves the issue. Please let us know what steps you’d recommend.

syedrizwansrs avatar Jun 11 '25 13:06 syedrizwansrs

Thank you -- will begin looking into this!

mholt avatar Jun 12 '25 03:06 mholt

@mholt I am not sure if any of this is relevant, but here are 3 file dumps taken while the error was happening

goroutine.txt heap.txt threadcreate.txt

If you have any steps for us to perform when it gets into this locked-up state, I would be happy to do so. Thanks.
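
In case it helps anyone else capture the same thing: these dumps can be pulled from the pprof endpoints on Caddy's admin API while the problem is occurring. Assuming the default admin address, something like:

```
# Assumes the default admin endpoint on localhost:2019.
curl -s "http://localhost:2019/debug/pprof/goroutine?debug=2" > goroutine.txt
curl -s "http://localhost:2019/debug/pprof/heap?debug=1" > heap.txt
curl -s "http://localhost:2019/debug/pprof/threadcreate?debug=1" > threadcreate.txt
```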

digipigeon avatar Jun 22 '25 11:06 digipigeon

Hello, I have a further update for you with regards to this issue.

We have now been able to confirm that intermittent downtime caused by Netlify is in turn triggering this issue. We incorrectly assumed that Netlify was solid (especially across multiple zones) and did not investigate those endpoints as thoroughly or granularly as we should have. We have replicated Netlify going down using external curl requests. So that is the trigger; I am following up with Netlify about this now.

However, there still seems to be an issue here: as soon as Netlify goes down for one domain, it triggers an outage on everything else. All hosted sites return 502 because of an outage on one or more backends.

When I get a chance I will see if we can set up a scenario to replicate this and provide you with more information.
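
The rough scenario I have in mind is two sites with independent upstreams, where one upstream is deliberately unreachable; a hedged sketch with placeholder domains and ports:

```
# Hypothetical two-site Caddyfile: a.example.com proxies to an upstream
# that is down (nothing listens on 127.0.0.1:9999), b.example.com
# proxies to a healthy local service.
a.example.com {
	reverse_proxy 127.0.0.1:9999
}

b.example.com {
	reverse_proxy 127.0.0.1:8080
}
```

Then hammer the broken site while watching the healthy one, e.g.:

```
$ while true; do curl -s -o /dev/null https://a.example.com; done &
$ curl -i https://b.example.com
```

If the second request ever starts returning 502, that would reproduce the cross-domain behaviour we are seeing.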

digipigeon avatar Jun 26 '25 10:06 digipigeon

Just adding another (non-scientific) data point that we have observed similar behavior where caddy "slows down" and may produce some 502s on backends that we know are up (by visiting directly).

One recently introduced variable is that we are also running a reverse_proxy to an API backend which makes heavy use of "content-type text/event-stream; charset=utf-8" / SSE traffic.

A quick down/up of caddy brings back the snappiness and all is well from there (until we next notice the slowdown).

Apologies for the vagueness of this, I will do my best to collect more data as we notice it again.

Ubuntu / Docker Caddy v2.10.0 h1:fonubSaQKF1YANl8TXqGcn4IbIRUDdfAkpcsfI/vX5U=
(with a bunch of plugins)

 Standard modules: 127

admin.api.souin
cache
caddy.listeners.layer4
http.authentication.providers.authorizer
http.handlers.authenticator
http.handlers.cache
http.handlers.replace_response
layer4
layer4.handlers.echo
layer4.handlers.proxy
layer4.handlers.proxy_protocol
layer4.handlers.socks5
layer4.handlers.subroute
layer4.handlers.tee
layer4.handlers.throttle
layer4.handlers.tls
layer4.matchers.clock
layer4.matchers.dns
layer4.matchers.http
layer4.matchers.local_ip
layer4.matchers.not
layer4.matchers.openvpn
layer4.matchers.postgres
layer4.matchers.proxy_protocol
layer4.matchers.quic
layer4.matchers.rdp
layer4.matchers.regexp
layer4.matchers.remote_ip
layer4.matchers.remote_ip_list
layer4.matchers.socks4
layer4.matchers.socks5
layer4.matchers.ssh
layer4.matchers.tls
layer4.matchers.winbox
layer4.matchers.wireguard
layer4.matchers.xmpp
layer4.proxy.selection_policies.first
layer4.proxy.selection_policies.ip_hash
layer4.proxy.selection_policies.least_conn
layer4.proxy.selection_policies.random
layer4.proxy.selection_policies.random_choose
layer4.proxy.selection_policies.round_robin
security
storages.cache.badger
storages.cache.otter
tls.handshake_match.alpn

  Non-standard modules: 46

Just wanted to signal the original post highly resonated with our experience as well. I'll do my best to source more info as requested.

jotterbot avatar Jul 15 '25 02:07 jotterbot

Sorry for the delay in following up here; we have a lot of moving parts and it's been difficult to zero in on the root cause.

The response from Netlify for our issue is "It does appear that there are intermittent routing issue between the hosting service you are using and the AWS infrastructure used for our CDN nodes"

Unfortunately we have not been able to rectify that issue any further. However, we have bolstered stability dramatically by serving stale content when the backend goes down.
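
For anyone wanting a simpler mitigation in plain Caddy, without a caching layer, a rough sketch is to intercept gateway errors and serve a locally stored fallback page instead (the domain and path below are hypothetical):

```
# Hypothetical sketch: when Caddy cannot get a complete response from
# the upstream, serve a previously saved fallback page instead of
# passing the 502 on to the visitor.
example.com {
	reverse_proxy backend.example.internal:8080

	handle_errors 502 503 504 {
		root * /var/www/fallback/example.com
		rewrite * /index.html
		file_server
	}
}
```

This only covers errors generated by Caddy itself (unreachable or aborted upstreams); a 502 returned by the upstream as a normal response would need to be intercepted inside reverse_proxy instead.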

It still feels like there is a nebulous internal bug that does not contain an individual problem to the area affected, e.g. one backend experiences a failure and it manifests itself on other domains or other unrelated routes.

Sorry that my answer itself is vague; I have done my best to anecdotally explain our experience.

digipigeon avatar Aug 13 '25 10:08 digipigeon