Caddy crashes under load, with a huge number of mutex locks in the cache handler
Under a reasonably small load (~250 req/sec), when using the cache-handler (we're using the go-redis storage, but I've seen this happen with the default one as well), Caddy dies with a fatal error: "concurrent map iteration and map write". This happens reasonably frequently (maybe once a day) and requires a restart.
fatal error: concurrent map iteration and map write
goroutine 27695297 [running]:
net/http.Header.Clone(...)
net/http/header.go:101
net/http.(*response).WriteHeader(0xc920c73c00, 0x1f4)
net/http/server.go:1231 +0x1e7
github.com/caddyserver/caddy/v2/modules/caddyhttp.(*responseRecorder).WriteHeader(0x1ef02e0?, 0x1a70889?)
github.com/caddyserver/caddy/[email protected]/modules/caddyhttp/responsewriter.go:167 +0xb6
github.com/caddyserver/caddy/v2/modules/caddyhttp.(*Server).ServeHTTP(0xc00089a008, {0x1f00d70, 0xc920c73c00}, 0xcf30e57040)
github.com/caddyserver/caddy/[email protected]/modules/caddyhttp/server.go:415 +0x1535
net/http.serverHandler.ServeHTTP({0xcfd3b14750?}, {0x1f00d70?, 0xc920c73c00?}, 0x6?)
net/http/server.go:3210 +0x8e
net/http.(*conn).serve(0xcef56bf4d0, {0x1f04538, 0x1070ae8cba0})
net/http/server.go:2092 +0x5d0
created by net/http.(*Server).Serve in goroutine 181
net/http/server.go:3360 +0x485
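For context on the error itself: as far as I understand, an http.Header is just a map[string][]string, so the fatal error means one goroutine was iterating the response headers (Header.Clone inside WriteHeader, per the trace above) while another goroutine was still writing to them. Here's a minimal sketch that triggers the same runtime fault; it has nothing to do with Caddy's actual code and is only meant to show what the message means:

// Sketch only: http.Header is a plain map, so cloning/iterating it in one
// goroutine while another keeps mutating it is a fatal, unrecoverable race.
package main

import "net/http"

func main() {
    h := http.Header{}
    h.Set("Content-Type", "text/html")

    // Writer: keeps mutating the header map (think of a late/async handler
    // still setting headers after the response has started).
    go func() {
        for {
            h.Set("X-Async", "value")
        }
    }()

    // Reader: net/http's WriteHeader path calls Header.Clone, which ranges
    // over the map; doing that concurrently with the writer above dies with
    // "fatal error: concurrent map iteration and map write".
    for {
        _ = h.Clone()
    }
}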
This is accompanied by about 1.9 million goroutines in the "sync.Mutex.Lock" state, implying (I guess) that a huge number of goroutines are stuck waiting on the same mutex.
goroutine 21446179 [sync.Mutex.Lock, 92 minutes]:
sync.runtime_SemacquireMutex(0xc918f18cb0?, 0xb8?, 0xc001a52708?)
runtime/sema.go:95 +0x25
sync.(*Mutex).lockSlow(0xc0008fc640)
sync/mutex.go:173 +0x15d
sync.(*Mutex).Lock(...)
sync/mutex.go:92
github.com/darkweak/souin/pkg/surrogate/providers.(*baseStorage).storeTag(0xc000a80360, {0x0, 0x0}, {0xc61aaf6ac0, 0x38}, 0xcb96172320)
github.com/darkweak/[email protected]/pkg/surrogate/providers/common.go:168 +0xad
github.com/darkweak/souin/pkg/surrogate/providers.(*baseStorage).Store(0xc000a80360, 0xc918f18ec0?, {0xc428c81400?, 0x90?}, {0x19dce40?, 0xcc0cfff7
github.com/darkweak/[email protected]/pkg/surrogate/providers/common.go:232 +0x27f
github.com/darkweak/souin/pkg/middleware.(*SouinBaseHandler).Store.func2({{0x0, 0x0}, 0x12d, {0x0, 0x0}, 0x0, 0x0, 0xca7ca1ea50, {0x1efafc8, 0xcd3d
github.com/darkweak/[email protected]/pkg/middleware/middleware.go:376 +0xcd
created by github.com/darkweak/souin/pkg/middleware.(*SouinBaseHandler).Store in goroutine 21446047
github.com/darkweak/[email protected]/pkg/middleware/middleware.go:375 +0x2891
My theory is that the crash happens when some pool is finally exhausted.
I've attached a log file containing the first 1000 lines; the full dump is many gigabytes (I can provide it if you're interested, but it's all different goroutines waiting at the same point in the code).
Build:
FROM caddy:2.8.4-builder AS builder
RUN xcaddy build \
    --with github.com/pberkel/[email protected] \
    --with github.com/caddyserver/[email protected] \
    --with github.com/darkweak/storages/go-redis/caddy \
    --with github.com/aksdb/[email protected]
Caddyfile, with some values redacted:
(ip_block) {
    @block {
        remote_ip REDACTED
    }
    respond @block "Access denied" 403
}

(host_block) {
    @host_block {
        host REDACTED
    }
    respond @host_block "Rate limited" 429
}

{
    on_demand_tls {
        ask https://REDACTED
    }
    storage redis {
        host "{$REDIS_HOST}"
        username "{$REDIS_USER}"
        password "{$REDIS_PASSWORD}"
    }
    servers {
        metrics
    }
    cache {
        # how our cache shows in the response header; default is Souin.
        cache_name sc
        redis {
            configuration {
                Addrs "{$REDIS_CACHE_HOST}:6379"
                User "{$REDIS_CACHE_USER}"
                Password "{$REDIS_CACHE_PASSWORD}"
                DB 0
            }
        }
    }
    order cgi before respond
}

# Catch-all for any domain
:443 {
    log
    tls {
        # we do some funky certificate selection stuff with a local script, fallback to the default
        get_certificate http http://localhost:4434/cert
        on_demand
    }
    import ip_block
    import host_block
    cache {
        key {
            hide
            headers Accept-Language
        }
        # ignore any cache-control indicators from the client
        mode bypass_request
    }
    # Reverse proxy to server.betteruptime.com
    reverse_proxy https://REDACTED {
        header_up Host {upstream_hostport}
        header_up X-Real-IP {http.request.remote.host}
    }
}

http://:9190 {
    metrics /metrics
}

http://localhost:4434 {
    cache {
        stale 5m
        ttl 5m
    }
    cgi /cert /etc/caddy/get_fallback_cert.sh
}
@darkweak, did you have any thoughts on this issue, or is there more information you require? The cache-handler is entirely unusable in production because of this issue, and I don't think it's related to a specific storage provider.
Hello, I have to write a patch that doesn't send the headers if it's an asynchronous revalidation (I think it tries to update the status code once the async revalidation is done). You can try the latest commit of this PR, https://github.com/darkweak/souin/pull/589, as it contains some patches related to revalidation.
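Roughly, the idea is something like this (an illustrative sketch only, not the actual patch; the function and parameter names here are made up):

// Sketch only (made-up names): if the work is an asynchronous revalidation,
// the client already received a (stale) response from another goroutine, so
// nothing must be written to the live ResponseWriter; only the cached entry
// gets refreshed.
package sketch

import "net/http"

func writeResponse(rw http.ResponseWriter, status int, headers http.Header, body []byte, asyncRevalidation bool) {
    if asyncRevalidation {
        // Refresh the stored entry elsewhere; touching rw here would race
        // with the goroutine that already wrote the response headers.
        return
    }
    for name, values := range headers {
        for _, v := range values {
            rw.Header().Add(name, v)
        }
    }
    rw.WriteHeader(status)
    _, _ = rw.Write(body)
}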