Caddy crashes under load, with a huge number of mutex locks in the cache handler
Under a reasonably small load (~250 req/sec), when using the cache-handler (we're using the go-redis storage, but I've seen this happen with the default one as well), Caddy dies with a fatal error: "concurrent map iteration and map write". This happens reasonably frequently (maybe once a day) and requires a restart.
fatal error: concurrent map iteration and map write
goroutine 27695297 [running]:
net/http.Header.Clone(...)
net/http/header.go:101
net/http.(*response).WriteHeader(0xc920c73c00, 0x1f4)
net/http/server.go:1231 +0x1e7
github.com/caddyserver/caddy/v2/modules/caddyhttp.(*responseRecorder).WriteHeader(0x1ef02e0?, 0x1a70889?)
github.com/caddyserver/caddy/[email protected]/modules/caddyhttp/responsewriter.go:167 +0xb6
github.com/caddyserver/caddy/v2/modules/caddyhttp.(*Server).ServeHTTP(0xc00089a008, {0x1f00d70, 0xc920c73c00}, 0xcf30e57040)
github.com/caddyserver/caddy/[email protected]/modules/caddyhttp/server.go:415 +0x1535
net/http.serverHandler.ServeHTTP({0xcfd3b14750?}, {0x1f00d70?, 0xc920c73c00?}, 0x6?)
net/http/server.go:3210 +0x8e
net/http.(*conn).serve(0xcef56bf4d0, {0x1f04538, 0x1070ae8cba0})
net/http/server.go:2092 +0x5d0
created by net/http.(*Server).Serve in goroutine 181
net/http/server.go:3360 +0x485
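For context on the error itself: as far as I understand, an http.Header is just a map[string][]string, so the fatal error means one goroutine was iterating the response headers (Header.Clone inside WriteHeader, per the trace above) while another goroutine was still writing to them. Here's a minimal sketch that triggers the same runtime fault; it has nothing to do with Caddy's actual code and is only meant to show what the message means:

// Sketch only: http.Header is a plain map, so cloning/iterating it in one
// goroutine while another keeps mutating it is a fatal, unrecoverable race.
package main

import "net/http"

func main() {
    h := http.Header{}
    h.Set("Content-Type", "text/html")

    // Writer: keeps mutating the header map (think of a late/async handler
    // still setting headers after the response has started).
    go func() {
        for {
            h.Set("X-Async", "value")
        }
    }()

    // Reader: net/http's WriteHeader path calls Header.Clone, which ranges
    // over the map; doing that concurrently with the writer above dies with
    // "fatal error: concurrent map iteration and map write".
    for {
        _ = h.Clone()
    }
}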
This is accompanied by about 1.9 million goroutines in the "sync.Mutex.Lock" state, implying (I guess) that a huge number of goroutines are stuck waiting on the same mutex.
goroutine 21446179 [sync.Mutex.Lock, 92 minutes]:
sync.runtime_SemacquireMutex(0xc918f18cb0?, 0xb8?, 0xc001a52708?)
runtime/sema.go:95 +0x25
sync.(*Mutex).lockSlow(0xc0008fc640)
sync/mutex.go:173 +0x15d
sync.(*Mutex).Lock(...)
sync/mutex.go:92
github.com/darkweak/souin/pkg/surrogate/providers.(*baseStorage).storeTag(0xc000a80360, {0x0, 0x0}, {0xc61aaf6ac0, 0x38}, 0xcb96172320)
github.com/darkweak/[email protected]/pkg/surrogate/providers/common.go:168 +0xad
github.com/darkweak/souin/pkg/surrogate/providers.(*baseStorage).Store(0xc000a80360, 0xc918f18ec0?, {0xc428c81400?, 0x90?}, {0x19dce40?, 0xcc0cfff7
github.com/darkweak/[email protected]/pkg/surrogate/providers/common.go:232 +0x27f
github.com/darkweak/souin/pkg/middleware.(*SouinBaseHandler).Store.func2({{0x0, 0x0}, 0x12d, {0x0, 0x0}, 0x0, 0x0, 0xca7ca1ea50, {0x1efafc8, 0xcd3d
github.com/darkweak/[email protected]/pkg/middleware/middleware.go:376 +0xcd
created by github.com/darkweak/souin/pkg/middleware.(*SouinBaseHandler).Store in goroutine 21446047
github.com/darkweak/[email protected]/pkg/middleware/middleware.go:375 +0x2891
My theory is that the crash happens when some pool is finally exhausted.
I've attached a log file containing the first 1000 lines; the full dump is many gigabytes (I can provide it if you're interested, but it's all different goroutines waiting at the same point in the code).
Build:
FROM caddy:2.8.4-builder AS builder
RUN xcaddy build \
    --with github.com/pberkel/[email protected] \
    --with github.com/caddyserver/[email protected] \
    --with github.com/darkweak/storages/go-redis/caddy \
    --with github.com/aksdb/[email protected]
Caddyfile, with some values redacted:
(ip_block) {
    @block {
        remote_ip REDACTED
    }
    respond @block "Access denied" 403
}

(host_block) {
    @host_block {
        host REDACTED
    }
    respond @host_block "Rate limited" 429
}

{
    on_demand_tls {
        ask https://REDACTED
    }
    storage redis {
        host "{$REDIS_HOST}"
        username "{$REDIS_USER}"
        password "{$REDIS_PASSWORD}"
    }
    servers {
        metrics
    }
    cache {
        # how our cache shows in the response header; default is Souin.
        cache_name sc
        redis {
            configuration {
                Addrs "{$REDIS_CACHE_HOST}:6379"
                User "{$REDIS_CACHE_USER}"
                Password "{$REDIS_CACHE_PASSWORD}"
                DB 0
            }
        }
    }
    order cgi before respond
}

# Catch-all for any domain
:443 {
    log
    tls {
        # we do some funky certificate selection stuff with a local script, fallback to the default
        get_certificate http http://localhost:4434/cert
        on_demand
    }
    import ip_block
    import host_block
    cache {
        key {
            hide
            headers Accept-Language
        }
        # ignore any cache-control indicators from the client
        mode bypass_request
    }
    # Reverse proxy to server.betteruptime.com
    reverse_proxy https://REDACTED {
        header_up Host {upstream_hostport}
        header_up X-Real-IP {http.request.remote.host}
    }
}

http://:9190 {
    metrics /metrics
}

http://localhost:4434 {
    cache {
        stale 5m
        ttl 5m
    }
    cgi /cert /etc/caddy/get_fallback_cert.sh
}
@darkweak, did you have any thoughts on this issue, or is there more information you require? The cache-handler is entirely unusable in production because of this issue, and I don't think it's related to a specific storage provider.
Hello, I have to write a patch that doesn't send the headers if it's an asynchronous revalidation (I think it tries to update the status code once the async revalidation is done). You can try the latest commit of this PR, https://github.com/darkweak/souin/pull/589, as it contains some patches related to revalidation.
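Roughly, the idea is something like this (an illustrative sketch only, not the actual patch; the function and parameter names here are made up):

// Sketch only (made-up names): if the work is an asynchronous revalidation,
// the client already received a (stale) response from another goroutine, so
// nothing must be written to the live ResponseWriter; only the cached entry
// gets refreshed.
package sketch

import "net/http"

func writeResponse(rw http.ResponseWriter, status int, headers http.Header, body []byte, asyncRevalidation bool) {
    if asyncRevalidation {
        // Refresh the stored entry elsewhere; touching rw here would race
        // with the goroutine that already wrote the response headers.
        return
    }
    for name, values := range headers {
        for _, v := range values {
            rw.Header().Add(name, v)
        }
    }
    rw.WriteHeader(status)
    _, _ = rw.Write(body)
}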