Activator sometimes crashing when requests time out
What version of Knative?
Knative v1.17, Go 1.23.x
Expected Behavior
Requests can time out without causing the activator to crash.
Actual Behavior
The activator may panic and crash when a request runs into a timeout.
Investigation
This seems to be due to a race condition in the activator when requests run into a timeout.
The last log line in this case may, for example, be "error reverse proxying request", printed from the knative error handler that is passed to the reverse proxy handler.
This is then followed by a panic that crashes the activator container due to: fatal error: concurrent map read and map write. The map in question is the header map of the HTTP response writer.
Stack trace from the crashing goroutine in Go's net/http:
fatal error: concurrent map read and map write
goroutine 28502346 [running]:
net/http.(*chunkWriter).writeHeader(0xc041e3a2e0, {0xc0350a2000, 0x19, 0x800})
net/http/server.go:1493 +0x8f3
net/http.(*chunkWriter).Write(0xc041e3a2e0, {0xc0350a2000, 0x19, 0x800})
net/http/server.go:376 +0x37
bufio.(*Writer).Flush(0xc01bd772c0)
bufio/bufio.go:639 +0x55
net/http.(*response).finishRequest(0xc041e3a2a0)
net/http/server.go:1715 +0x45
net/http.(*conn).serve(0xc028e0d5f0, {0x24d3d58, 0xc01e1ece70})
net/http/server.go:2098 +0x615
created by net/http.(*Server).Serve in goroutine 1591
net/http/server.go:3360 +0x485
On timeout, the timeout handler in pkg/http/handler/timeout.go:
- races with the inner reverse proxy handler to write an error to the HTTP response
- cancels the context of the inner reverse proxy handler
- returns
After the timeout handler returns, the inner reverse proxy handler is still processing the canceled context and will continue to write headers of the HTTP response.
The headers are of type map[string][]string (http.Header) and are read from and written to without any synchronization.
The timeout handler uses a lock to synchronize writes with the inner handler, but that lock does not cover the header map.
In the error case above, the inner reverse proxy handler received the cancel and then called the knative error handler because the context was canceled (https://github.com/knative/pkg/blob/a877090f011ffdff7227c436d9553d7ca4699bc1/network/error_handler.go#L30). That in turn called https://github.com/golang/go/blob/7b263895f7dbe81ddd7c0fc399e6a9ae6fe2f5bf/src/net/http/server.go#L2336, which performs the actual header map write: the error handler deletes elements from and adds elements to the header map of the HTTP response. Meanwhile the timeout handler returns to https://github.com/golang/go/blob/7b263895f7dbe81ddd7c0fc399e6a9ae6fe2f5bf/src/net/http/server.go#L1699, from which it was spawned upon receiving the request, and net/http then tries to read the header map while finishing the request. This can trigger the panic and the subsequent crash of the activator.
If the activator received a response at the same time as the timeout occurred, the inner reverse proxy handler may also still be doing concurrent operations on the header map, e.g. https://github.com/golang/go/blob/7b263895f7dbe81ddd7c0fc399e6a9ae6fe2f5bf/src/net/http/httputil/reverseproxy.go#L518
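To make the pattern concrete, here is a minimal, self-contained sketch of the racy shape (not Knative code; slowInner and timeoutOuter are hypothetical stand-ins for the reverse proxy and the timeout handler). It is not a guaranteed reproducer, but it shows how the inner goroutine can still be mutating the header map while net/http finishes the request after the outer handler has returned:

package main

import (
	"context"
	"net/http"
	"time"
)

// slowInner stands in for the reverse proxy: once its context is canceled it
// still touches w.Header(), e.g. via an error handler.
func slowInner(w http.ResponseWriter, r *http.Request) {
	<-r.Context().Done()
	w.Header().Set("X-Error", "context canceled") // mutates the header map
}

// timeoutOuter stands in for the timeout handler: it cancels the inner
// context and returns without waiting for the inner goroutine to finish.
func timeoutOuter(w http.ResponseWriter, r *http.Request) {
	ctx, cancel := context.WithCancel(r.Context())
	defer cancel() // fires as we return, unblocking slowInner
	done := make(chan struct{})
	go func() {
		defer close(done)
		slowInner(w, r.WithContext(ctx))
	}()
	select {
	case <-done:
	case <-time.After(10 * time.Millisecond):
		// Returning here lets net/http read w's header map in finishRequest
		// while slowInner may still be writing to it.
	}
}

func main() {
	http.ListenAndServe(":8080", http.HandlerFunc(timeoutOuter))
}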
This is crashing our activators once every few days.
Steps to Reproduce the Problem
Difficult since it's a race condition. In theory, the race condition can likely occur any time a request times out.
One possible fix I can think of is for the timeout handler to wait for the inner reverse proxy handler to finish after canceling its context, instead of returning immediately (see the sketch below). It would return once either
- the inner handler has finished, or
- another timeout has passed, to avoid blocking forever if there is an issue in the inner handler
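A rough sketch of that idea, under the assumption that the inner handler already runs in its own goroutine and closes a done channel when it returns (waitAfterCancel, onTimeout, and innerDone are hypothetical names, not part of timeout.go):

package handler

import (
	"context"
	"net/http"
	"time"
)

// waitAfterCancel sketches the idea: cancel the inner handler, then wait for
// it to signal completion (or for a grace period to expire) before handing
// the ResponseWriter back to net/http. innerDone is assumed to be closed when
// the inner reverse proxy handler returns.
func waitAfterCancel(cancel context.CancelFunc, innerDone <-chan struct{}, grace time.Duration) {
	cancel()
	select {
	case <-innerDone:
		// the inner handler is no longer touching the ResponseWriter
	case <-time.After(grace):
		// don't block forever if the inner handler is stuck
	}
}

// Hypothetical timeout branch of the timeout handler:
func onTimeout(w http.ResponseWriter, cancel context.CancelFunc, innerDone <-chan struct{}) {
	http.Error(w, "request timed out", http.StatusGatewayTimeout)
	waitAfterCancel(cancel, innerDone, 5*time.Second)
	// only after this does net/http get to read the header map
}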
Do you have a full stack trace you can add here?
@dprotaso The full stacktrace has around 40k lines, which I can provide if absolutely necessary.
I think the matching goroutine, which was created by goroutine 28502346 mentioned above (https://github.com/knative/serving/issues/15850#issue-3002223697), should hopefully suffice:
goroutine 28575769 [runnable]:
go.uber.org/zap/buffer.(*Buffer).AppendString(0xc04bf7b120, {0xc046b328a0?, 0x1e?})
go.uber.org/[email protected]/buffer/buffer.go:51 +0xdf
go.uber.org/zap/zapcore.safeAppendStringLike[...](0x22ce998?, 0x22cea28, 0xc04bf7b120, {0xc046b328a0, 0x1e})
go.uber.org/[email protected]/zapcore/json_encoder.go:582 +0x7dc
go.uber.org/zap/zapcore.(*jsonEncoder).safeAddString(...)
go.uber.org/[email protected]/zapcore/json_encoder.go:489
go.uber.org/zap/zapcore.(*jsonEncoder).AppendString(0xc024681ec0, {0xc046b328a0, 0x1e})
go.uber.org/[email protected]/zapcore/json_encoder.go:295 +0x219
github.com/blendle/zapdriver.RFC3339NanoTimeEncoder({0x45a029?, 0x22ce998?, 0x36fe640?}, {0x24ed328, 0xc024681ec0})
github.com/blendle/[email protected]/encoder.go:58 +0x44
go.uber.org/zap/zapcore.(*jsonEncoder).AppendTime(0xc024681ec0, {0x21b8a06?, 0x9?, 0x36fe640?})
go.uber.org/[email protected]/zapcore/json_encoder.go:309 +0x5b
go.uber.org/zap/zapcore.(*jsonEncoder).AddTime(0xc024681ec0, {0x21b8a06?, 0xc024681ec0?}, {0x36505c8?, 0x48a1dd?, 0x36fe640?})
go.uber.org/[email protected]/zapcore/json_encoder.go:202 +0x4a
go.uber.org/zap/zapcore.(*jsonEncoder).EncodeEntry(0xc0004ee380, {0x0, {0xc1f6a2bac296d350, 0x9ec9fcba4047, 0x36fe640}, {0x21b7f95, 0x9}, {0xc046f50500, 0x78}, {0x1, ...}, ...}, ...)
go.uber.org/[email protected]/zapcore/json_encoder.go:376 +0x1ba
go.uber.org/zap/zapcore.(*ioCore).Write(0xc0004df590, {0x0, {0xc1f6a2bac296d350, 0x9ec9fcba4047, 0x36fe640}, {0x21b7f95, 0x9}, {0xc046f50500, 0x78}, {0x1, ...}, ...}, ...)
go.uber.org/[email protected]/zapcore/core.go:95 +0x7b
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc0244a8340, {0x0, 0x0, 0x0})
go.uber.org/[email protected]/zapcore/entry.go:253 +0x11c
go.uber.org/zap.(*SugaredLogger).log(0xc00009ee28, 0x0, {0x22057d6?, 0x32?}, {0xc0082b7aa0?, 0x4?, 0x4?}, {0x0, 0x0, 0x0})
go.uber.org/[email protected]/sugar.go:355 +0xec
go.uber.org/zap.(*SugaredLogger).Infof(...)
go.uber.org/[email protected]/sugar.go:204
knative.dev/serving/pkg/activator/handler.(*activationHandler).proxyRequest(0xc000760a20, {{0xc05a6ce420?, 0xc013078000?}, {0xc04d891b18?, 0xc01b773f80?}}, {0x24c4f10, 0xc01b904190}, 0xc0262f8a00, {0xc01ba896d0, 0x10}, ...)
knative.dev/serving/pkg/activator/handler/handler.go:148 +0x71f
knative.dev/serving/pkg/activator/handler.(*activationHandler).ServeHTTP.func1({0xc01ba896d0, 0x10})
knative.dev/serving/pkg/activator/handler/handler.go:98 +0x317
knative.dev/serving/pkg/activator/net.(*revisionThrottler).try.func1()
knative.dev/serving/pkg/activator/net/throttler.go:242 +0x6a
knative.dev/serving/pkg/queue.(*Breaker).Maybe(0x60?, {0x24d3d90?, 0xc01b904140?}, 0xc01b773680)
knative.dev/serving/pkg/queue/breaker.go:152 +0xf9
knative.dev/serving/pkg/activator/net.(*revisionThrottler).try(0xc01daa0c00, {0x24d3d90, 0xc01b904140}, 0xc03ff0dce0)
knative.dev/serving/pkg/activator/net/throttler.go:232 +0x106
knative.dev/serving/pkg/activator/net.(*Throttler).Try(0x2a5?, {0x24d3d90, 0xc01b904140}, {{0xc05a6ce420?, 0xc0262f9f80?}, {0xc04d891b18?, 0x10?}}, 0xc03ff0dce0)
knative.dev/serving/pkg/activator/net/throttler.go:527 +0x5d
knative.dev/serving/pkg/activator/handler.(*activationHandler).ServeHTTP(0xc000760a20, {0x24c4f10, 0xc01b904190}, 0xc0262f88c0)
knative.dev/serving/pkg/activator/handler/handler.go:91 +0x2d3
knative.dev/serving/pkg/http/handler.(*timeoutHandler).ServeHTTP.func4()
knative.dev/serving/pkg/http/handler/timeout.go:118 +0x130
created by knative.dev/serving/pkg/http/handler.(*timeoutHandler).ServeHTTP in goroutine 28502346
knative.dev/serving/pkg/http/handler/timeout.go:110 +0x46e
For debugging purposes we added some logging to the (inner) handler, which slightly distorts the stack trace. The code snippet for that is https://github.com/knative/serving/blob/9dfd12deda0c50be2c6fd1c3ca30f2fd59e0c947/pkg/activator/handler/handler.go#L149. We changed proxy.ServeHTTP(w, r) to:
start := time.Now()
proxy.ServeHTTP(w, r)
a.logger.Infof("TIMING Proxied request %s for revision %s/%s in %s", r.URL.String(), revID.Namespace, revID.Name, time.Since(start).String())
Hi, it seems that we pass the timeoutWriter down to the reverse proxy:
type timeoutWriter struct {
w http.ResponseWriter
clock clock.PassiveClock
mu sync.Mutex
timedOut bool
lastWriteTime time.Time
}
I think we could replace w with a mutex-protected response writer. WDYT? cc @dprotaso
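A rough sketch of what such a wrapper could look like (hypothetical lockedWriter, not the existing timeoutWriter). Every access that goes through the wrapper is serialized; note that reads net/http itself performs on the header map after the handler chain has returned are not covered by this lock:

package handler

import (
	"net/http"
	"sync"
)

// lockedWriter serializes all calls made through the wrapper to the
// underlying http.ResponseWriter.
type lockedWriter struct {
	mu sync.Mutex
	w  http.ResponseWriter
}

func (lw *lockedWriter) Header() http.Header {
	lw.mu.Lock()
	defer lw.mu.Unlock()
	return lw.w.Header()
}

func (lw *lockedWriter) WriteHeader(code int) {
	lw.mu.Lock()
	defer lw.mu.Unlock()
	lw.w.WriteHeader(code)
}

func (lw *lockedWriter) Write(p []byte) (int, error) {
	lw.mu.Lock()
	defer lw.mu.Unlock()
	return lw.w.Write(p)
}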
/assign @skonto
We might be able to simplify our timeout handler using
https://pkg.go.dev/net/http#ResponseController - which can set deadlines on each request
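For reference, a minimal sketch of what wiring that up could look like (Go 1.20+; this is an assumption about how a ResponseController-based write deadline might be used, not a drop-in replacement for the current timeout handler):

package main

import (
	"net/http"
	"time"
)

// withWriteDeadline sets a per-request write deadline via the
// ResponseController; writes attempted after the deadline fail instead of
// racing with net/http finishing the request.
func withWriteDeadline(timeout time.Duration, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		rc := http.NewResponseController(w)
		if err := rc.SetWriteDeadline(time.Now().Add(timeout)); err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	h := withWriteDeadline(5*time.Second, http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		_, _ = w.Write([]byte("ok"))
	}))
	_ = http.ListenAndServe(":8080", h)
}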
Sure, does that controller help with the concurrent write in this case? If not, we can fix this now and migrate later, as the next step.
Unsure - would have to test first
I wonder if that isn't related to Mike's proposal here: https://github.com/knative/serving/pull/15900
Hey! Most likely https://github.com/knative/serving/pull/15899 fixed this issue. At least I couldn't reproduce it after that. I’m going to test it more in case some changes in the activator are also required.
@Fedosin the description sounds a bit different from what we are seeing, but we will adopt that fix and see if it also solves this issue for us 👍
@Fedosin we have adopted your fixes, but we are still seeing the same issue:
{"severity":"ERROR","timestamp":"2025-07-28T20:59:00.211461969Z","logger":"activator","caller":"network/error_handler.go:33","message":"error reverse proxying request; sockstat: sockets: used 1037\nTCP: inuse 981 orphan 1 tw 55 alloc 1332 mem 169\nUDP: inuse 0 mem 0\nUDPLITE: inuse 0\nRAW: inuse 0\nFRAG: inuse 0 memory 0\n","commit":"30caeab-dirty","knative.dev/controller":"activator","knative.dev/pod":"activator-7759c669fc-ktq2k","knative.dev/key":"<namespace>/<name>","error":"context canceled","stacktrace":"knative.dev/pkg/network.ErrorHandler.func1\n\tknative.dev/[email protected]/network/error_handler.go:33\nknative.dev/serving/pkg/activator/handler.(*activationHandler).proxyRequest.func1\n\tknative.dev/serving/pkg/activator/handler/handler.go:147\nnet/http/httputil.(*ReverseProxy).ServeHTTP\n\tnet/http/httputil/reverseproxy.go:486\nknative.dev/serving/pkg/activator/handler.(*activationHandler).proxyRequest\n\tknative.dev/serving/pkg/activator/handler/handler.go:151\nknative.dev/serving/pkg/activator/handler.(*activationHandler).ServeHTTP.func1\n\tknative.dev/serving/pkg/activator/handler/handler.go:98\nknative.dev/serving/pkg/activator/net.(*revisionThrottler).try.func1\n\tknative.dev/serving/pkg/activator/net/throttler.go:243\nknative.dev/serving/pkg/queue.(*Breaker).Maybe\n\tknative.dev/serving/pkg/queue/breaker.go:152\nknative.dev/serving/pkg/activator/net.(*revisionThrottler).try\n\tknative.dev/serving/pkg/activator/net/throttler.go:233\nknative.dev/serving/pkg/activator/net.(*Throttler).Try\n\tknative.dev/serving/pkg/activator/net/throttler.go:528\nknative.dev/serving/pkg/activator/handler.(*activationHandler).ServeHTTP\n\tknative.dev/serving/pkg/activator/handler/handler.go:91\nknative.dev/serving/pkg/http/handler.(*timeoutHandler).ServeHTTP.func4\n\tknative.dev/serving/pkg/http/handler/timeout.go:118"}
{"severity":"INFO","timestamp":"2025-07-28T20:59:00.211573308Z","logger":"activator","caller":"handler/handler.go:152","message":"TIMING Proxied request / for revision <namespace>/<name> in 5m0.000977191s","commit":"30caeab-dirty","knative.dev/controller":"activator","knative.dev/pod":"activator-7759c669fc-ktq2k"}
{"severity":"ERROR","timestamp":"2025-07-28T20:59:00.211461967Z","logger":"activator","caller":"network/error_handler.go:33","message":"error reverse proxying request; sockstat: sockets: used 1037\nTCP: inuse 981 orphan 1 tw 55 alloc 1332 mem 169\nUDP: inuse 0 mem 0\nUDPLITE: inuse 0\nRAW: inuse 0\nFRAG: inuse 0 memory 0\n","commit":"30caeab-dirty","knative.dev/controller":"activator","knative.dev/pod":"activator-7759c669fc-ktq2k","knative.dev/key":"<namespace>/<name>","error":"context canceled","stacktrace":"knative.dev/pkg/network.ErrorHandler.func1\n\tknative.dev/[email protected]/network/error_handler.go:33\nknative.dev/serving/pkg/activator/handler.(*activationHandler).proxyRequest.func1\n\tknative.dev/serving/pkg/activator/handler/handler.go:147\nnet/http/httputil.(*ReverseProxy).ServeHTTP\n\tnet/http/httputil/reverseproxy.go:486\nknative.dev/serving/pkg/activator/handler.(*activationHandler).proxyRequest\n\tknative.dev/serving/pkg/activator/handler/handler.go:151\nknative.dev/serving/pkg/activator/handler.(*activationHandler).ServeHTTP.func1\n\tknative.dev/serving/pkg/activator/handler/handler.go:98\nknative.dev/serving/pkg/activator/net.(*revisionThrottler).try.func1\n\tknative.dev/serving/pkg/activator/net/throttler.go:243\nknative.dev/serving/pkg/queue.(*Breaker).Maybe\n\tknative.dev/serving/pkg/queue/breaker.go:152\nknative.dev/serving/pkg/activator/net.(*revisionThrottler).try\n\tknative.dev/serving/pkg/activator/net/throttler.go:233\nknative.dev/serving/pkg/activator/net.(*Throttler).Try\n\tknative.dev/serving/pkg/activator/net/throttler.go:528\nknative.dev/serving/pkg/activator/handler.(*activationHandler).ServeHTTP\n\tknative.dev/serving/pkg/activator/handler/handler.go:91\nknative.dev/serving/pkg/http/handler.(*timeoutHandler).ServeHTTP.func4\n\tknative.dev/serving/pkg/http/handler/timeout.go:118"}
fatal error: concurrent map read and map write
goroutine 4978323 [running]:
internal/runtime/maps.fatal({0x22bf914?, 0x14?})
runtime/panic.go:1058 +0x18
net/http.(*chunkWriter).writeHeader.func1(...)
net/http/server.go:1328
net/http.(*chunkWriter).writeHeader(0xc007dafe00, {0xc000f19000, 0x19, 0x800})
net/http/server.go:1526 +0xef4
net/http.(*chunkWriter).Write(0xc007dafe00, {0xc000f19000, 0x19, 0x800})
net/http/server.go:376 +0x37
bufio.(*Writer).Flush(0xc002e1c640)
bufio/bufio.go:643 +0x55
net/http.(*response).finishRequest(0xc007dafdc0)
net/http/server.go:1715 +0x45
net/http.(*conn).serve(0xc004506ab0, {0x25c4158, 0xc003e35530})
net/http/server.go:2108 +0x665
created by net/http.(*Server).Serve in goroutine 1622
net/http/server.go:3454 +0x485
Unfortunately, the goroutine of the activation handler had already finished here and is therefore not part of this stack trace.
In the meantime we also upgraded to Knative v1.18.1 and Go 1.24.x, but the issue persists.