
Activator sometimes crashing when requests time out

Open norman465 opened this issue 8 months ago • 12 comments

What version of Knative?

Knative v1.17, Go 1.23.x

Expected Behavior

Requests can time out without causing the activator to crash.

Actual Behavior

The activator may panic and crash when a request runs into a timeout.

Investigation

This seems to be due to a race condition in the activator when requests run into a timeout.

The last log line in this case may, for example, be error reverse proxying request, printed from the Knative error handler that is passed to the reverse proxy handler. This is then followed by a panic and a crashing activator container due to: fatal error: concurrent map read and map write. The map in question is the header map of the HTTP response writer.

Stack trace from the crashing goroutine in Go's net/http:

fatal error: concurrent map read and map write

goroutine 28502346 [running]:
net/http.(*chunkWriter).writeHeader(0xc041e3a2e0, {0xc0350a2000, 0x19, 0x800})
	net/http/server.go:1493 +0x8f3
net/http.(*chunkWriter).Write(0xc041e3a2e0, {0xc0350a2000, 0x19, 0x800})
	net/http/server.go:376 +0x37
bufio.(*Writer).Flush(0xc01bd772c0)
	bufio/bufio.go:639 +0x55
net/http.(*response).finishRequest(0xc041e3a2a0)
	net/http/server.go:1715 +0x45
net/http.(*conn).serve(0xc028e0d5f0, {0x24d3d58, 0xc01e1ece70})
	net/http/server.go:2098 +0x615
created by net/http.(*Server).Serve in goroutine 1591
	net/http/server.go:3360 +0x485

On timeout, the timeout handler in pkg/http/handler/timeout.go:

  1. races with the inner reverse proxy handler to write an error to the HTTP response,
  2. cancels the context of the inner reverse proxy handler, and
  3. returns.

After the timeout handler returns, the inner reverse proxy handler is still processing the canceled context and will continue to write headers of the HTTP response. The headers are of type map[string][]string and are read from and written to without any synchronization. The timeout handler uses a lock to synchronize writes with the inner handler, but that lock does not cover the header map.

In our error case above, the inner reverse proxy handler received the cancellation and then called the Knative error handler because the context was canceled: https://github.com/knative/pkg/blob/a877090f011ffdff7227c436d9553d7ca4699bc1/network/error_handler.go#L30. That in turn called https://github.com/golang/go/blob/7b263895f7dbe81ddd7c0fc399e6a9ae6fe2f5bf/src/net/http/server.go#L2336, which performs the actual header map write: the error handler deletes elements from and adds elements to the header map of the HTTP response. Meanwhile the timeout handler itself is returning to https://github.com/golang/go/blob/7b263895f7dbe81ddd7c0fc399e6a9ae6fe2f5bf/src/net/http/server.go#L1699, from which it was spawned upon receiving the request, and which then tries to read the header map while finishing the request. This may trigger the panic and the subsequent crash of the activator.

If the activator received a response at the same time as the timeout occurred, the inner reverse proxy handler may also still be doing concurrent operations on the header map, e.g.: https://github.com/golang/go/blob/7b263895f7dbe81ddd7c0fc399e6a9ae6fe2f5bf/src/net/http/httputil/reverseproxy.go#L518
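
For illustration, here is a minimal sketch of the racy pattern described above (simplified, not the actual Knative code): the outer timeout handler cancels the inner handler's context and returns, while the inner handler may keep mutating the response header map.

import (
	"context"
	"net/http"
	"time"
)

func racyTimeoutHandler(inner http.Handler, timeout time.Duration) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithCancel(r.Context())
		defer cancel()

		done := make(chan struct{})
		go func() {
			// The inner handler (e.g. the reverse proxy) reads and writes
			// w.Header() with no synchronization against the outer handler.
			inner.ServeHTTP(w, r.WithContext(ctx))
			close(done)
		}()

		select {
		case <-done:
		case <-time.After(timeout):
			cancel()
			// Returning here hands the connection back to net/http, which reads
			// w.Header() in finishRequest while the inner goroutine may still be
			// writing to it -> "fatal error: concurrent map read and map write".
		}
	})
}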

This is crashing our activators once every few days.

Steps to Reproduce the Problem

Difficult since it's a race condition. In theory, the race condition can likely occur any time a request times out.

norman465 avatar Apr 17 '25 11:04 norman465

One possible fix I can think of is for the timeout handler to wait for the inner reverse proxy handler to finish after it has canceled its context, instead of returning immediately (a rough sketch follows after this list). It would return once either

  • the inner handler has finished, or
  • another timeout has passed, to avoid blocking forever if there is an issue in the inner handler
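
A rough sketch of what that could look like, assuming the same done-channel structure as in the sketch above (drainTimeout is a hypothetical additional setting, not an existing Knative option):

select {
case <-done:
case <-time.After(timeout):
	cancel()
	// Instead of returning right away, wait until the inner handler is done
	// with the response writer, or until a secondary drain timeout expires.
	select {
	case <-done: // inner reverse proxy handler finished
	case <-time.After(drainTimeout): // avoid blocking forever on a stuck handler
	}
}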

norman465 avatar Apr 17 '25 11:04 norman465

Do you have a full stack trace you can add here?

dprotaso avatar Apr 17 '25 13:04 dprotaso

@dprotaso The full stacktrace has around 40k lines, which I can provide if absolutely necessary.

I think the matching goroutine that was created from the goroutine mentioned above [28502346] (https://github.com/knative/serving/issues/15850#issue-3002223697) should hopefully suffice:

goroutine 28575769 [runnable]:
go.uber.org/zap/buffer.(*Buffer).AppendString(0xc04bf7b120, {0xc046b328a0?, 0x1e?})
	go.uber.org/[email protected]/buffer/buffer.go:51 +0xdf
go.uber.org/zap/zapcore.safeAppendStringLike[...](0x22ce998?, 0x22cea28, 0xc04bf7b120, {0xc046b328a0, 0x1e})
	go.uber.org/[email protected]/zapcore/json_encoder.go:582 +0x7dc
go.uber.org/zap/zapcore.(*jsonEncoder).safeAddString(...)
	go.uber.org/[email protected]/zapcore/json_encoder.go:489
go.uber.org/zap/zapcore.(*jsonEncoder).AppendString(0xc024681ec0, {0xc046b328a0, 0x1e})
	go.uber.org/[email protected]/zapcore/json_encoder.go:295 +0x219
github.com/blendle/zapdriver.RFC3339NanoTimeEncoder({0x45a029?, 0x22ce998?, 0x36fe640?}, {0x24ed328, 0xc024681ec0})
	github.com/blendle/[email protected]/encoder.go:58 +0x44
go.uber.org/zap/zapcore.(*jsonEncoder).AppendTime(0xc024681ec0, {0x21b8a06?, 0x9?, 0x36fe640?})
	go.uber.org/[email protected]/zapcore/json_encoder.go:309 +0x5b
go.uber.org/zap/zapcore.(*jsonEncoder).AddTime(0xc024681ec0, {0x21b8a06?, 0xc024681ec0?}, {0x36505c8?, 0x48a1dd?, 0x36fe640?})
	go.uber.org/[email protected]/zapcore/json_encoder.go:202 +0x4a
go.uber.org/zap/zapcore.(*jsonEncoder).EncodeEntry(0xc0004ee380, {0x0, {0xc1f6a2bac296d350, 0x9ec9fcba4047, 0x36fe640}, {0x21b7f95, 0x9}, {0xc046f50500, 0x78}, {0x1, ...}, ...}, ...)
	go.uber.org/[email protected]/zapcore/json_encoder.go:376 +0x1ba
go.uber.org/zap/zapcore.(*ioCore).Write(0xc0004df590, {0x0, {0xc1f6a2bac296d350, 0x9ec9fcba4047, 0x36fe640}, {0x21b7f95, 0x9}, {0xc046f50500, 0x78}, {0x1, ...}, ...}, ...)
	go.uber.org/[email protected]/zapcore/core.go:95 +0x7b
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc0244a8340, {0x0, 0x0, 0x0})
	go.uber.org/[email protected]/zapcore/entry.go:253 +0x11c
go.uber.org/zap.(*SugaredLogger).log(0xc00009ee28, 0x0, {0x22057d6?, 0x32?}, {0xc0082b7aa0?, 0x4?, 0x4?}, {0x0, 0x0, 0x0})
	go.uber.org/[email protected]/sugar.go:355 +0xec
go.uber.org/zap.(*SugaredLogger).Infof(...)
	go.uber.org/[email protected]/sugar.go:204
knative.dev/serving/pkg/activator/handler.(*activationHandler).proxyRequest(0xc000760a20, {{0xc05a6ce420?, 0xc013078000?}, {0xc04d891b18?, 0xc01b773f80?}}, {0x24c4f10, 0xc01b904190}, 0xc0262f8a00, {0xc01ba896d0, 0x10}, ...)
	knative.dev/serving/pkg/activator/handler/handler.go:148 +0x71f
knative.dev/serving/pkg/activator/handler.(*activationHandler).ServeHTTP.func1({0xc01ba896d0, 0x10})
	knative.dev/serving/pkg/activator/handler/handler.go:98 +0x317
knative.dev/serving/pkg/activator/net.(*revisionThrottler).try.func1()
	knative.dev/serving/pkg/activator/net/throttler.go:242 +0x6a
knative.dev/serving/pkg/queue.(*Breaker).Maybe(0x60?, {0x24d3d90?, 0xc01b904140?}, 0xc01b773680)
	knative.dev/serving/pkg/queue/breaker.go:152 +0xf9
knative.dev/serving/pkg/activator/net.(*revisionThrottler).try(0xc01daa0c00, {0x24d3d90, 0xc01b904140}, 0xc03ff0dce0)
	knative.dev/serving/pkg/activator/net/throttler.go:232 +0x106
knative.dev/serving/pkg/activator/net.(*Throttler).Try(0x2a5?, {0x24d3d90, 0xc01b904140}, {{0xc05a6ce420?, 0xc0262f9f80?}, {0xc04d891b18?, 0x10?}}, 0xc03ff0dce0)
	knative.dev/serving/pkg/activator/net/throttler.go:527 +0x5d
knative.dev/serving/pkg/activator/handler.(*activationHandler).ServeHTTP(0xc000760a20, {0x24c4f10, 0xc01b904190}, 0xc0262f88c0)
	knative.dev/serving/pkg/activator/handler/handler.go:91 +0x2d3
knative.dev/serving/pkg/http/handler.(*timeoutHandler).ServeHTTP.func4()
	knative.dev/serving/pkg/http/handler/timeout.go:118 +0x130
created by knative.dev/serving/pkg/http/handler.(*timeoutHandler).ServeHTTP in goroutine 28502346
	knative.dev/serving/pkg/http/handler/timeout.go:110 +0x46e

For debugging purposes we added some logging to the (inner) handler, which slightly alters the stack trace compared to upstream. The code snippet for that is https://github.com/knative/serving/blob/9dfd12deda0c50be2c6fd1c3ca30f2fd59e0c947/pkg/activator/handler/handler.go#L149. We changed proxy.ServeHTTP(w, r) to:

start := time.Now()
proxy.ServeHTTP(w, r)
a.logger.Infof("TIMING Proxied request %s for revision %s/%s in %s", r.URL.String(), revID.Namespace, revID.Name, time.Since(start).String())

norman465 avatar Apr 28 '25 11:04 norman465

Hi, it seems that we pass the timeoutWriter down to the reverse proxy:

type timeoutWriter struct {
	w     http.ResponseWriter
	clock clock.PassiveClock

	mu            sync.Mutex
	timedOut      bool
	lastWriteTime time.Time
}

I think we could replace w with a mutex-protected response writer. WDYT? cc @dprotaso
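
For illustration, a rough sketch of such a wrapper (names are illustrative, not the actual Knative types). Note that Header() hands out the underlying map, so code that mutates the returned map directly would still bypass the lock; the wrapper only serializes the ResponseWriter method calls themselves:

type lockedResponseWriter struct {
	mu sync.Mutex
	w  http.ResponseWriter
}

func (l *lockedResponseWriter) Header() http.Header {
	l.mu.Lock()
	defer l.mu.Unlock()
	return l.w.Header()
}

func (l *lockedResponseWriter) Write(p []byte) (int, error) {
	l.mu.Lock()
	defer l.mu.Unlock()
	return l.w.Write(p)
}

func (l *lockedResponseWriter) WriteHeader(statusCode int) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.w.WriteHeader(statusCode)
}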

skonto avatar May 06 '25 12:05 skonto

/assign @skonto

skonto avatar May 07 '25 11:05 skonto

We might be able to simplify our timeout handler using

https://pkg.go.dev/net/http#ResponseController - which can set deadlines on each request
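
For example, a minimal sketch (illustrative only, not a concrete proposal for the activator) of setting a per-request write deadline via the response controller:

func withWriteDeadline(inner http.Handler, timeout time.Duration) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		rc := http.NewResponseController(w)
		// Writes after the deadline fail with an error instead of racing a
		// separate timeout goroutine for the response writer.
		// SetWriteDeadline returns an error wrapping http.ErrNotSupported if
		// the underlying ResponseWriter does not support deadlines; ignore it
		// here and fall back to serving without one.
		_ = rc.SetWriteDeadline(time.Now().Add(timeout))
		inner.ServeHTTP(w, r)
	})
}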

dprotaso avatar May 09 '25 15:05 dprotaso

Sure, does that controller help with the concurrent write in this case? If not, we can fix this now and migrate later, as the next step.

skonto avatar May 09 '25 15:05 skonto

Sure, does that controller help with the concurrent write in this case? If not, we can fix this now and migrate later, as the next step.

Unsure - would have to test first

dprotaso avatar May 10 '25 01:05 dprotaso

I wonder if that isn't related to Mike's proposal here: https://github.com/knative/serving/pull/15900

dsimansk avatar May 29 '25 09:05 dsimansk

Hey! Most likely https://github.com/knative/serving/pull/15899 fixed this issue. At least I couldn't reproduce it after that. I’m going to test it more in case some changes in the activator are also required.

Fedosin avatar Jun 11 '25 11:06 Fedosin

@Fedosin the description sounds a bit different than what we are seeing, but we will adopt that fix and see if that also solves this issue for us👍

norman465 avatar Jun 17 '25 07:06 norman465

@Fedosin we have adopted your fixes, but we are still seeing the same issue:

{"severity":"ERROR","timestamp":"2025-07-28T20:59:00.211461969Z","logger":"activator","caller":"network/error_handler.go:33","message":"error reverse proxying request; sockstat: sockets: used 1037\nTCP: inuse 981 orphan 1 tw 55 alloc 1332 mem 169\nUDP: inuse 0 mem 0\nUDPLITE: inuse 0\nRAW: inuse 0\nFRAG: inuse 0 memory 0\n","commit":"30caeab-dirty","knative.dev/controller":"activator","knative.dev/pod":"activator-7759c669fc-ktq2k","knative.dev/key":"<namespace>/<name>","error":"context canceled","stacktrace":"knative.dev/pkg/network.ErrorHandler.func1\n\tknative.dev/[email protected]/network/error_handler.go:33\nknative.dev/serving/pkg/activator/handler.(*activationHandler).proxyRequest.func1\n\tknative.dev/serving/pkg/activator/handler/handler.go:147\nnet/http/httputil.(*ReverseProxy).ServeHTTP\n\tnet/http/httputil/reverseproxy.go:486\nknative.dev/serving/pkg/activator/handler.(*activationHandler).proxyRequest\n\tknative.dev/serving/pkg/activator/handler/handler.go:151\nknative.dev/serving/pkg/activator/handler.(*activationHandler).ServeHTTP.func1\n\tknative.dev/serving/pkg/activator/handler/handler.go:98\nknative.dev/serving/pkg/activator/net.(*revisionThrottler).try.func1\n\tknative.dev/serving/pkg/activator/net/throttler.go:243\nknative.dev/serving/pkg/queue.(*Breaker).Maybe\n\tknative.dev/serving/pkg/queue/breaker.go:152\nknative.dev/serving/pkg/activator/net.(*revisionThrottler).try\n\tknative.dev/serving/pkg/activator/net/throttler.go:233\nknative.dev/serving/pkg/activator/net.(*Throttler).Try\n\tknative.dev/serving/pkg/activator/net/throttler.go:528\nknative.dev/serving/pkg/activator/handler.(*activationHandler).ServeHTTP\n\tknative.dev/serving/pkg/activator/handler/handler.go:91\nknative.dev/serving/pkg/http/handler.(*timeoutHandler).ServeHTTP.func4\n\tknative.dev/serving/pkg/http/handler/timeout.go:118"}
{"severity":"INFO","timestamp":"2025-07-28T20:59:00.211573308Z","logger":"activator","caller":"handler/handler.go:152","message":"TIMING Proxied request / for revision <namespace>/<name> in 5m0.000977191s","commit":"30caeab-dirty","knative.dev/controller":"activator","knative.dev/pod":"activator-7759c669fc-ktq2k"}
{"severity":"ERROR","timestamp":"2025-07-28T20:59:00.211461967Z","logger":"activator","caller":"network/error_handler.go:33","message":"error reverse proxying request; sockstat: sockets: used 1037\nTCP: inuse 981 orphan 1 tw 55 alloc 1332 mem 169\nUDP: inuse 0 mem 0\nUDPLITE: inuse 0\nRAW: inuse 0\nFRAG: inuse 0 memory 0\n","commit":"30caeab-dirty","knative.dev/controller":"activator","knative.dev/pod":"activator-7759c669fc-ktq2k","knative.dev/key":"<namespace>/<name>","error":"context canceled","stacktrace":"knative.dev/pkg/network.ErrorHandler.func1\n\tknative.dev/[email protected]/network/error_handler.go:33\nknative.dev/serving/pkg/activator/handler.(*activationHandler).proxyRequest.func1\n\tknative.dev/serving/pkg/activator/handler/handler.go:147\nnet/http/httputil.(*ReverseProxy).ServeHTTP\n\tnet/http/httputil/reverseproxy.go:486\nknative.dev/serving/pkg/activator/handler.(*activationHandler).proxyRequest\n\tknative.dev/serving/pkg/activator/handler/handler.go:151\nknative.dev/serving/pkg/activator/handler.(*activationHandler).ServeHTTP.func1\n\tknative.dev/serving/pkg/activator/handler/handler.go:98\nknative.dev/serving/pkg/activator/net.(*revisionThrottler).try.func1\n\tknative.dev/serving/pkg/activator/net/throttler.go:243\nknative.dev/serving/pkg/queue.(*Breaker).Maybe\n\tknative.dev/serving/pkg/queue/breaker.go:152\nknative.dev/serving/pkg/activator/net.(*revisionThrottler).try\n\tknative.dev/serving/pkg/activator/net/throttler.go:233\nknative.dev/serving/pkg/activator/net.(*Throttler).Try\n\tknative.dev/serving/pkg/activator/net/throttler.go:528\nknative.dev/serving/pkg/activator/handler.(*activationHandler).ServeHTTP\n\tknative.dev/serving/pkg/activator/handler/handler.go:91\nknative.dev/serving/pkg/http/handler.(*timeoutHandler).ServeHTTP.func4\n\tknative.dev/serving/pkg/http/handler/timeout.go:118"}

fatal error: concurrent map read and map write

goroutine 4978323 [running]:
internal/runtime/maps.fatal({0x22bf914?, 0x14?})
	runtime/panic.go:1058 +0x18
net/http.(*chunkWriter).writeHeader.func1(...)
	net/http/server.go:1328
net/http.(*chunkWriter).writeHeader(0xc007dafe00, {0xc000f19000, 0x19, 0x800})
	net/http/server.go:1526 +0xef4
net/http.(*chunkWriter).Write(0xc007dafe00, {0xc000f19000, 0x19, 0x800})
	net/http/server.go:376 +0x37
bufio.(*Writer).Flush(0xc002e1c640)
	bufio/bufio.go:643 +0x55
net/http.(*response).finishRequest(0xc007dafdc0)
	net/http/server.go:1715 +0x45
net/http.(*conn).serve(0xc004506ab0, {0x25c4158, 0xc003e35530})
	net/http/server.go:2108 +0x665
created by net/http.(*Server).Serve in goroutine 1622
	net/http/server.go:3454 +0x485

Unfortunately, the goroutine of the activation handler had already finished here and is therefore not part of this stack trace.

In the meantime we have also upgraded to Knative v1.18.1 and Go 1.24.x, but the issue persists.

norman465 avatar Aug 07 '25 08:08 norman465