Knative Timeout Issues with Long-Running Requests
In what area(s)?
/area autoscale
What version of Knative?
1.13.1
Expected Behavior
The Knative Service pod and queue-proxy respect the revision timeout that is passed in.
Actual Behavior
The Knative queue-proxy throws this error:
{"severity":"ERROR","timestamp":"2024-06-25T23:47:00.022775794Z","logger":"queueproxy","caller":"network/error_handler.go:33","message":"error reverse proxying request; sockstat: sockets: used 13\nTCP: inuse 3 orphan 4 tw 14 alloc 2483 mem 309\nUDP: inuse 0 mem 8\nUDPLITE: inuse 0\nRAW: inuse 0\nFRAG: inuse 0 memory 0\n","commit":"41769de","knative.dev/key":"dxcm/plugin-github-00008","knative.dev/pod":"plugin-github-00008-deployment-6749cf875b-v9dhz","error":"context canceled","stacktrace":"knative.dev/pkg/network.ErrorHandler.func1\n\tknative.dev/[email protected]/network/error_handler.go:33\nnet/http/httputil.(*ReverseProxy).ServeHTTP\n\tnet/http/httputil/reverseproxy.go:475\nknative.dev/serving/pkg/queue.(*appRequestMetricsHandler).ServeHTTP\n\tknative.dev/serving/pkg/queue/request_metric.go:199\nknative.dev/serving/pkg/queue/sharedmain.mainHandler.ProxyHandler.func3\n\tknative.dev/serving/pkg/queue/handler.go:76\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2136\nknative.dev/serving/pkg/queue/sharedmain.mainHandler.ForwardedShimHandler.func4\n\tknative.dev/serving/pkg/queue/forwarded_shim.go:54\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2136\nknative.dev/serving/pkg/http/handler.(*timeoutHandler).ServeHTTP.func4\n\tknative.dev/serving/pkg/http/handler/timeout.go:118"}
This causes an endless number of retries, resulting in:
plugin-github-00008-deployment-6749cf875b-jm55n 1/2 Terminating 0 123m
plugin-github-00008-deployment-6749cf875b-lbpsf 1/2 Terminating 0 5h16m
plugin-github-00008-deployment-6749cf875b-qxnhz 1/2 Terminating 0 157m
plugin-github-00008-deployment-6749cf875b-rpm5g 1/2 Terminating 0 3h58m
plugin-github-00008-deployment-6749cf875b-v9dhz 2/2 Running 0 15m
that is, a bunch of pods stuck in Terminating that have to be manually deleted.
Steps to Reproduce the Problem
Apply the following configmap:
apiVersion: v1
data:
  enable-service-links: "false"
  max-revision-timeout-seconds: "604800"
  revision-idle-timeout-seconds: "0"
  revision-response-start-timeout-seconds: "600"
  revision-timeout-seconds: "18500"
kind: ConfigMap
metadata:
  annotations:
    knative.dev/example-checksum: e7973912
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","data":{"enable-service-links":"false","max-revision-timeout-seconds":"604800","revision-idle-timeout-seconds":"0","revision-response-start-timeout-seconds":"600","revision-timeout-seconds":"18500"},"kind":"ConfigMap","metadata":{"annotations":{"knative.dev/example-checksum":"e7973912"},"creationTimestamp":"2023-11-28T16:59:46Z","labels":{"app.kubernetes.io/component":"controller","app.kubernetes.io/name":"knative-serving","app.kubernetes.io/version":"1.13.1"},"name":"config-defaults","namespace":"knative-serving","resourceVersion":"488776669","uid":"2496e813-166c-4756-aa7f-f1a83f946cdd"}}
  creationTimestamp: "2023-11-28T16:59:46Z"
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: knative-serving
    app.kubernetes.io/version: 1.13.1
  name: config-defaults
  namespace: knative-serving
  resourceVersion: "488803732"
  uid: 2496e813-166c-4756-aa7f-f1a83f946cdd
Restart Knative Serving:
kubectl delete pods -n knative-serving --all
Apply a new Knative Service that has a delay in it (a minimal sketch of what I mean by "a delay" follows these steps).
Apply a new Trigger reading from a Broker.
Send a request to that Broker that will activate the Trigger.
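For concreteness, here is a minimal, hypothetical stand-in for the "service with a delay" (this is not the actual github-plugin code, just the shape of it): a plain HTTP handler that sleeps well past ten minutes before writing any response bytes.

// Hypothetical stand-in for the "delay" service: it accepts the request,
// sleeps past the 10-minute mark without writing anything, then responds.
package main

import (
	"log"
	"net/http"
	"os"
	"time"
)

func main() {
	port := os.Getenv("PORT") // Knative injects PORT into the user container
	if port == "" {
		port = "8080"
	}
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		log.Println("beginning to sleep.")
		time.Sleep(15 * time.Minute) // longer than the ~10-minute cutoff observed below
		w.WriteHeader(http.StatusOK)
		_, _ = w.Write([]byte("done sleeping\n"))
	})
	log.Fatal(http.ListenAndServe(":"+port, nil))
}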
logs from service
Defaulted container "github-plugin" out of: github-plugin, queue-proxy
2024/06/26 00:58:16 /var/run/ko
2024/06/26 00:58:16 /var/run/ko
Using config file: /var/run/ko/.config.yaml
time="2024-06-26T00:58:17Z" level=info msg="beginnig to sleep."
time="2024-06-26T01:08:17Z" level=info msg="beginnig to sleep."
logs from queue proxy
{"severity":"ERROR","timestamp":"2024-06-26T01:08:15.397562115Z","logger":"queueproxy","caller":"network/error_handler.go:33","message":"error reverse proxying request; sockstat: sockets: used 9\nTCP: inuse 1 orphan 2 tw 12 alloc 207 mem 9\nUDP: inuse 0 mem 0\nUDPLITE: inuse 0\nRAW: inuse 0\nFRAG: inuse 0 memory 0\n","commit":"41769de","knative.dev/key":"dxcm/plugin-github-test-00001","knative.dev/pod":"plugin-github-test-00001-deployment-76cd4695bf-nfzch","error":"context canceled","stacktrace":"knative.dev/pkg/network.ErrorHandler.func1\n\tknative.dev/[email protected]/network/error_handler.go:33\nnet/http/httputil.(*ReverseProxy).ServeHTTP\n\tnet/http/httputil/reverseproxy.go:475\nknative.dev/serving/pkg/queue.(*appRequestMetricsHandler).ServeHTTP\n\tknative.dev/serving/pkg/queue/request_metric.go:199\nknative.dev/serving/pkg/queue/sharedmain.mainHandler.ProxyHandler.func3\n\tknative.dev/serving/pkg/queue/handler.go:76\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2136\nknative.dev/serving/pkg/queue/sharedmain.mainHandler.ForwardedShimHandler.func4\n\tknative.dev/serving/pkg/queue/forwarded_shim.go:54\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2136\nknative.dev/serving/pkg/http/handler.(*timeoutHandler).ServeHTTP.func4\n\tknative.dev/serving/pkg/http/handler/timeout.go:118"}
Timeout difference: 2024-06-26T00:58:17Z vs 2024-06-26T01:08:15.397562115Z, just under 10 minutes, indicating a 10-minute (600-second) timeout.
error from activator:
{"severity":"ERROR","timestamp":"2024-06-26T01:08:15.397202332Z","logger":"activator","caller":"network/error_handler.go:33","message":"error reverse proxying request; sockstat: sockets: used 146\nTCP: inuse 8 orphan 2 tw 99 alloc 2264 mem 285\nUDP: inuse 0 mem 8\nUDPLITE: inuse 0\nRAW: inuse 0\nFRAG: inuse 0 memory 0\n","commit":"41769de","knative.dev/controller":"activator","knative.dev/pod":"activator-b7789d69d-lcwb7","knative.dev/key":"dxcm/plugin-github-test-00001","error":"context canceled","stacktrace":"knative.dev/pkg/network.ErrorHandler.func1\n\tknative.dev/[email protected]/network/error_handler.go:33\nknative.dev/serving/pkg/activator/handler.(*activationHandler).proxyRequest.func1\n\tknative.dev/serving/pkg/activator/handler/handler.go:141\nnet/http/httputil.(*ReverseProxy).ServeHTTP\n\tnet/http/httputil/reverseproxy.go:475\nknative.dev/serving/pkg/activator/handler.(*activationHandler).proxyRequest\n\tknative.dev/serving/pkg/activator/handler/handler.go:144\nknative.dev/serving/pkg/activator/handler.(*activationHandler).ServeHTTP.func1\n\tknative.dev/serving/pkg/activator/handler/handler.go:97\nknative.dev/serving/pkg/activator/net.(*revisionThrottler).try.func1\n\tknative.dev/serving/pkg/activator/net/throttler.go:237\nknative.dev/serving/pkg/activator/net.(*infiniteBreaker).Maybe\n\tknative.dev/serving/pkg/activator/net/throttler.go:767\nknative.dev/serving/pkg/activator/net.(*revisionThrottler).try\n\tknative.dev/serving/pkg/activator/net/throttler.go:227\nknative.dev/serving/pkg/activator/net.(*Throttler).Try\n\tknative.dev/serving/pkg/activator/net/throttler.go:522\nknative.dev/serving/pkg/activator/handler.(*activationHandler).ServeHTTP\n\tknative.dev/serving/pkg/activator/handler/handler.go:90\nknative.dev/serving/pkg/http/handler.(*timeoutHandler).ServeHTTP.func4\n\tknative.dev/serving/pkg/http/handler/timeout.go:118"}
activator termination
terminationGracePeriodSeconds: 604800
The line of code in the activator that starts the timeout ("activator request timeout") seems to be referencing https://github.com/knative/serving/blob/main/pkg/apis/config/defaults.go#L45 (the max timeout) instead of the max timeout I specified.
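For what it's worth, the failure mode (both queue-proxy and activator reporting "context canceled" roughly 600 seconds in, while the app is still sleeping) is what you would expect if a shorter first-byte/response-start deadline cancels the request context before the larger overall timeout ever matters. A rough sketch of that pattern follows; it is not the actual Knative code, and the 600s value is my assumption taken from revision-response-start-timeout-seconds above.

// Sketch: a reverse proxy whose per-request context is governed by two
// deadlines; the shorter one cancels the context first, and the proxy then
// fails with "context canceled" even though the longer timeout is huge.
package main

import (
	"context"
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"time"
)

func main() {
	// Hypothetical upstream standing in for the user container.
	target, err := url.Parse("http://127.0.0.1:8080")
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(target)
	proxy.ErrorHandler = func(w http.ResponseWriter, r *http.Request, err error) {
		// With a canceled context this prints "context canceled",
		// matching the queue-proxy and activator logs above.
		log.Printf("error reverse proxying request: %v", err)
		w.WriteHeader(http.StatusBadGateway)
	}

	overallTimeout := 18500 * time.Second // what timeoutSeconds asks for
	firstByteTimeout := 600 * time.Second // assumption: the response-start deadline

	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Whichever deadline is shorter wins; in this sketch a 600s first-byte
		// deadline means the 18500s overall timeout never gets a chance to apply.
		timeout := overallTimeout
		if firstByteTimeout < timeout {
			timeout = firstByteTimeout
		}
		ctx, cancel := context.WithTimeout(r.Context(), timeout)
		defer cancel()
		proxy.ServeHTTP(w, r.WithContext(ctx))
	})

	log.Fatal(http.ListenAndServe(":8012", handler))
}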
More Validation
My knative service spec:
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  annotations:
  creationTimestamp: "2024-06-25T02:32:13Z"
  generation: 8
  labels:
    app.fianu.io/type: occurrence
    app.kubernetes.io/component: adapter
    app.kubernetes.io/instance: github
    app.kubernetes.io/managed-by: fianu-controller
    app.kubernetes.io/name: plugin
    app.kubernetes.io/part-of: fianu
    argocd.argoproj.io/compare-options: IgnoreExtraneous
    destinations.fianu.io/broker: fianu-default
    networking.knative.dev/visibility: cluster-local
  name: plugin-github-test
  namespace: dxcm
spec:
  template:
    metadata:
      annotations:
        argocd.argoproj.io/compare-options: IgnoreExtraneous
        sidecar.istio.io/proxyCPU: 50m
        sidecar.istio.io/proxyCPULimit: 150m
        sidecar.istio.io/proxyMemory: 128Mi
        sidecar.istio.io/proxyMemoryLimit: 256Mi
      creationTimestamp: null
    spec:
      containerConcurrency: 0
      containers:
      - env:
        - name: FIANU_PLATFORM_SEMANTIC_VERSION
          valueFrom:
            configMapKeyRef:
              key: fianu_semantic_version
              name: fianu-hosts
        envFrom:
        - configMapRef:
            name: fianu-hosts-env
        image: docker.io
        imagePullPolicy: Always
        name: github-plugin
        readinessProbe:
          successThreshold: 1
          tcpSocket:
            port: 0
        resources: {}
        securityContext: {}
      responseStartTimeoutSeconds: 600
      serviceAccountName: github
      timeoutSeconds: 18500
timeoutSeconds doesn't seem to be respected.
I also built the activator from source and added logging. This is what I added:
log.Println("making request with values ", time.Duration(*rev.Spec.TimeoutSeconds)*time.Second, responseStartTimeout, idleTimeout) // prints the revision timeout, response-start timeout, and idle timeout
From Broker:
2024/06/26 01:46:14 makign request with values 168h0m0s 10m0s 0s
With curl:
2024/06/26 01:47:53 makign request with values 168h0m0s 10m0s 0s
So it looks like the timeout is being set properly from spec.TimeoutSeconds.
My only remaining thought is that maybe my delivery timeout is not being respected:
spec:
  broker: default
  delivery:
    timeout: PT2H
    retry: 1
  filter:
    attributes:
      source: ci.commit.codereview.test
      type: attestation.test
  subscriber:
    ref:
      apiVersion: serving.knative.dev/v1
      kind: Service
      name: plugin-github-test
      namespace: test
But this does not make sense, because I can reproduce it with curl:
# curl --location 'http://plugin-github-test.dxcm.svc.cluster.local' \
--header 'Ce-Id: replay-189098f' \
--header 'Ce-Specversion: 1.0' \
--header 'Ce-Type: attestation.test' \
--header 'Ce-Source: ci.commit.codereview.test' \
--header 'Content-Type: application/json' \
--data '{}'
activator request timeout
This request wouldn't even go through Kafka, and I get the same timeout (the response body is just "activator request timeout").
Is there something I need to do on the Kafka side?