
Knative Timeout Issues with Long-Running Requests

Open nkreiger opened this issue 7 months ago • 8 comments

In what area(s)?

/area autoscale

What version of Knative?

1.13.1

Expected Behavior

The Knative Service pod and queue-proxy respect the revision timeout passed in the Service spec.

Actual Behavior

The Knative queue-proxy throws an error:

{"severity":"ERROR","timestamp":"2024-06-25T23:47:00.022775794Z","logger":"queueproxy","caller":"network/error_handler.go:33","message":"error reverse proxying request; sockstat: sockets: used 13\nTCP: inuse 3 orphan 4 tw 14 alloc 2483 mem 309\nUDP: inuse 0 mem 8\nUDPLITE: inuse 0\nRAW: inuse 0\nFRAG: inuse 0 memory 0\n","commit":"41769de","knative.dev/key":"dxcm/plugin-github-00008","knative.dev/pod":"plugin-github-00008-deployment-6749cf875b-v9dhz","error":"context canceled","stacktrace":"knative.dev/pkg/network.ErrorHandler.func1\n\tknative.dev/[email protected]/network/error_handler.go:33\nnet/http/httputil.(*ReverseProxy).ServeHTTP\n\tnet/http/httputil/reverseproxy.go:475\nknative.dev/serving/pkg/queue.(*appRequestMetricsHandler).ServeHTTP\n\tknative.dev/serving/pkg/queue/request_metric.go:199\nknative.dev/serving/pkg/queue/sharedmain.mainHandler.ProxyHandler.func3\n\tknative.dev/serving/pkg/queue/handler.go:76\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2136\nknative.dev/serving/pkg/queue/sharedmain.mainHandler.ForwardedShimHandler.func4\n\tknative.dev/serving/pkg/queue/forwarded_shim.go:54\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2136\nknative.dev/serving/pkg/http/handler.(*timeoutHandler).ServeHTTP.func4\n\tknative.dev/serving/pkg/http/handler/timeout.go:118"}

This causes an endless stream of retries, resulting in:

plugin-github-00008-deployment-6749cf875b-jm55n                   1/2     Terminating         0               123m
plugin-github-00008-deployment-6749cf875b-lbpsf                   1/2     Terminating         0               5h16m
plugin-github-00008-deployment-6749cf875b-qxnhz                   1/2     Terminating         0               157m
plugin-github-00008-deployment-6749cf875b-rpm5g                   1/2     Terminating         0               3h58m
plugin-github-00008-deployment-6749cf875b-v9dhz                   2/2     Running             0               15m
# curl --location 'http://plugin-github-test.dxcm.svc.cluster.local' \
--header 'Ce-Id: replay-189098f' \
--header 'Ce-Specversion: 1.0' \
--header 'Ce-Type: attestation.test' \
--header 'Ce-Source: ci.commit.codereview.test' \
--header 'Content-Type: application/json' \
--data '{}'
activator request timeout

and a pile of pods that have to be deleted manually.

Steps to Reproduce the Problem

Apply the following configmap:

apiVersion: v1
data:
  enable-service-links: "false"
  max-revision-timeout-seconds: "604800"
  revision-idle-timeout-seconds: "0"
  revision-response-start-timeout-seconds: "600"
  revision-timeout-seconds: "18500"
kind: ConfigMap
metadata:
  annotations:
    knative.dev/example-checksum: e7973912
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","data":{"enable-service-links":"false","max-revision-timeout-seconds":"604800","revision-idle-timeout-seconds":"0","revision-response-start-timeout-seconds":"600","revision-timeout-seconds":"18500"},"kind":"ConfigMap","metadata":{"annotations":{"knative.dev/example-checksum":"e7973912"},"creationTimestamp":"2023-11-28T16:59:46Z","labels":{"app.kubernetes.io/component":"controller","app.kubernetes.io/name":"knative-serving","app.kubernetes.io/version":"1.13.1"},"name":"config-defaults","namespace":"knative-serving","resourceVersion":"488776669","uid":"2496e813-166c-4756-aa7f-f1a83f946cdd"}}
  creationTimestamp: "2023-11-28T16:59:46Z"
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: knative-serving
    app.kubernetes.io/version: 1.13.1
  name: config-defaults
  namespace: knative-serving
  resourceVersion: "488803732"
  uid: 2496e813-166c-4756-aa7f-f1a83f946cdd

Restart Knative Serving:

kubectl delete pods -n knative-serving --all

Apply a new Knative Service that has a delay (a long sleep) in its handler.

Apply a new Trigger reading from a Broker.

Send a request to that Broker, which will activate the Trigger:

Logs from the service:

Defaulted container "github-plugin" out of: github-plugin, queue-proxy
2024/06/26 00:58:16 /var/run/ko
2024/06/26 00:58:16 /var/run/ko
Using config file: /var/run/ko/.config.yaml
time="2024-06-26T00:58:17Z" level=info msg="beginning to sleep."
time="2024-06-26T01:08:17Z" level=info msg="beginning to sleep."

Logs from the queue-proxy:

{"severity":"ERROR","timestamp":"2024-06-26T01:08:15.397562115Z","logger":"queueproxy","caller":"network/error_handler.go:33","message":"error reverse proxying request; sockstat: sockets: used 9\nTCP: inuse 1 orphan 2 tw 12 alloc 207 mem 9\nUDP: inuse 0 mem 0\nUDPLITE: inuse 0\nRAW: inuse 0\nFRAG: inuse 0 memory 0\n","commit":"41769de","knative.dev/key":"dxcm/plugin-github-test-00001","knative.dev/pod":"plugin-github-test-00001-deployment-76cd4695bf-nfzch","error":"context canceled","stacktrace":"knative.dev/pkg/network.ErrorHandler.func1\n\tknative.dev/[email protected]/network/error_handler.go:33\nnet/http/httputil.(*ReverseProxy).ServeHTTP\n\tnet/http/httputil/reverseproxy.go:475\nknative.dev/serving/pkg/queue.(*appRequestMetricsHandler).ServeHTTP\n\tknative.dev/serving/pkg/queue/request_metric.go:199\nknative.dev/serving/pkg/queue/sharedmain.mainHandler.ProxyHandler.func3\n\tknative.dev/serving/pkg/queue/handler.go:76\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2136\nknative.dev/serving/pkg/queue/sharedmain.mainHandler.ForwardedShimHandler.func4\n\tknative.dev/serving/pkg/queue/forwarded_shim.go:54\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2136\nknative.dev/serving/pkg/http/handler.(*timeoutHandler).ServeHTTP.func4\n\tknative.dev/serving/pkg/http/handler/timeout.go:118"}

Timeout difference: 2024-06-26T00:58:17Z vs 2024-06-26T01:08:15.397562115Z

a gap of just under 10 minutes, indicating a 10-minute timeout.

Error from the activator:

{"severity":"ERROR","timestamp":"2024-06-26T01:08:15.397202332Z","logger":"activator","caller":"network/error_handler.go:33","message":"error reverse proxying request; sockstat: sockets: used 146\nTCP: inuse 8 orphan 2 tw 99 alloc 2264 mem 285\nUDP: inuse 0 mem 8\nUDPLITE: inuse 0\nRAW: inuse 0\nFRAG: inuse 0 memory 0\n","commit":"41769de","knative.dev/controller":"activator","knative.dev/pod":"activator-b7789d69d-lcwb7","knative.dev/key":"dxcm/plugin-github-test-00001","error":"context canceled","stacktrace":"knative.dev/pkg/network.ErrorHandler.func1\n\tknative.dev/[email protected]/network/error_handler.go:33\nknative.dev/serving/pkg/activator/handler.(*activationHandler).proxyRequest.func1\n\tknative.dev/serving/pkg/activator/handler/handler.go:141\nnet/http/httputil.(*ReverseProxy).ServeHTTP\n\tnet/http/httputil/reverseproxy.go:475\nknative.dev/serving/pkg/activator/handler.(*activationHandler).proxyRequest\n\tknative.dev/serving/pkg/activator/handler/handler.go:144\nknative.dev/serving/pkg/activator/handler.(*activationHandler).ServeHTTP.func1\n\tknative.dev/serving/pkg/activator/handler/handler.go:97\nknative.dev/serving/pkg/activator/net.(*revisionThrottler).try.func1\n\tknative.dev/serving/pkg/activator/net/throttler.go:237\nknative.dev/serving/pkg/activator/net.(*infiniteBreaker).Maybe\n\tknative.dev/serving/pkg/activator/net/throttler.go:767\nknative.dev/serving/pkg/activator/net.(*revisionThrottler).try\n\tknative.dev/serving/pkg/activator/net/throttler.go:227\nknative.dev/serving/pkg/activator/net.(*Throttler).Try\n\tknative.dev/serving/pkg/activator/net/throttler.go:522\nknative.dev/serving/pkg/activator/handler.(*activationHandler).ServeHTTP\n\tknative.dev/serving/pkg/activator/handler/handler.go:90\nknative.dev/serving/pkg/http/handler.(*timeoutHandler).ServeHTTP.func4\n\tknative.dev/serving/pkg/http/handler/timeout.go:118"}

Activator termination:

terminationGracePeriodSeconds: 604800

Line of code in the activator that starts the timeout: activator request timeout

It seems to be referencing https://github.com/knative/serving/blob/main/pkg/apis/config/defaults.go#L45

(the default max timeout) instead of the max timeout I specified.

More Validation

My Knative Service spec:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  annotations:
  creationTimestamp: "2024-06-25T02:32:13Z"
  generation: 8
  labels:
    app.fianu.io/type: occurrence
    app.kubernetes.io/component: adapter
    app.kubernetes.io/instance: github
    app.kubernetes.io/managed-by: fianu-controller
    app.kubernetes.io/name: plugin
    app.kubernetes.io/part-of: fianu
    argocd.argoproj.io/compare-options: IgnoreExtraneous
    destinations.fianu.io/broker: fianu-default
    networking.knative.dev/visibility: cluster-local
  name: plugin-github-test
  namespace: dxcm
spec:
  template:
    metadata:
      annotations:
        argocd.argoproj.io/compare-options: IgnoreExtraneous
        sidecar.istio.io/proxyCPU: 50m
        sidecar.istio.io/proxyCPULimit: 150m
        sidecar.istio.io/proxyMemory: 128Mi
        sidecar.istio.io/proxyMemoryLimit: 256Mi
      creationTimestamp: null
    spec:
      containerConcurrency: 0
      containers:
      - env:
        - name: FIANU_PLATFORM_SEMANTIC_VERSION
          valueFrom:
            configMapKeyRef:
              key: fianu_semantic_version
              name: fianu-hosts
        - configMapRef:
            name: fianu-hosts-env
        image: docker.io
        imagePullPolicy: Always
        name: github-plugin
        readinessProbe:
          successThreshold: 1
          tcpSocket:
            port: 0
        resources: {}
        securityContext: {}
      responseStartTimeoutSeconds: 600
      serviceAccountName: github
      timeoutSeconds: 18500

timeoutSeconds doesn't seem to be respected.


I also built the activator from source and added logging. This is what I added:

log.Println("making request with values ", time.Duration(*rev.Spec.TimeoutSeconds)*time.Second, responseStartTimeout, idleTimeout)

From Broker:

2024/06/26 01:46:14 making request with values  168h0m0s 10m0s 0s

With Curl

2024/06/26 01:47:53 making request with values  168h0m0s 10m0s 0s

So it looks like the timeout is being set properly from spec.TimeoutSeconds.


My only remaining thought is that my delivery timeout is not being respected:

spec:
  broker: default
  delivery:
    timeout: PT2H
    retry: 1
  filter:
    attributes:
      source: ci.commit.codereview.test
      type: attestation.test
  subscriber:
    ref:
      apiVersion: serving.knative.dev/v1
      kind: Service
      name: plugin-github-test
      namespace: test

But this does not make sense, because I can reproduce it with curl:

# curl --location 'http://plugin-github-test.dxcm.svc.cluster.local' \
--header 'Ce-Id: replay-189098f' \
--header 'Ce-Specversion: 1.0' \
--header 'Ce-Type: attestation.test' \
--header 'Ce-Source: ci.commit.codereview.test' \
--header 'Content-Type: application/json' \
--data '{}'
activator request timeout

which wouldn't even go through Kafka, and I get the same timeout.

Is there something I need to do on the Kafka side?

nkreiger Jun 26 '24 01:06