
Random timeout errors in queue proxy

Open littlekittel opened this issue 4 months ago • 24 comments

Describe the bug
We are using Knative Serving in many services and have processed millions of events without issue. Lately we are seeing a weird behaviour where the queue proxy logs timeouts. The application container itself is not under heavy load and can handle thousands of requests. From the user-container's perspective the request never arrives; it appears to get stuck somewhere before reaching the user-container.

Under normal conditions the latency is below 0.01s. Suddenly the latency starts to increase to several seconds and beyond until timeouts fire, at seemingly random times. The logs inside the user-container show no issues, errors, or longer processing times.

After restarting the pod everything works fine for some minutes or hours until this behaviour starts again.

Expected behavior
There should not be any delay introduced between the queue proxy and the user-container.

To Reproduce
We cannot find a pattern to reproduce it.

Knative release version v1.32.6-eks-931bdca

queue-proxy.log

Additional context
Logs from the queue proxy:

2025-08-29T09:48:57.328736105Z {"severity":"ERROR","timestamp":"2025-08-29T09:48:57.328515697Z","logger":"queueproxy","caller":"network/error_handler.go:33","message":"error reverse proxying request; sockstat: sockets: used 18\nTCP: inuse 3 orphan 0 tw 19 alloc 62 mem 12\nUDP: inuse 0 mem 1\nUDPLITE: inuse 0\nRAW: inuse 0\nFRAG: inuse 0 memory 0\n","commit":"4853ead","knative.dev/key":"docs/docs-filter-cx-helm-ksvc-00010","knative.dev/pod":"docs-filter-cx-helm-ksvc-00010-deployment-88bfcb487-8rr4c","error":"EOF","stacktrace":"knative.dev/pkg/network.ErrorHandler.func1\n\tknative.dev/[email protected]/network/error_handler.go:33\nnet/http/httputil.(*ReverseProxy).ServeHTTP\n\tnet/http/httputil/reverseproxy.go:486\nknative.dev/serving/pkg/queue.(*appRequestMetricsHandler).ServeHTTP\n\tknative.dev/serving/pkg/queue/request_metric.go:201\nknative.dev/serving/pkg/queue/sharedmain.mainHandler.ProxyHandler.func3\n\tknative.dev/serving/pkg/queue/handler.go:76\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2294\nknative.dev/serving/pkg/queue/sharedmain.mainHandler.ForwardedShimHandler.func4\n\tknative.dev/serving/pkg/queue/forwarded_shim.go:54\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2294\nknative.dev/serving/pkg/http/handler.(*timeoutHandler).ServeHTTP.func4\n\tknative.dev/serving/pkg/http/handler/timeout.go:118"}

2025-08-29T09:48:57.329030117Z {"httpRequest": {"requestMethod": "POST", "requestUrl": "/", "requestSize": "709", "status": 502, "responseSize": "4", "userAgent": "Vert.x-WebClient/4.5.14", "remoteIp": "10.0.63.225:53338", "serverIp": "10.0.121.80", "referer": "", "latency": "68.097221185s", "protocol": "HTTP/1.1"}, "traceId": "[]"}

littlekittel avatar Aug 29 '25 15:08 littlekittel

@littlekittel what version of Knative are you using?

Do you have any logging to know if the request is being cancelled on the client side?

dprotaso avatar Aug 29 '25 16:08 dprotaso

@dprotaso Server Version: v1.32.6

We do not see any logging on the client side, even though the first thing we do is log the incoming request. For the requests that are processed with a 200 but with a large delay of >20s, we also only see the logs in the client after 20 seconds, and then it processes within <100 ms. So it somehow looks like the request is stuck before reaching the client.

littlekittel avatar Aug 29 '25 16:08 littlekittel

Server Version: v1.32.6

That's your K8s version - what's your Knative version?

And how have you configured your timeouts on your Knative Service?

dprotaso avatar Aug 29 '25 16:08 dprotaso

sorry, 1.18.1

- name: REVISION_TIMEOUT_SECONDS
  value: '300'

and inside the application:

return &http.Server{
	Addr:         fmt.Sprintf(":%d", cfg.Port),
	Handler:      mux,
	ReadTimeout:  30 * time.Second,
	WriteTimeout: 30 * time.Second,
}
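One detail that may or may not matter: we don't set IdleTimeout, and when http.Server's IdleTimeout is zero Go falls back to ReadTimeout for keep-alive connections, so idle connections to this server get closed after 30s. Here is a self-contained sketch (not our production code; the port, handler, and 120s value are just illustrative) with it set explicitly:

package main

import (
	"fmt"
	"log"
	"net/http"
	"time"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	})

	srv := &http.Server{
		Addr:         ":8080",
		Handler:      mux,
		ReadTimeout:  30 * time.Second,
		WriteTimeout: 30 * time.Second,
		// With IdleTimeout unset, net/http falls back to ReadTimeout (30s here)
		// for closing idle keep-alive connections; setting it explicitly makes
		// the keep-alive window a deliberate choice.
		IdleTimeout: 120 * time.Second, // illustrative value
	}
	log.Fatal(srv.ListenAndServe())
}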

littlekittel avatar Aug 29 '25 17:08 littlekittel

Some more analysis: the logs make it look like the whole setup is not processing anything for some amount of time. Here are two logs from the queue proxy under constant load:

2025-08-29T17:54:12.881043330Z {"httpRequest": {"requestMethod": "POST", "requestUrl": "/", "requestSize": "707", "status": 200, "responseSize": "707", "userAgent": "Vert.x-WebClient/4.5.14", "remoteIp": "10.0.37.191:42806", "serverIp": "10.0.41.7", "referer": "", "latency": "0.661045416s", "protocol": "HTTP/1.1"}, "traceId": "[]"}

2025-08-29T17:54:19.610808630Z {"httpRequest": {"requestMethod": "POST", "requestUrl": "/", "requestSize": "707", "status": 200, "responseSize": "707", "userAgent": "Vert.x-WebClient/4.5.14", "remoteIp": "10.0.37.191:40580", "serverIp": "10.0.41.7", "referer": "", "latency": "6.730683193s", "protocol": "HTTP/1.1"}, "traceId": "[]"}

There is a gap of almost 7 seconds even though the system is under constant load.

The logs in the application itself show the same gap:

2025-08-29T17:54:12.880785129Z {"level":"info","ts":1756490052.8806717,"caller":"filter/main.go:155","msg":"Event processed successfully","input_type":"com.amazon.events.event","output_type":"com.amazon.s3.objectcreated.sortedout"}

2025-08-29T17:54:19.610500132Z {"level":"info","ts":1756490059.6103704,"caller":"processor/event_processor.go:112","msg":"Processing S3 event","request_id":"a66671ff-f112-29a0-a574-6bb8f7630570"}

littlekittel avatar Aug 29 '25 18:08 littlekittel

What's your container concurrency set to? You might be hitting that limit and requests are being queued.

dprotaso avatar Aug 29 '25 18:08 dprotaso

- name: CONTAINER_CONCURRENCY
  value: '0'

We benchmarked the application itself. It is a very lightweight application and can handle over 1000 requests per second. The logs are from an experiment where we have < 10 requests per second.

littlekittel avatar Aug 29 '25 18:08 littlekittel

Hmm, I don't have any working theories. The logging that's showing the large latency includes our handlers, so I can't really rule those out.

Since you don't have a sample app - are you comfortable building your own queue proxy and adding some of your own debug logging?

You can override the default queue proxy by setting this config map value to your image here

https://github.com/knative/serving/blob/5832a442421a7f0b81a065824b489ca91ed131be/config/core/configmaps/deployment.yaml#L29

Likewise, if you can add logging to your app that shows its processing time, that would be useful for ruling it out.

dprotaso avatar Aug 29 '25 18:08 dprotaso

One thing to note: you'll want to check out the release-1.18 branch when adding your instrumentation.

dprotaso avatar Aug 29 '25 18:08 dprotaso

Thank you, we will look into it. Just wanted to share one more log where things got really crazy. For almost 2 hours we saw almost nothing but 300s timeouts. During this time there was very little load on the service.

queue-proxy.log

littlekittel avatar Aug 29 '25 19:08 littlekittel

During this time there was very little load on the service.

It's hard to say what's going on - the 300s is clearly the revision's timeout, so it's behaving as expected if the request wasn't processed. (Note: I have another issue to remove that error log when this happens, or maybe turn it into a warning.)

What's your user-container app doing? I'm wondering if it's something to do with the interaction there.

dprotaso avatar Aug 29 '25 20:08 dprotaso

The application is a simple filter that evaluates the "s3_key" inside the event and performs some regex on it. Under "normal" conditions the latency is very low. Here is the histogram from the queue proxy metrics endpoint:

[Image: latency histogram from the queue-proxy metrics endpoint]

There is no cpu heavy operation or any database connection etc. that could lead to a delay.

littlekittel avatar Aug 29 '25 20:08 littlekittel

This may not be the same problem, but we traced our timeouts to queue-proxy reusing a stale keep-alive connection to our backend (uvicorn/FastAPI). After ~5s of idle time the backend closes the socket; the first request after the idle period rides that half-dead connection, hangs for ~120s, and the app never sees it.

We increased timeout_keep_alive to 300s on the uvicorn app and now it works.
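For illustration only (a sketch of the race, not queue-proxy's actual transport settings): a reverse proxy that keeps idle connections around longer than the backend's keep-alive window will occasionally reuse a socket the backend has already closed. Keeping the client-side idle timeout below the backend's keep-alive is the mirror-image fix to raising timeout_keep_alive; the port and the 5s/3s values are assumptions for the sketch.

package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"time"
)

func main() {
	// Hypothetical backend that closes idle keep-alive sockets after ~5s,
	// like uvicorn's default timeout_keep_alive.
	backend, err := url.Parse("http://127.0.0.1:5000")
	if err != nil {
		log.Fatal(err)
	}

	proxy := httputil.NewSingleHostReverseProxy(backend)
	proxy.Transport = &http.Transport{
		// Drop idle connections before the backend does, so the first request
		// after a quiet period opens a fresh connection instead of riding a
		// half-closed one.
		IdleConnTimeout:     3 * time.Second,
		MaxIdleConnsPerHost: 16,
	}

	log.Fatal(http.ListenAndServe(":8081", proxy))
}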

alita-moore avatar Nov 10 '25 05:11 alita-moore

@alita-moore wow good find - do you have a small repro app I could test with?

dprotaso avatar Nov 10 '25 13:11 dprotaso

Unfortunately not, but it should be easy enough to repro; the timeout was set to 5s when the error was happening.

alita-moore avatar Nov 10 '25 14:11 alita-moore

@alita-moore I'm not familiar with Python/uvicorn, so if you vibe-coded a simple helloworld app with the timeout set so that it triggers this bug, that would be useful.

dprotaso avatar Nov 10 '25 14:11 dprotaso

maybe this will help? https://replit.com/join/jwliypfaqc-teamwoven

Basically you just need to run pip install fastapi uvicorn and then python main.py, with a main.py file containing:

from fastapi import FastAPI
import uvicorn

app = FastAPI()

@app.get("/")
def read_root():
    return {"message": "Hello World"}

if __name__ == "__main__":
    uvicorn.run(
        app,
        host="0.0.0.0",
        port=5000,
        timeout_keep_alive=5
    )

If you package this into an image and deploy it as a Knative Service, it may replicate the behavior. Basically: make a request, wait 5 seconds, then make another request.
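If it helps, here's a tiny client sketch of that request pattern (the URL is a placeholder for the Service's route; the values are assumptions). The second call is the one that would ride the connection the backend already closed:

package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"time"
)

func call(client *http.Client, url string) {
	resp, err := client.Get(url)
	if err != nil {
		log.Println("request error:", err)
		return
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(body))
}

func main() {
	url := "http://hello.default.example.com/" // placeholder for the ksvc route
	client := &http.Client{}                   // default transport reuses idle keep-alive connections

	call(client, url)           // warms up a keep-alive connection to the backend
	time.Sleep(6 * time.Second) // idle past the backend's 5s timeout_keep_alive
	call(client, url)           // may reuse the socket the backend has already closed
}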

alita-moore avatar Nov 11 '25 12:11 alita-moore

@alita-moore I tried that sample app and couldn't reproduce the problem.

What version of Knative were you using? I'm testing with the latest release (1.20)

dprotaso avatar Nov 19 '25 17:11 dprotaso

Is there anything notable about your kservice/revision yaml and/or installation setup?

dprotaso avatar Nov 19 '25 18:11 dprotaso

Currently v1.16.0

alita-moore avatar Nov 19 '25 18:11 alita-moore

I use Flux to manage the cluster; I'm using the following kustomization:

resources:
  - https://github.com/knative/serving/releases/download/knative-v1.16.0/serving-core.yaml
  - https://github.com/knative/serving/releases/download/knative-v1.16.0/serving-hpa.yaml
  # NOTE: if you fail to include dns handling the kourier service will return a 404 for all requests
  - https://github.com/knative/serving/releases/download/knative-v1.16.0/serving-default-domain.yaml
  # NOTE: if you fail to include the net-istio service the service will constantly reset the connection
  - https://github.com/knative/net-istio/releases/download/knative-v1.16.0/net-istio.yaml
  - security.yaml
patchesStrategicMerge:
  - patch-config-features.yaml
patches:
  - target:
      version: v1
      kind: ConfigMap
      name: config-observability
      namespace: knative-serving
    patch: |-
      - op: add
        path: /metadata/annotations/kustomize.toolkit.fluxcd.io~1substitute
        value: "disabled"
resources:
  - https://github.com/knative/serving/releases/download/knative-v1.16.0/serving-crds.yaml

Hopefully this helps. The config on the service, btw, has the following annotations:

proxy.istio.io/config: |
  {"outboundTrafficPolicy":{"mode":"REGISTRY_ONLY"}}
autoscaling.knative.dev/min-scale: "1"
autoscaling.knative.dev/target: "1"
secret.reloader.stakater.com/reload: "do-container-registry"
deployment.knative.dev/progress-deadline: "315360000s"

So the pod itself should have zero external network access. Also, the pod is using around 4GB of memory across the different containers, and a single call can take up to a minute.

That's just some context about the circumstances. Hope that helps.

alita-moore avatar Nov 19 '25 18:11 alita-moore

Also, I have Istio set up with its sidecar on the service, and I allow traffic on the Knative-related ports. Here are the network configs related to blocking network traffic:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: output-transpiler-deny-external-egress
  namespace: ml-serving
spec:
  podSelector:
    matchLabels:
      serving.knative.dev/service: output-transpiler
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ml-serving
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: knative-serving
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: istio-system
    - to:
        - ipBlock:
            cidr: 10.0.0.0/8
        - ipBlock:
            cidr: 172.16.0.0/12
        - ipBlock:
            cidr: 192.168.0.0/16
        - ipBlock:
            cidr: 127.0.0.0/8
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53

alita-moore avatar Nov 19 '25 18:11 alita-moore

Hope that helps

Can you confirm it's still an issue with Knative 1.20?

dprotaso avatar Nov 19 '25 18:11 dprotaso

I'm too busy right now to check, but will let you know whenever we get around to upgrading.

alita-moore avatar Nov 19 '25 18:11 alita-moore