Gateway error when making chat completion requests with p/d disaggregation model
🐛 Describe the bug
When deploying a p/d disaggregation model following this guide: https://aibrix.readthedocs.io/latest/getting_started/quickstart.html#deploy-prefill-decode-pd-disaggregation-model and sending the curl request below, I get the following error:
curl -v http://k8s-envoygat-envoyaib-17b348e7d0-0e5af338e460b06e.elb.us-east-1.amazonaws.com/v1/chat/completions \
-H "routing-strategy: pd" \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-r1-distill-llama-8b",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "help me write a random generator in python"}
],
"temperature": 0.7
}'
* Host k8s-envoygat-envoyaib-17b348e7d0-0e5af338e460b06e.elb.us-east-1.amazonaws.com:80 was resolved.
* IPv6: (none)
* IPv4: 10.0.1.57
* Trying 10.0.1.57:80...
* Connected to k8s-envoygat-envoyaib-17b348e7d0-0e5af338e460b06e.elb.us-east-1.amazonaws.com (10.0.1.57) port 80
* using HTTP/1.x
> POST /v1/chat/completions HTTP/1.1
> Host: k8s-envoygat-envoyaib-17b348e7d0-0e5af338e460b06e.elb.us-east-1.amazonaws.com
> User-Agent: curl/8.11.1
> Accept: */*
> routing-strategy: pd
> Content-Type: application/json
> Content-Length: 249
>
* upload completely sent off: 249 bytes
< HTTP/1.1 502 Bad Gateway
< content-length: 87
< content-type: text/plain
< date: Wed, 13 Aug 2025 15:26:47 GMT
<
* Connection #0 to host k8s-envoygat-envoyaib-17b348e7d0-0e5af338e460b06e.elb.us-east-1.amazonaws.com left intact
upstream connect error or disconnect/reset before headers. reset reason: protocol error
The /v1/models endpoint, however, works:
curl -v http://k8s-envoygat-envoyaib-17b348e7d0-0e5af338e460b06e.elb.us-east-1.amazonaws.com/v1/models
* Host k8s-envoygat-envoyaib-17b348e7d0-0e5af338e460b06e.elb.us-east-1.amazonaws.com:80 was resolved.
* IPv6: (none)
* IPv4: 10.0.1.57
* Trying 10.0.1.57:80...
* Connected to k8s-envoygat-envoyaib-17b348e7d0-0e5af338e460b06e.elb.us-east-1.amazonaws.com (10.0.1.57) port 80
* using HTTP/1.x
> GET /v1/models HTTP/1.1
> Host: k8s-envoygat-envoyaib-17b348e7d0-0e5af338e460b06e.elb.us-east-1.amazonaws.com
> User-Agent: curl/8.11.1
> Accept: */*
>
* Request completely sent off
< HTTP/1.1 200 OK
< content-type: application/json
< date: Wed, 13 Aug 2025 15:27:56 GMT
< content-length: 113
<
* Connection #0 to host k8s-envoygat-envoyaib-17b348e7d0-0e5af338e460b06e.elb.us-east-1.amazonaws.com left intact
{"object":"list","data":[{"id":"deepseek-r1-distill-llama-8b","created":0,"object":"model","owned_by":"aibrix"}]}
In the Envoy proxy logs I can see:
{":authority":"k8s-envoygat-envoyaib-17b348e7d0-0e5af338e460b06e.elb.us-east-1.amazonaws.com","bytes_received":249,"bytes_sent":87,"connection_termination_details":null,"downstream_local_address":"10.0.1.129:10080","downstream_remote_address":"10.0.1.13:50918","duration":33,"method":"POST","protocol":"HTTP/1.1","requested_server_name":null,"response_code":502,"response_code_details":"upstream_reset_before_response_started{protocol_error}","response_flags":"UPE","route_name":"httproute/aibrix-system/aibrix-reserved-router/rule/0/match/0/*","start_time":"2025-08-13T15:28:52.700Z","upstream_cluster":"httproute/aibrix-system/aibrix-reserved-router/rule/0","upstream_host":"10.0.3.250:50052","upstream_local_address":"10.0.1.129:48966","upstream_transport_failure_reason":null,"user-agent":"curl/8.11.1","x-envoy-origin-path":"/v1/chat/completions","x-envoy-upstream-service-time":null,"x-forwarded-for":"10.0.1.13","x-request-id":"6dc72b68-0252-4fec-82c1-8027732fc7b2"}
Looking into the gateway-plugins logs, I can see that the request is passed to the prefill instance:
I0813 15:28:52.731622 1 pd_disaggregation.go:196] "prefill_request_complete" request_id="d0789ace-b858-4909-8321-48c299dfa359"
I0813 15:28:52.731663 1 gateway_req_body.go:90] "request start" requestID="d0789ace-b858-4909-8321-48c299dfa359" requestPath="/v1/chat/completions" model="deepseek-r1-distill-llama-8b" stream=false routingAlgorithm="pd" targetPodIP="10.0.3.160:8000" routingDuration="29.03084m
And indeed the logs from the prefill instance show:
INFO 08-13 08:28:52 [logger.py:43] Received request chatcmpl-1a7b11c88afc466885bac44c09d36357: prompt: '<|begin▁of▁sentence|>You are a helpful assistant.<|User|>help me write a random generator in python<|Assistant|><think>\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=0.95, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
INFO 08-13 08:28:52 [async_llm.py:270] Added request chatcmpl-1a7b11c88afc466885bac44c09d36357.
INFO 08-13 08:29:00 [loggers.py:118] Engine 000: Avg prompt throughput: 1.9 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 42.1%
But no request ever reaches the decode instance. My guess is that the gateway-plugin is returning a bad response to Envoy (but it's just a guess). I tried playing with the request idle time by adding
timeouts:
request: 10s
backendRequest: 2s
to the HTTPRoute object but it doesn't help.
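For reference, the route I edited is the one named in the Envoy access log above (aibrix-reserved-router in aibrix-system). Assuming a default install, it can be opened with:
kubectl -n aibrix-system edit httproute aibrix-reserved-router
with the timeouts block placed under the rule that matches /v1/chat/completions.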
The image I used for the StormService was built using this
Steps to Reproduce
- Install AIBrix (I used helm): https://aibrix.readthedocs.io/latest/getting_started/installation/installation.html#stable-version-using-helm
- Build the image: https://github.com/vllm-project/aibrix/blob/main/samples/disaggregation/vllm/README.md#configuration
- Deploy p/d disaggregation model: https://aibrix.readthedocs.io/latest/getting_started/quickstart.html#deploy-prefill-decode-pd-disaggregation-model
- Run a simple request such as:
curl -v http://${ENDPOINT}/v1/chat/completions \
-H "routing-strategy: pd" \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-r1-distill-llama-8b",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "help me write a random generator in python"}
],
"temperature": 0.7
}'
Expected behavior
The request should succeed instead of returning 502 Bad Gateway.
Environment
- K8s version: v1.33.2-eks-931bdca
- AIBrix version: v0.4.0
- Model deployed: deepseek-r1-distill-llama-8b
@omerap12 We made some improvements to vLLM here: https://github.com/vllm-project/aibrix/pull/1429. I am not 100% sure if that's related. Could you build a new image and update the router?
Please also try to invoke the gateway ClusterIP from inside any of the pods; I'd just like to rule out any NLB or ALB issues.
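For example, something like this from a throwaway pod should bypass the load balancer entirely (the namespace and ClusterIP placeholder below are assumptions; look up the actual gateway Service first):
kubectl -n envoy-gateway-system get svc   # find the Envoy gateway Service and its CLUSTER-IP
kubectl run curl-test --rm -it --restart=Never --image=curlimages/curl --command -- \
curl -v http://<GATEWAY_CLUSTER_IP>/v1/chat/completions \
-H "routing-strategy: pd" \
-H "Content-Type: application/json" \
-d '{"model": "deepseek-r1-distill-llama-8b", "messages": [{"role": "user", "content": "hi"}]}'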
If this still happens, feel free to follow up on Slack.
Will do
I have a similar issue, caused by Envoy timing out after 15s. Here is how Sonnet fixed it:
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyProxy
metadata:
  name: aibrix-custom-proxy-config
  namespace: aibrix-system
  labels:
    app.kubernetes.io/component: aibrix-gateway-plugin
    app.kubernetes.io/managed-by: kubectl
    app.kubernetes.io/name: aibrix
    app.kubernetes.io/version: nightly
spec:
  # Add timeout configuration
  backendDefaults:
    timeout:
      # Increase upstream timeout from 15s to 300s (5 minutes)
      upstream: 300s
[...]
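Note that a custom EnvoyProxy only takes effect if the Gateway or GatewayClass references it via parametersRef. A rough sketch, assuming the AIBrix GatewayClass is named aibrix-eg (check with kubectl get gatewayclass):
kubectl patch gatewayclass aibrix-eg --type=merge -p '{"spec": {"parametersRef": {"group": "gateway.envoyproxy.io", "kind": "EnvoyProxy", "name": "aibrix-custom-proxy-config", "namespace": "aibrix-system"}}}'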
@ghpu Thanks! Did you test that this works? I tested again briefly and it didn't help, so I just want to make sure it indeed solved the problem.
Well, it worked for me.
Can you please share your entire EnvoyProxy config?
I can't see spec.backendDefaults in the CRD.
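One way to check which fields (if any) the installed EnvoyProxy CRD exposes for timeouts, assuming the Envoy Gateway CRDs are installed:
kubectl explain envoyproxy.spec --recursive | grep -i -B1 -A3 timeout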
You can set a timeout on the HTTPRoute object like this (ref: https://gateway-api.sigs.k8s.io/api-types/httproute/#timeouts-optional):
rules:
  - backendRefs:
      - group: ""
        kind: Service
        name: aibrix-gateway-plugins
        port: 50052
        weight: 1
    matches:
      - path:
          type: PathPrefix
          value: /v1/chat/completions
      - path:
          type: PathPrefix
          value: /v1/completions
    timeouts:
      backendRequest: 300s
But still that doesn't solve the problem.
@omerap12 HTTPRoute configuration doesn't help in this case: the P/D router talks to the pods directly and bypasses the HTTPRoute. Could you try bumping the aibrix-gateway plugin to v0.4.1 and give it another try?
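For reference, a rough sketch of bumping the plugin in place (deployment, container, and image names below are assumptions from a default install; verify with the first command):
kubectl -n aibrix-system get deployment aibrix-gateway-plugins -o jsonpath='{.spec.template.spec.containers[*].name} {.spec.template.spec.containers[*].image}{"\n"}'
kubectl -n aibrix-system set image deployment/aibrix-gateway-plugins gateway-plugin=aibrix/gateway-plugins:v0.4.1
kubectl -n aibrix-system rollout status deployment/aibrix-gateway-plugins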
Sure
I've also tested this with several different configurations:
- Prefill and decode instances on the same machine.
- Prefill and decode instances on different machines.
- Same machine with RDMA enabled.
- Same machine with RDMA disabled (using NCCL_IB_DISABLE=1).
I also tried a different model (Qwen/Qwen2.5-0.5B-Instruct), and the same problem persists.
It’s strange that this issue happens on EKS but not in ByteDance’s internal environment (thanks to @googs1025 for helping me debug).
@omerap12 I will spend some time this weekend to reproduce the problem on EKS.