Gateway error when making chat completion requests with p/d disaggregation model
🐛 Describe the bug
When deploying a p/d disaggregation model following this guide: https://aibrix.readthedocs.io/latest/getting_started/quickstart.html#deploy-prefill-decode-pd-disaggregation-model and sending the curl request below, I get the following error:
curl -v http://k8s-envoygat-envoyaib-17b348e7d0-0e5af338e460b06e.elb.us-east-1.amazonaws.com/v1/chat/completions \
-H "routing-strategy: pd" \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-r1-distill-llama-8b",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "help me write a random generator in python"}
],
"temperature": 0.7
}'
* Host k8s-envoygat-envoyaib-17b348e7d0-0e5af338e460b06e.elb.us-east-1.amazonaws.com:80 was resolved.
* IPv6: (none)
* IPv4: 10.0.1.57
* Trying 10.0.1.57:80...
* Connected to k8s-envoygat-envoyaib-17b348e7d0-0e5af338e460b06e.elb.us-east-1.amazonaws.com (10.0.1.57) port 80
* using HTTP/1.x
> POST /v1/chat/completions HTTP/1.1
> Host: k8s-envoygat-envoyaib-17b348e7d0-0e5af338e460b06e.elb.us-east-1.amazonaws.com
> User-Agent: curl/8.11.1
> Accept: */*
> routing-strategy: pd
> Content-Type: application/json
> Content-Length: 249
>
* upload completely sent off: 249 bytes
< HTTP/1.1 502 Bad Gateway
< content-length: 87
< content-type: text/plain
< date: Wed, 13 Aug 2025 15:26:47 GMT
<
* Connection #0 to host k8s-envoygat-envoyaib-17b348e7d0-0e5af338e460b06e.elb.us-east-1.amazonaws.com left intact
upstream connect error or disconnect/reset before headers. reset reason: protocol error
The /v1/models endpoint, however, works:
curl -v http://k8s-envoygat-envoyaib-17b348e7d0-0e5af338e460b06e.elb.us-east-1.amazonaws.com/v1/models
* Host k8s-envoygat-envoyaib-17b348e7d0-0e5af338e460b06e.elb.us-east-1.amazonaws.com:80 was resolved.
* IPv6: (none)
* IPv4: 10.0.1.57
* Trying 10.0.1.57:80...
* Connected to k8s-envoygat-envoyaib-17b348e7d0-0e5af338e460b06e.elb.us-east-1.amazonaws.com (10.0.1.57) port 80
* using HTTP/1.x
> GET /v1/models HTTP/1.1
> Host: k8s-envoygat-envoyaib-17b348e7d0-0e5af338e460b06e.elb.us-east-1.amazonaws.com
> User-Agent: curl/8.11.1
> Accept: */*
>
* Request completely sent off
< HTTP/1.1 200 OK
< content-type: application/json
< date: Wed, 13 Aug 2025 15:27:56 GMT
< content-length: 113
<
* Connection #0 to host k8s-envoygat-envoyaib-17b348e7d0-0e5af338e460b06e.elb.us-east-1.amazonaws.com left intact
{"object":"list","data":[{"id":"deepseek-r1-distill-llama-8b","created":0,"object":"model","owned_by":"aibrix"}]}
In the Envoy proxy logs I can see:
{":authority":"k8s-envoygat-envoyaib-17b348e7d0-0e5af338e460b06e.elb.us-east-1.amazonaws.com","bytes_received":249,"bytes_sent":87,"connection_termination_details":null,"downstream_local_address":"10.0.1.129:10080","downstream_remote_address":"10.0.1.13:50918","duration":33,"method":"POST","protocol":"HTTP/1.1","requested_server_name":null,"response_code":502,"response_code_details":"upstream_reset_before_response_started{protocol_error}","response_flags":"UPE","route_name":"httproute/aibrix-system/aibrix-reserved-router/rule/0/match/0/*","start_time":"2025-08-13T15:28:52.700Z","upstream_cluster":"httproute/aibrix-system/aibrix-reserved-router/rule/0","upstream_host":"10.0.3.250:50052","upstream_local_address":"10.0.1.129:48966","upstream_transport_failure_reason":null,"user-agent":"curl/8.11.1","x-envoy-origin-path":"/v1/chat/completions","x-envoy-upstream-service-time":null,"x-forwarded-for":"10.0.1.13","x-request-id":"6dc72b68-0252-4fec-82c1-8027732fc7b2"}
Looking into the gateway-plugins logs, I can see that the request is passed to the prefill instance:
I0813 15:28:52.731622 1 pd_disaggregation.go:196] "prefill_request_complete" request_id="d0789ace-b858-4909-8321-48c299dfa359"
I0813 15:28:52.731663 1 gateway_req_body.go:90] "request start" requestID="d0789ace-b858-4909-8321-48c299dfa359" requestPath="/v1/chat/completions" model="deepseek-r1-distill-llama-8b" stream=false routingAlgorithm="pd" targetPodIP="10.0.3.160:8000" routingDuration="29.03084m
And indeed the logs from the prefill instance show:
INFO 08-13 08:28:52 [logger.py:43] Received request chatcmpl-1a7b11c88afc466885bac44c09d36357: prompt: '<|begin▁of▁sentence|>You are a helpful assistant.<|User|>help me write a random generator in python<|Assistant|><think>\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=0.95, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
INFO 08-13 08:28:52 [async_llm.py:270] Added request chatcmpl-1a7b11c88afc466885bac44c09d36357.
INFO 08-13 08:29:00 [loggers.py:118] Engine 000: Avg prompt throughput: 1.9 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 42.1%
But no request ever reaches the decode instance. My guess is that the gateway-plugin is returning a bad response to Envoy (but it's just a guess). I tried playing with the request idle time by adding
timeouts:
request: 10s
backendRequest: 2s
to the HTTPRoute object but it doesn't help.
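For reference, the route I edited is the one named in the Envoy access log above (aibrix-reserved-router in aibrix-system). Assuming a default install, it can be opened with:
kubectl -n aibrix-system edit httproute aibrix-reserved-router
with the timeouts block placed under the rule that matches /v1/chat/completions.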
The image I used for the StormService was built using this
Steps to Reproduce
- Install AIBrix (I used helm): https://aibrix.readthedocs.io/latest/getting_started/installation/installation.html#stable-version-using-helm
- Build the image: https://github.com/vllm-project/aibrix/blob/main/samples/disaggregation/vllm/README.md#configuration
- Deploy p/d disaggregation model: https://aibrix.readthedocs.io/latest/getting_started/quickstart.html#deploy-prefill-decode-pd-disaggregation-model
- Run a simple request such as:
curl -v http://${ENDPOINT}/v1/chat/completions \
-H "routing-strategy: pd" \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-r1-distill-llama-8b",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "help me write a random generator in python"}
],
"temperature": 0.7
}'
Expected behavior
The request should succeed instead of returning 502 Bad Gateway.
Environment
- K8s version: v1.33.2-eks-931bdca
- AIBrix version: v0.4.0
- Model deployed: deepseek-r1-distill-llama-8b
@omerap12 We made some improvements to vLLM here: https://github.com/vllm-project/aibrix/pull/1429. I am not 100% sure if that's related. Could you build a new image and update the router?
Please also try to invoke the gateway ClusterIP from inside any of the pods; I'd just like to rule out any NLB or ALB issues.
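For example, something like this from a throwaway pod should bypass the load balancer entirely (the namespace and ClusterIP placeholder below are assumptions; look up the actual gateway Service first):
kubectl -n envoy-gateway-system get svc   # find the Envoy gateway Service and its CLUSTER-IP
kubectl run curl-test --rm -it --restart=Never --image=curlimages/curl --command -- \
curl -v http://<GATEWAY_CLUSTER_IP>/v1/chat/completions \
-H "routing-strategy: pd" \
-H "Content-Type: application/json" \
-d '{"model": "deepseek-r1-distill-llama-8b", "messages": [{"role": "user", "content": "hi"}]}'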
If this still happens, feel free to follow up on Slack.
Will do
I have a similar issue, caused by Envoy timing out after 15s. Here is how Sonnet fixed it:
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyProxy
metadata:
  name: aibrix-custom-proxy-config
  namespace: aibrix-system
  labels:
    app.kubernetes.io/component: aibrix-gateway-plugin
    app.kubernetes.io/managed-by: kubectl
    app.kubernetes.io/name: aibrix
    app.kubernetes.io/version: nightly
spec:
  # Add timeout configuration
  backendDefaults:
    timeout:
      # Increase upstream timeout from 15s to 300s (5 minutes)
      upstream: 300s
[...]
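Note that a custom EnvoyProxy only takes effect if the Gateway or GatewayClass references it via parametersRef. A rough sketch, assuming the AIBrix GatewayClass is named aibrix-eg (check with kubectl get gatewayclass):
kubectl patch gatewayclass aibrix-eg --type=merge -p '{"spec": {"parametersRef": {"group": "gateway.envoyproxy.io", "kind": "EnvoyProxy", "name": "aibrix-custom-proxy-config", "namespace": "aibrix-system"}}}'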
@ghpu Thanks! Did you test that this works? I tested again briefly and it didn't help, so I just want to make sure it indeed solved the problem.
Well, it worked for me.
Can you please share your entire EnvoyProxy config?
I can't see spec.backendDefaults in the CRD.
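One way to check which fields (if any) the installed EnvoyProxy CRD exposes for timeouts, assuming the Envoy Gateway CRDs are installed:
kubectl explain envoyproxy.spec --recursive | grep -i -B1 -A3 timeout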
You can set a timeout on the HTTPRoute object like this (ref: https://gateway-api.sigs.k8s.io/api-types/httproute/#timeouts-optional):
rules:
  - backendRefs:
      - group: ""
        kind: Service
        name: aibrix-gateway-plugins
        port: 50052
        weight: 1
    matches:
      - path:
          type: PathPrefix
          value: /v1/chat/completions
      - path:
          type: PathPrefix
          value: /v1/completions
    timeouts:
      backendRequest: 300s
But still that doesn't solve the problem.
@omerap12 HTTPRoute configuration doesn't help in this case: the P/D router talks to the pods directly and bypasses the HTTPRoute. Could you try bumping the aibrix-gateway plugin to v0.4.1 and give it another try?
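For reference, a rough sketch of bumping the plugin in place (deployment, container, and image names below are assumptions from a default install; verify with the first command):
kubectl -n aibrix-system get deployment aibrix-gateway-plugins -o jsonpath='{.spec.template.spec.containers[*].name} {.spec.template.spec.containers[*].image}{"\n"}'
kubectl -n aibrix-system set image deployment/aibrix-gateway-plugins gateway-plugin=aibrix/gateway-plugins:v0.4.1
kubectl -n aibrix-system rollout status deployment/aibrix-gateway-plugins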
Sure
I've also tested this with several different configurations:
- Prefill and decode instances on the same machine.
- Prefill and decode instances on different machines.
- Same machine with RDMA enabled.
- Same machine with RDMA disabled (using NCCL_IB_DISABLE=1).
I also tried a different model (Qwen/Qwen2.5-0.5B-Instruct), and the same problem persists.
It’s strange that this issue happens on EKS but not in ByteDance’s internal environment (thanks to @googs1025 for helping me debug).
@omerap12 I will spend some time this weekend to reproduce the problem on EKS.