followed quickstart - on openshift - getting "upstream connect error or disconnect/reset before headers. reset reason: protocol error"
🐛 Describe the bug
After running the quickstart, a curl request for a completion fails with a 502 Bad Gateway.
Steps to Reproduce
- follow instructions at https://aibrix.readthedocs.io/latest/getting_started/quickstart.html
- created dependency
- created core
- all pods running in the Envoy Gateway and AIBrix namespaces
- deployed deepseek (it is also running)
- port-forward (kubectl -n envoy-gateway-system port-forward service/envoy-aibrix-system-aibrix-eg-903790dc 8888:80 &)
- issue curl
curl -v http://localhost:8888/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-r1-distill-llama-8b",
"prompt": "San Francisco is a",
"max_tokens": 128,
"temperature": 0
}'
* Host localhost:8888 was resolved.
* IPv6: ::1
* IPv4: 127.0.0.1
* Trying [::1]:8888...
* Connected to localhost (::1) port 8888
> POST /v1/completions HTTP/1.1
> Host: localhost:8888
> User-Agent: curl/8.7.1
> Accept: */*
> Content-Type: application/json
> Content-Length: 148
>
* upload completely sent off: 148 bytes
< HTTP/1.1 502 Bad Gateway
< content-length: 87
< content-type: text/plain
< date: Wed, 26 Feb 2025 22:10:37 GMT
<
* Connection #0 to host localhost left intact
upstream connect error or disconnect/reset before headers. reset reason: protocol error%
Expected behavior
I expect the quickstart to work quickly
Environment
- AIBrix 0.2.0
- OpenShift 4.14
- DeepSeek model as required by the quickstart
Thanks for reporting the issue. Could you share the status of your gateway service?
kubectl get svc -n envoy-gateway-system
kubectl get svc -n envoy-gateway-system
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
deepseek-r1-distill-llama-8b ClusterIP 172.30.34.138 <none> 8000/TCP,8080/TCP 138m
envoy-aibrix-system-aibrix-eg-903790dc LoadBalancer 172.30.59.32 fbce761b-us-east.lb.appdomain.cloud 80:30402/TCP 3h43m
envoy-gateway ClusterIP 172.30.28.241 <none> 18000/TCP,18001/TCP,18002/TCP,19001/TCP 9h
Had a long discussion with Varun Gupta last night. I can hit the model directly, but the AIBrix gateway does not pass the query along. Waiting for next steps.
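A minimal sketch of that comparison (gateway versus model service), assuming the service name and port 8000 shown in the `kubectl get svc` output above. The helper name `completion_status` is hypothetical and not part of AIBrix:

```shell
# Print only the HTTP status code returned by a /v1/completions endpoint.
# $1 is the base URL, e.g. http://localhost:8888 for the gateway port-forward.
completion_status() {
  base_url="$1"
  curl -s -o /dev/null -w '%{http_code}' \
    -H 'Content-Type: application/json' \
    -d '{"model": "deepseek-r1-distill-llama-8b", "prompt": "San Francisco is a", "max_tokens": 16, "temperature": 0}' \
    "$base_url/v1/completions"
}
```

With the gateway port-forward from the reproduction steps running, `completion_status http://localhost:8888` would print 502 for the failure reported here. After `kubectl port-forward service/deepseek-r1-distill-llama-8b 8000:8000` (in whichever namespace the model was deployed), `completion_status http://localhost:8000` should print 200 if the model itself is healthy, which points the investigation at the gateway rather than the model.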
Due to our environment configuration in OpenShift (users have per-namespace permissions), there were a few issues with the default configuration:
a. the envoy-gateway-system job needed a security context
b. the default and envoy-gateway service accounts in envoy-gateway-system needed anyuid
c. to do the above, you need to create the envoy-gateway-system NS first
- download the yaml for dependency and core
- create the NS and add anyuid to service account 'default'
oc create ns envoy-gateway-system
oc adm policy add-scc-to-user anyuid -z default -n envoy-gateway-system
- update the yaml for the job (bottom of file aibrix-dependency-v0.2.0.yaml) to add a securityContext
apiVersion: batch/v1
kind: Job
metadata:
  annotations:
    helm.sh/hook: pre-install, pre-upgrade
  labels:
    app.kubernetes.io/instance: eg
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: gateway-helm
    app.kubernetes.io/version: latest
    helm.sh/chart: gateway-helm-v0.0.0-latest
  name: eg-gateway-helm-certgen
  namespace: envoy-gateway-system
spec:
  backoffLimit: 1
  completions: 1
  parallelism: 1
  template:
    metadata:
      labels:
        app: certgen
    spec:
      containers:
      - command:
        - envoy-gateway
        - certgen
        securityContext:
          runAsNonRoot: true
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
          seccompProfile:
            type: RuntimeDefault
        env:
        - name: ENVOY_GATEWAY_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: KUBERNETES_CLUSTER_DOMAIN
          value: cluster.local
        image: envoyproxy/gateway:v1.1.0
        imagePullPolicy: IfNotPresent
        name: envoy-gateway-certgen
      imagePullSecrets: []
      restartPolicy: Never
      serviceAccountName: eg-gateway-helm-certgen
  ttlSecondsAfterFinished: 30
- create (not apply) the dependency
oc create -f aibrix-dependency-v0.2.0.yaml
oc get jobs (look for 'completed')
oc get secret (look for 'envoy-gateway')
oc get pods (look for 'completed' and 'running')
- add anyuid to the envoy-gateway SA in envoy-gateway-system
oc adm policy add-scc-to-user anyuid -z envoy-gateway -n envoy-gateway-system
- Delete the envoy-gateway-system pod so it can recreate and make use of the SA privileges
oc delete pod envoy-gateway-system-xxxx -n envoy-gateway-system
- look for errors in the logs of the recreated pod (should be clean now that the SA has privs)
oc logs pod/envoy-gateway-system-xxxx -n envoy-gateway-system
- apply the core
oc apply -f core/aibrix-core-v0.2.0.yaml
oc get pods (look for 'running')
- ensure extension policy is running
oc describe envoyextensionpolicy -A
reason: invalid
status: false
- deploy the deepseek model (in some other namespace - not envoy or aibrix) (per original instructions)
- do the port-forward (per original instructions)
- try a conversation (curl stuff - per original instructions)
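To confirm that the anyuid grants in the steps above took effect, one option is to read the `openshift.io/scc` annotation that OpenShift records on each admitted pod. A sketch, with `show_pod_scc` as a hypothetical helper name and the namespace defaulting to the one used in this thread:

```shell
# List each pod in a namespace together with the SCC it was admitted under.
# $1 is the namespace (defaults to envoy-gateway-system).
show_pod_scc() {
  ns="${1:-envoy-gateway-system}"
  oc get pods -n "$ns" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.openshift\.io/scc}{"\n"}{end}'
}
```

Running `show_pod_scc` after deleting and recreating the pod should typically show `anyuid` next to pods whose service account received the grant. A pod created before the grant keeps its original SCC, which is why the delete step matters.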
Thank you @varungup90 for your help
Slightly orthogonal to this issue, the first step in debugging a request error is to ensure that the envoy extension policy and the httproute have Accepted status. I will update the quickstart documentation as well.
The router name will differ based on the model name.
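A sketch of that first check, assuming the standard Gateway API status layout (`status.parents` on HTTPRoute) and assuming that EnvoyExtensionPolicy reports its conditions under `status.ancestors`. The function names are hypothetical:

```shell
# Print namespace/name and the Accepted condition status for every HTTPRoute.
httproute_accepted() {
  kubectl get httproute -A \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{"\t"}{.status.parents[*].conditions[?(@.type=="Accepted")].status}{"\n"}{end}'
}

# Same check for EnvoyExtensionPolicy (status path is an assumption, see above).
extension_policy_accepted() {
  kubectl get envoyextensionpolicy -A \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{"\t"}{.status.ancestors[*].conditions[?(@.type=="Accepted")].status}{"\n"}{end}'
}
```

Each output line pairs a resource with `True` or `False`. Any `False` (or an empty status column) is worth inspecting with `kubectl describe` before retrying the curl request.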
This is incredibly helpful for diagnosing the issues. I believe we should provide more detailed installation guidance and include a dedicated documentation page for OpenShift environments.