aibrix followed quickstart - on openshift - getting "upstream connect error or disconnect/reset before headers. reset reason: protocol error"

🐛 Describe the bug

Ran the quickstart, but a curl request for a completion fails with a 502.

Steps to Reproduce

follow instructions at https://aibrix.readthedocs.io/latest/getting_started/quickstart.html
created dependency
created core
all pods running in envoy and in aibrix
deployed deepseek (it is also running)
port-forward (kubectl -n envoy-gateway-system port-forward service/envoy-aibrix-system-aibrix-eg-903790dc 8888:80 &)
issue curl

curl -v http://localhost:8888/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "deepseek-r1-distill-llama-8b",
        "prompt": "San Francisco is a",
        "max_tokens": 128,
        "temperature": 0
    }'
* Host localhost:8888 was resolved.
* IPv6: ::1
* IPv4: 127.0.0.1
*   Trying [::1]:8888...
* Connected to localhost (::1) port 8888
> POST /v1/completions HTTP/1.1
> Host: localhost:8888
> User-Agent: curl/8.7.1
> Accept: */*
> Content-Type: application/json
> Content-Length: 148
>
* upload completely sent off: 148 bytes
< HTTP/1.1 502 Bad Gateway
< content-length: 87
< content-type: text/plain
< date: Wed, 26 Feb 2025 22:10:37 GMT
<
* Connection #0 to host localhost left intact
upstream connect error or disconnect/reset before headers. reset reason: protocol error

Expected behavior

I expect the quickstart to work as documented: the curl request should return a completion.

Environment

aibrix 0.2.0, OpenShift 4.14, deepseek as required by the quickstart

Feb 26 '25 22:02 clubanderson

Thanks for reporting the issue. Could I know your status of gateway service?

kubectl get svc -n envoy-gateway-system

Feb 26 '25 22:02 Jeffwan

kubectl get svc -n envoy-gateway-system
NAME                                     TYPE           CLUSTER-IP      EXTERNAL-IP                           PORT(S)                                   AGE
deepseek-r1-distill-llama-8b             ClusterIP      172.30.34.138   <none>                                8000/TCP,8080/TCP                         138m
envoy-aibrix-system-aibrix-eg-903790dc   LoadBalancer   172.30.59.32    fbce761b-us-east.lb.appdomain.cloud   80:30402/TCP                              3h43m
envoy-gateway                            ClusterIP      172.30.28.241   <none>                                18000/TCP,18001/TCP,18002/TCP,19001/TCP   9h

Feb 26 '25 23:02 clubanderson

Had a long discussion with Varun Gupta last night. I can hit the model directly, but the AIBrix gateway does not pass the query along. Waiting for next steps.
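A rough sketch of that comparison, in case it helps others narrow down where the failure sits. It assumes the model service has been port-forwarded to localhost:8000 (the 8000/TCP port from the service listing above) and the gateway to localhost:8888; the function names are made up for illustration.

```shell
# POST a small completion request and print only the HTTP status code.
# $1: base URL, e.g. http://localhost:8888
completion_status() {
  url="$1"
  curl -s -o /dev/null -w '%{http_code}' "$url/v1/completions" \
    -H "Content-Type: application/json" \
    -d '{"model": "deepseek-r1-distill-llama-8b", "prompt": "San Francisco is a", "max_tokens": 16}'
}

# Decide which hop is failing from the two status codes.
# $1: status when hitting the model directly, $2: status through the gateway
locate_failure() {
  direct="$1"
  gateway="$2"
  if [ "$direct" != "200" ]; then
    echo "model"
  elif [ "$gateway" != "200" ]; then
    echo "gateway"
  else
    echo "none"
  fi
}

# Usage (both port-forwards are assumptions, see above):
# locate_failure "$(completion_status http://localhost:8000)" \
#                "$(completion_status http://localhost:8888)"
```

If this prints "gateway", the model itself is fine and the problem is somewhere between Envoy and the backend, which is the situation described here.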

Feb 27 '25 14:02 clubanderson

Due to our environment configuration in OpenShift (users have per-namespace permissions), there were a few issues with the default configuration:
a. the envoy-gateway-system job needed a security context
b. the default and envoy-gateway SAs (in envoy-gateway-system) needed anyuid treatment
c. to do the above, you need to create the envoy-gateway-system NS first

download the yaml for dependency and core
create the NS and add anyuid to service account 'default'

oc create ns envoy-gateway-system
oc adm policy add-scc-to-user anyuid -z default -n envoy-gateway-system

update the yaml for the certgen job (bottom of file aibrix-dependency-v0.2.0.yaml), adding the securityContext block

apiVersion: batch/v1
kind: Job
metadata:
  annotations:
    helm.sh/hook: pre-install, pre-upgrade
  labels:
    app.kubernetes.io/instance: eg
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: gateway-helm
    app.kubernetes.io/version: latest
    helm.sh/chart: gateway-helm-v0.0.0-latest
  name: eg-gateway-helm-certgen
  namespace: envoy-gateway-system
spec:
  backoffLimit: 1
  completions: 1
  parallelism: 1
  template:
    metadata:
      labels:
        app: certgen
    spec:
      containers:
      - command:
        - envoy-gateway
        - certgen
        securityContext:
          runAsNonRoot: true
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
          seccompProfile:
            type: RuntimeDefault
        env:
        - name: ENVOY_GATEWAY_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: KUBERNETES_CLUSTER_DOMAIN
          value: cluster.local
        image: envoyproxy/gateway:v1.1.0
        imagePullPolicy: IfNotPresent
        name: envoy-gateway-certgen
      imagePullSecrets: []
      restartPolicy: Never
      serviceAccountName: eg-gateway-helm-certgen
  ttlSecondsAfterFinished: 30

create (not apply) the dependency

oc create -f aibrix-dependency-v0.2.0.yaml
oc get jobs (look for 'completed')
oc get secret (look for 'envoy-gateway')
oc get pods (look for 'completed' and 'running')
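One thing to watch for: the job above sets ttlSecondsAfterFinished: 30, so it is removed 30 seconds after it finishes and `oc get jobs` may show nothing. A small sketch that waits for the job and falls back to checking for the secret it creates; the function name and the 120s timeout are arbitrary choices, not from the quickstart.

```shell
# Wait for the certgen job to complete; if the job is already gone
# (TTL expired) or the wait fails, check for the 'envoy-gateway' secret instead.
# $1: namespace (defaults to envoy-gateway-system)
wait_for_certgen() {
  ns="${1:-envoy-gateway-system}"
  oc wait --for=condition=complete job/eg-gateway-helm-certgen \
    -n "$ns" --timeout=120s \
    || oc get secret envoy-gateway -n "$ns"
}

# Usage:
# wait_for_certgen && echo "certgen finished"
```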

add anyuid to the envoy-gateway SA in envoy-gateway-system

oc adm policy add-scc-to-user anyuid -z envoy-gateway -n envoy-gateway-system

delete the envoy-gateway-system pod so that it is recreated and picks up the SA privileges

oc delete pod envoy-gateway-system-xxxx -n envoy-gateway-system

look for Errors in logs (should be clean now that SA has privs)

oc logs envoy-gateway-system-xxxx -n envoy-gateway-system
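To confirm the recreated pod was actually admitted under anyuid, you can read the openshift.io/scc annotation that OpenShift records on each pod. A minimal sketch; the function name is made up, and the pod name is whatever your recreated pod is called.

```shell
# Print the SCC that admitted a pod.
# $1: pod name, $2: namespace (defaults to envoy-gateway-system)
pod_scc() {
  oc get pod "$1" -n "${2:-envoy-gateway-system}" \
    -o jsonpath='{.metadata.annotations.openshift\.io/scc}'
}

# Usage (substitute the real pod name):
# pod_scc envoy-gateway-system-xxxx
```

If this prints something other than anyuid (for example restricted-v2), the pod was created before the SCC was granted or is running under a different service account.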

apply the core

oc apply -f core/aibrix-core-v0.2.0.yaml
oc get pods (look for 'running')

ensure extension policy is running

oc describe envoyextensionpolicy -A
    reason: invalid
    status: false

deploy the deepseek model (in some other namespace - not envoy or aibrix) (per original instructions)
do the port-forward (per original instructions)
try a conversation (curl stuff - per original instructions)

Feb 27 '25 18:02 clubanderson

Thank you @varungup90 for your help

Feb 27 '25 18:02 clubanderson

Slightly orthogonal to this issue: the first step in debugging a request error is to ensure that the envoy extension policy and the httproute both have Accepted status. I will update the quickstart documentation as well.

`Router name will be different based on model name.`
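A sketch of that check across all namespaces, so it does not depend on the route name. It assumes the Accepted condition is reported under .status.ancestors for EnvoyExtensionPolicy and under .status.parents for HTTPRoute; the function names are made up for illustration.

```shell
# List "<namespace>/<name><TAB><Accepted status>" for every EnvoyExtensionPolicy.
policy_statuses() {
  kubectl get envoyextensionpolicy -A -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\t"}{.status.ancestors[*].conditions[?(@.type=="Accepted")].status}{"\n"}{end}'
}

# Same listing for every HTTPRoute.
route_statuses() {
  kubectl get httproute -A -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\t"}{.status.parents[*].conditions[?(@.type=="Accepted")].status}{"\n"}{end}'
}

# Read the listing on stdin, print every object whose Accepted status is
# missing, False, or Unknown, and return non-zero if any were found.
report_not_accepted() {
  awk -F'\t' '
    BEGIN { bad = 0 }
    $2 == "" || $2 ~ /False|Unknown/ { print "not accepted: " $1; bad = 1 }
    END { exit bad }
  '
}

# Usage:
# policy_statuses | report_not_accepted
# route_statuses  | report_not_accepted
```

An object with an empty status column is reported too, since that usually means the controller has not reconciled it yet.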

Feb 27 '25 18:02 varungup90

This is incredibly helpful for diagnosing the issues. I believe we should provide more detailed installation guidance and include a dedicated documentation page for OpenShift environments.

Feb 28 '25 18:02 Jeffwan