followed quickstart - on openshift - getting "upstream connect error or disconnect/reset before headers. reset reason: protocol error"

Open clubanderson opened this issue 10 months ago • 2 comments

🐛 Describe the bug

Ran the quickstart, but a curl request for a completion fails with a 502.

Steps to Reproduce

  1. follow instructions at https://aibrix.readthedocs.io/latest/getting_started/quickstart.html
  2. created dependency
  3. created core
  4. all pods running in envoy and in aibrix
  5. deployed deepseek (it is also running)
  6. port-forward (kubectl -n envoy-gateway-system port-forward service/envoy-aibrix-system-aibrix-eg-903790dc 8888:80 &)
  7. issue curl
curl -v http://localhost:8888/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "deepseek-r1-distill-llama-8b",
        "prompt": "San Francisco is a",
        "max_tokens": 128,
        "temperature": 0
    }'
* Host localhost:8888 was resolved.
* IPv6: ::1
* IPv4: 127.0.0.1
*   Trying [::1]:8888...
* Connected to localhost (::1) port 8888
> POST /v1/completions HTTP/1.1
> Host: localhost:8888
> User-Agent: curl/8.7.1
> Accept: */*
> Content-Type: application/json
> Content-Length: 148
>
* upload completely sent off: 148 bytes
< HTTP/1.1 502 Bad Gateway
< content-length: 87
< content-type: text/plain
< date: Wed, 26 Feb 2025 22:10:37 GMT
<
* Connection #0 to host localhost left intact
upstream connect error or disconnect/reset before headers. reset reason: protocol error

Expected behavior

I expect the quickstart to work quickly

Environment

aibrix version 2.0, OpenShift 4.14, deepseek as required by the quickstart

clubanderson avatar Feb 26 '25 22:02 clubanderson

Thanks for reporting the issue. Could you share the status of your gateway service?

kubectl get svc -n envoy-gateway-system

Jeffwan avatar Feb 26 '25 22:02 Jeffwan

kubectl get svc -n envoy-gateway-system             
NAME                                     TYPE           CLUSTER-IP      EXTERNAL-IP                           PORT(S)                                   AGE
deepseek-r1-distill-llama-8b             ClusterIP      172.30.34.138   <none>                                8000/TCP,8080/TCP                         138m
envoy-aibrix-system-aibrix-eg-903790dc   LoadBalancer   172.30.59.32    fbce761b-us-east.lb.appdomain.cloud   80:30402/TCP                              3h43m
envoy-gateway                            ClusterIP      172.30.28.241   <none>                                18000/TCP,18001/TCP,18002/TCP,19001/TCP   9h

clubanderson avatar Feb 26 '25 23:02 clubanderson

Had a long discussion with Varun Gupta last night. I can hit the model directly and get a successful response, but the AIBrix gateway does not pass the query along. Waiting for next steps.
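The comparison described above (model reachable directly, gateway failing) can be reproduced by sending the same request to both endpoints and comparing status codes. This is a minimal sketch: the helper name `completion_status` is hypothetical, and it assumes the model service has been port-forwarded to local port 8000 (the port shown in the service listing) alongside the gateway on 8888.

```shell
# Print only the HTTP status code returned for a small completion request.
# $1 = base URL (assumed: http://localhost:8000 for the model,
#      http://localhost:8888 for the gateway port-forward)
completion_status() {
  curl -s -o /dev/null -w '%{http_code}' "$1/v1/completions" \
    -H 'Content-Type: application/json' \
    -d '{"model": "deepseek-r1-distill-llama-8b", "prompt": "San Francisco is a", "max_tokens": 16, "temperature": 0}'
}
```

Calling `completion_status http://localhost:8000` and then `completion_status http://localhost:8888` should show whether only the gateway path fails. A success code from the first and a 502 from the second points at the gateway rather than the model.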

clubanderson avatar Feb 27 '25 14:02 clubanderson

Due to our environment configuration in OpenShift (we have per-namespace permissions for users), there were a few issues with the default configuration:
  a. the envoy-gateway-system job needed a security context
  b. the default and envoy-gateway-system SAs needed anyuid treatment
  c. to do the above, you need to create the envoy-gateway-system NS first

  1. download the yaml for dependency and core
  2. create the NS and add anyuid to service account 'default'
oc create ns envoy-gateway-system
oc adm policy add-scc-to-user anyuid -z default -n envoy-gateway-system
  3. update yaml for job (bottom of file aibrix-dependency-v0.2.0.yaml)
apiVersion: batch/v1
kind: Job
metadata:
  annotations:
    helm.sh/hook: pre-install, pre-upgrade
  labels:
    app.kubernetes.io/instance: eg
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: gateway-helm
    app.kubernetes.io/version: latest
    helm.sh/chart: gateway-helm-v0.0.0-latest
  name: eg-gateway-helm-certgen
  namespace: envoy-gateway-system
spec:
  backoffLimit: 1
  completions: 1
  parallelism: 1
  template:
    metadata:
      labels:
        app: certgen
    spec:
      containers:
      - command:
        - envoy-gateway
        - certgen
        securityContext:
          runAsNonRoot: true
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
          seccompProfile:
            type: RuntimeDefault
        env:
        - name: ENVOY_GATEWAY_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: KUBERNETES_CLUSTER_DOMAIN
          value: cluster.local
        image: envoyproxy/gateway:v1.1.0
        imagePullPolicy: IfNotPresent
        name: envoy-gateway-certgen
      imagePullSecrets: []
      restartPolicy: Never
      serviceAccountName: eg-gateway-helm-certgen
  ttlSecondsAfterFinished: 30
  4. create (not apply) the dependency
oc create -f aibrix-dependency-v0.2.0.yaml
oc get jobs      # look for 'completed'
oc get secret    # look for 'envoy-gateway'
oc get pods      # look for 'completed' and 'running'
  5. add anyuid to the envoy-gateway SA in envoy-gateway-system
oc adm policy add-scc-to-user anyuid -z envoy-gateway -n envoy-gateway-system
  6. delete the envoy-gateway-system pod so it is recreated and picks up the SA privileges
oc delete pod envoy-gateway-system-xxxx -n envoy-gateway-system
  7. look for errors in the logs of the recreated pod (should be clean now that the SA has privileges)
oc logs envoy-gateway-system-xxxx -n envoy-gateway-system
  8. apply the core
oc apply -f core/aibrix-core-v0.2.0.yaml
oc get pods      # look for 'running'
  9. ensure extension policy is running
oc describe envoyextensionpolicy -A
    reason: invalid
    status: false
  10. deploy the deepseek model (in some other namespace - not envoy or aibrix) (per original instructions)
  11. do the port-forward (per original instructions)
  12. try a conversation (curl stuff - per original instructions)
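The manual checks after creating the dependency (job completed, secret present) could also be scripted. This is a rough sketch, not a tested procedure: the function name `verify_dependency` is hypothetical, and the job and secret names are taken from the YAML and notes above. Because the job sets `ttlSecondsAfterFinished: 30`, it may already be gone if this runs too late, in which case the wait fails even though the secret exists.

```shell
# Wait for the certgen Job, then confirm the secret it should create.
# $1 = optional timeout for the wait (defaults to 120s)
verify_dependency() {
  ns=envoy-gateway-system
  # Block until the certgen Job reports Complete, or time out.
  oc wait --for=condition=complete job/eg-gateway-helm-certgen \
    -n "$ns" --timeout="${1:-120s}" || return 1
  # The notes above say to look for a secret named 'envoy-gateway'.
  oc get secret envoy-gateway -n "$ns" >/dev/null || return 1
  echo "dependency ready"
}
```

Run `verify_dependency` right after `oc create -f aibrix-dependency-v0.2.0.yaml`. A non-zero return means one of the two checks did not pass.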

clubanderson avatar Feb 27 '25 18:02 clubanderson

Thank you @varungup90 for your help

clubanderson avatar Feb 27 '25 18:02 clubanderson

Slightly orthogonal to this issue: the first step in debugging a request error is to ensure that the envoy extension policy and the httproute both have Accepted status. I will update the quickstart documentation as well.

Image

The router name will differ based on the model name.

Image
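The Accepted check can also be done from the command line instead of reading `describe` output. This is a sketch under assumptions: the helper names are hypothetical, and the status field paths (`status.ancestors` for the extension policy, `status.parents` for the httproute) are assumed from the Gateway API status layout and are not confirmed in this thread.

```shell
# List the Accepted condition for every object of a kind, across namespaces.
# $1 = kind (e.g. envoyextensionpolicy, httproute)
# $2 = status list field (assumed: "ancestors" for policies, "parents" for routes)
accepted_status() {
  kubectl get "$1" -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{"\t"}{.status.'"$2"'[*].conditions[?(@.type=="Accepted")].status}{"\n"}{end}'
}

# Return non-zero if any listed object is missing Accepted=True.
all_accepted() {
  accepted_status "$1" "$2" | awk -F '\t' '
    NF && $2 !~ /^(True ?)+$/ { print "not accepted: " $1; bad = 1 }
    END { exit bad + 0 }'
}
```

For example, `all_accepted envoyextensionpolicy ancestors && all_accepted httproute parents` returns success only when every object reports Accepted as True. Any object printed as "not accepted" is the one to inspect with `kubectl describe`.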

varungup90 avatar Feb 27 '25 18:02 varungup90

This is incredibly helpful for diagnosing the issues. I believe we should provide more detailed installation guidance and include a dedicated documentation page for OpenShift environments.

Jeffwan avatar Feb 28 '25 18:02 Jeffwan