
Testing AIBrix on AWS EKS Cluster

Jeffwan opened this issue 10 months ago · 1 comment

🚀 Feature Description and Motivation

In the past, we used Volcano Engine as the primary platform to test AIBrix. Now it's time to test against other public cloud providers. Honestly, I think the most cost-efficient option is Lambda, but Lambda primarily provides VMs rather than managed clusters; we could only deploy kind there, which is too much work.

Let's test against an AWS EKS cluster to check compatibility. At the same time, we need to update all the docs to use the latest vLLM version and remove Volcano Engine-specific references such as the VE container image registry and TOS.

Use Case

No response

Proposed Solution

No response

Jeffwan · Feb 11 '25


Jeffwan · Feb 11 '25

EKS Cluster Setup

eksctl create cluster --name aibrix --node-type=g5.4xlarge --nodes 2 --auto-kubeconfig
2025-04-28 21:47:55 [ℹ]  eksctl version 0.187.0-dev+707c73b66.2024-07-16T06:38:53Z
2025-04-28 21:47:55 [ℹ]  using region us-west-2
2025-04-28 21:47:55 [ℹ]  skipping us-west-2d from selection because it doesn't support the following instance type(s): g5.4xlarge
2025-04-28 21:47:55 [ℹ]  setting availability zones to [us-west-2a us-west-2c us-west-2b]
2025-04-28 21:47:55 [ℹ]  subnets for us-west-2a - public:192.168.0.0/19 private:192.168.96.0/19
2025-04-28 21:47:55 [ℹ]  subnets for us-west-2c - public:192.168.32.0/19 private:192.168.128.0/19
2025-04-28 21:47:55 [ℹ]  subnets for us-west-2b - public:192.168.64.0/19 private:192.168.160.0/19
2025-04-28 21:47:55 [ℹ]  nodegroup "ng-fc753bf9" will use "" [AmazonLinux2/1.30]
2025-04-28 21:47:55 [ℹ]  using Kubernetes version 1.30
2025-04-28 21:47:55 [ℹ]  creating EKS cluster "aibrix" in "us-west-2" region with managed nodes
2025-04-28 21:47:55 [ℹ]  will create 2 separate CloudFormation stacks for cluster itself and the initial managed nodegroup
2025-04-28 21:47:55 [ℹ]  if you encounter any issues, check CloudFormation console or try 'eksctl utils describe-stacks --region=us-west-2 --cluster=aibrix'
2025-04-28 21:47:55 [ℹ]  Kubernetes API endpoint access will use default of {publicAccess=true, privateAccess=false} for cluster "aibrix" in "us-west-2"
2025-04-28 21:47:55 [ℹ]  CloudWatch logging will not be enabled for cluster "aibrix" in "us-west-2"
2025-04-28 21:47:55 [ℹ]  you can enable it with 'eksctl utils update-cluster-logging --enable-types={SPECIFY-YOUR-LOG-TYPES-HERE (e.g. all)} --region=us-west-2 --cluster=aibrix'
2025-04-28 21:47:55 [ℹ]  default addons vpc-cni, kube-proxy, coredns were not specified, will install them as EKS addons
2025-04-28 21:47:55 [ℹ]
2 sequential tasks: { create cluster control plane "aibrix",
    2 sequential sub-tasks: {
        2 sequential sub-tasks: {
            1 task: { create addons },
            wait for control plane to become ready,
        },
        create managed nodegroup "ng-fc753bf9",
    }
}
2025-04-28 21:47:55 [ℹ]  building cluster stack "eksctl-aibrix-cluster"
2025-04-28 21:47:56 [ℹ]  deploying stack "eksctl-aibrix-cluster"
2025-04-28 21:48:26 [ℹ]  waiting for CloudFormation stack "eksctl-aibrix-cluster"
2025-04-28 21:48:56 [ℹ]  waiting for CloudFormation stack "eksctl-aibrix-cluster"
2025-04-28 21:49:56 [ℹ]  waiting for CloudFormation stack "eksctl-aibrix-cluster"
2025-04-28 21:50:56 [ℹ]  waiting for CloudFormation stack "eksctl-aibrix-cluster"
2025-04-28 21:51:56 [ℹ]  waiting for CloudFormation stack "eksctl-aibrix-cluster"
2025-04-28 21:52:56 [ℹ]  waiting for CloudFormation stack "eksctl-aibrix-cluster"
2025-04-28 21:53:57 [ℹ]  waiting for CloudFormation stack "eksctl-aibrix-cluster"
2025-04-28 21:54:57 [ℹ]  waiting for CloudFormation stack "eksctl-aibrix-cluster"
2025-04-28 21:55:57 [ℹ]  waiting for CloudFormation stack "eksctl-aibrix-cluster"
2025-04-28 21:55:59 [!]  recommended policies were found for "vpc-cni" addon, but since OIDC is disabled on the cluster, eksctl cannot configure the requested permissions; the recommended way to provide IAM permissions for "vpc-cni" addon is via pod identity associations; after addon creation is completed, add all recommended policies to the config file, under `addon.PodIdentityAssociations`, and run `eksctl update addon`
2025-04-28 21:55:59 [ℹ]  creating addon
2025-04-28 21:55:59 [ℹ]  successfully created addon
2025-04-28 21:55:59 [ℹ]  creating addon
2025-04-28 21:56:00 [ℹ]  successfully created addon
2025-04-28 21:56:00 [ℹ]  creating addon
2025-04-28 21:56:00 [ℹ]  successfully created addon
2025-04-28 21:58:01 [ℹ]  building managed nodegroup stack "eksctl-aibrix-nodegroup-ng-fc753bf9"
2025-04-28 21:58:01 [ℹ]  deploying stack "eksctl-aibrix-nodegroup-ng-fc753bf9"
2025-04-28 21:58:02 [ℹ]  waiting for CloudFormation stack "eksctl-aibrix-nodegroup-ng-fc753bf9"
2025-04-28 21:58:32 [ℹ]  waiting for CloudFormation stack "eksctl-aibrix-nodegroup-ng-fc753bf9"
2025-04-28 21:59:15 [ℹ]  waiting for CloudFormation stack "eksctl-aibrix-nodegroup-ng-fc753bf9"
2025-04-28 21:59:51 [ℹ]  waiting for CloudFormation stack "eksctl-aibrix-nodegroup-ng-fc753bf9"
2025-04-28 22:01:51 [ℹ]  waiting for CloudFormation stack "eksctl-aibrix-nodegroup-ng-fc753bf9"
2025-04-28 22:01:51 [ℹ]  waiting for the control plane to become ready
2025-04-28 22:01:52 [✔]  saved kubeconfig as "/Users/bytedance/.kube/eksctl/clusters/aibrix"
2025-04-28 22:01:52 [ℹ]  1 task: { install Nvidia device plugin }
W0428 22:01:52.922061   12610 warnings.go:70] spec.template.metadata.annotations[scheduler.alpha.kubernetes.io/critical-pod]: non-functional in v1.16+; use the "priorityClassName" field instead
2025-04-28 22:01:52 [ℹ]  created "kube-system:DaemonSet.apps/nvidia-device-plugin-daemonset"
2025-04-28 22:01:52 [ℹ]  as you are using the EKS-Optimized Accelerated AMI with a GPU-enabled instance type, the Nvidia Kubernetes device plugin was automatically installed.
	to skip installing it, use --install-nvidia-plugin=false.
2025-04-28 22:01:52 [✔]  all EKS cluster resources for "aibrix" have been created
2025-04-28 22:01:52 [✔]  created 0 nodegroup(s) in cluster "aibrix"
2025-04-28 22:01:53 [ℹ]  nodegroup "ng-fc753bf9" has 2 node(s)
2025-04-28 22:01:53 [ℹ]  node "ip-192-168-24-13.us-west-2.compute.internal" is ready
2025-04-28 22:01:53 [ℹ]  node "ip-192-168-49-240.us-west-2.compute.internal" is ready
2025-04-28 22:01:53 [ℹ]  waiting for at least 2 node(s) to become ready in "ng-fc753bf9"
2025-04-28 22:01:53 [ℹ]  nodegroup "ng-fc753bf9" has 2 node(s)
2025-04-28 22:01:53 [ℹ]  node "ip-192-168-24-13.us-west-2.compute.internal" is ready
2025-04-28 22:01:53 [ℹ]  node "ip-192-168-49-240.us-west-2.compute.internal" is ready
2025-04-28 22:01:53 [✔]  created 1 managed nodegroup(s) in cluster "aibrix"
2025-04-28 22:01:54 [ℹ]  kubectl command should work with "/Users/bytedance/.kube/eksctl/clusters/aibrix", try 'kubectl --kubeconfig=/Users/bytedance/.kube/eksctl/clusters/aibrix get nodes'
2025-04-28 22:01:54 [✔]  EKS cluster "aibrix" in "us-west-2" region is ready
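To confirm the nodes registered and that the NVIDIA device plugin is exposing GPUs, a quick check along these lines works (the kubeconfig path is taken from the eksctl output above; the custom-columns query is just one way to read the allocatable GPU count):

# Point kubectl at the kubeconfig eksctl wrote for this cluster
export KUBECONFIG=/Users/bytedance/.kube/eksctl/clusters/aibrix

# Both g5.4xlarge nodes should report Ready
kubectl get nodes

# Each node should advertise allocatable NVIDIA GPUs once the device plugin is up
kubectl get nodes -o custom-columns='NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'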

Jeffwan · Apr 29 '25

The AIBrix control plane spun up successfully.

k get pods
NAME                                         READY   STATUS    RESTARTS   AGE
aibrix-controller-manager-59c78dfb6d-n79sk   1/1     Running   0          62s
aibrix-gateway-plugins-87dfdc75f-8v29b       1/1     Running   0          62s
aibrix-gpu-optimizer-7d7647576d-ggj56        1/1     Running   0          62s
aibrix-kuberay-operator-6fc447d79d-9gxnz     1/1     Running   0          62s
aibrix-metadata-service-78958589d8-t7sxr     1/1     Running   0          62s
aibrix-redis-master-8f86f8897-m7srv          1/1     Running   0          62s
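For anyone reproducing this: the components above come from the AIBrix release manifests. A minimal sketch, assuming the v0.2.1 release tag (substitute the latest tag from the releases page):

# Install dependencies (Envoy Gateway, KubeRay, etc.), then the AIBrix core components
kubectl apply -f https://github.com/vllm-project/aibrix/releases/download/v0.2.1/aibrix-dependency-v0.2.1.yaml
kubectl apply -f https://github.com/vllm-project/aibrix/releases/download/v0.2.1/aibrix-core-v0.2.1.yaml

# The manifests install into the aibrix-system namespace
kubectl get pods -n aibrix-system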

Jeffwan · Apr 29 '25

Data Plane Traffic

The ELB works fine; I do not see any issues connecting to it.
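For reference, the data-plane endpoint is the LoadBalancer service that Envoy Gateway provisions. A sketch of how to find and smoke-test it, assuming the default install where the gateway is named aibrix-eg (the generated service name differs per install, so list the services and pick the LoadBalancer one):

# List the LoadBalancer service fronting the AIBrix gateway
kubectl get svc -n envoy-gateway-system

# Grab the ELB hostname via the owning-gateway label Envoy Gateway sets
ENDPOINT=$(kubectl get svc -n envoy-gateway-system \
  -l gateway.envoyproxy.io/owning-gateway-name=aibrix-eg \
  -o jsonpath='{.items[0].status.loadBalancer.ingress[0].hostname}')

# Smoke-test the OpenAI-compatible API through the ELB
# (add an Authorization header if API-key checking is enabled)
curl -s "http://${ENDPOINT}/v1/models"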


The quickstart guide works as well: https://aibrix.readthedocs.io/latest/getting_started/quickstart.html
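The quickstart boils down to deploying the sample model and querying it through the gateway, approximately as follows; the manifest URL and model name are taken from the quickstart docs at the time and may have moved, so treat both as placeholders:

# Deploy the quickstart sample model (a vLLM deployment the gateway routes to)
kubectl apply -f https://raw.githubusercontent.com/vllm-project/aibrix/main/samples/quickstart/model.yaml

# Query it through the gateway ELB found above
curl -s "http://${ENDPOINT}/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek-r1-distill-llama-8b", "messages": [{"role": "user", "content": "Say hello from EKS"}]}'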

Jeffwan · Apr 29 '25