feat(k8s-bench): add scripts for HPA, rolling update, StatefulSet tasks
https://github.com/GoogleCloudPlatform/kubectl-ai/issues/145
Overview
This pull request adds three comprehensive Kubernetes benchmark scenarios under k8s-bench/tasks to extend coverage of everyday cluster operations:
- rolling-update-deployment
- horizontal-pod-autoscaler
- statefulset-lifecycle
Each scenario follows the existing pattern with four files:
- `task.yaml` – script prompt & metadata
- `setup.sh` – resource provisioning & readiness loops
- `verify.sh` – assertions on desired state
- `cleanup.sh` – namespace/resource teardown
1. rolling-update-deployment
Goal: Zero-downtime image rollout for a Deployment.
- Setup: Create namespace `rollout-test`, deploy `web-app` (`nginx:1.21`, 3 replicas), wait for readiness
- Script: `kubectl set image` to `nginx:1.22`
- Verify: `kubectl rollout status`, ensure all pods use `nginx:1.22`
- Cleanup: Delete `rollout-test` namespace
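The verify step above could look roughly like the following. This is a minimal sketch, not the `verify.sh` shipped in this PR: the pod label `app=web-app` and the 120-second timeout are assumptions, and `all_images_match` / `verify_rollout` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch of a verify step for the rolling-update task (illustrative only).
set -eu

NAMESPACE="rollout-test"
DEPLOYMENT="web-app"
EXPECTED_IMAGE="nginx:1.22"
POD_SELECTOR="app=web-app"   # assumption: label carried by the Deployment's pods

# Succeeds only if every whitespace-separated image in $2 equals $1
# and at least one image is present.
all_images_match() {
  local expected="$1" images="$2" image count=0
  for image in $images; do
    [ "$image" = "$expected" ] || return 1
    count=$((count + 1))
  done
  [ "$count" -gt 0 ]
}

# Waits for the rollout to finish, then checks the image of every pod.
verify_rollout() {
  kubectl rollout status "deployment/${DEPLOYMENT}" -n "$NAMESPACE" --timeout=120s
  local images
  images="$(kubectl get pods -n "$NAMESPACE" -l "$POD_SELECTOR" \
    -o jsonpath='{.items[*].spec.containers[*].image}')"
  all_images_match "$EXPECTED_IMAGE" "$images"
}
```

Calling `verify_rollout` at the end of the script would make its exit status the task result: zero when every pod runs `nginx:1.22`, non-zero otherwise.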
2. horizontal-pod-autoscaler
Goal: Exercise HPA targeting 50% CPU utilization.
- Setup:
  - Create namespace `hpa-test`
  - Deploy a BusyBox CPU-burner with 100m CPU request
  - Create HPA (min=1, max=3, target=50% CPU)
- Script: Generate sustained CPU load via BusyBox loop
- Verify: Wait for HPA to scale above 1 replica
- Cleanup: Delete `hpa-test` namespace
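Waiting for the HPA to scale above one replica amounts to polling its status with a timeout. The sketch below illustrates one way to do that and is not the PR's actual `verify.sh`: the HPA name `cpu-burner`, the 300-second timeout, and the 5-second interval are assumptions, and the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch of a verify step for the HPA task (illustrative only).
set -eu

NAMESPACE="hpa-test"
HPA_NAME="cpu-burner"    # hypothetical: the PR description does not name the HPA
TIMEOUT_SECONDS=300      # assumed upper bound for scale-up
INTERVAL_SECONDS=5       # assumed polling interval

# Succeeds when $1 is a non-negative integer greater than $2.
# An empty or non-numeric value (status not populated yet) counts as "not scaled".
replicas_above() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;
  esac
  [ "$1" -gt "$2" ]
}

# Re-runs the given command every $2 seconds until it succeeds,
# giving up once $1 seconds have elapsed.
wait_until() {
  local timeout="$1" interval="$2" elapsed=0
  shift 2
  until "$@"; do
    if [ "$elapsed" -ge "$timeout" ]; then
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
}

# Reads the replica count currently reported by the HPA.
hpa_scaled_up() {
  local current
  current="$(kubectl get hpa "$HPA_NAME" -n "$NAMESPACE" \
    -o jsonpath='{.status.currentReplicas}')"
  replicas_above "$current" 1
}

verify_hpa() {
  wait_until "$TIMEOUT_SECONDS" "$INTERVAL_SECONDS" hpa_scaled_up
}
```

Polling with a bounded timeout keeps the task from hanging when metrics are unavailable, for example on a cluster without metrics-server.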
3. statefulset-lifecycle
Goal: Validate StatefulSet scaling and data persistence.
- Setup:
  - Create namespace `statefulset-test`
  - Apply headless Service `db`
  - Deploy StatefulSet `db` (5 replicas, 1Gi PVC) writing test data
- Script:
  - Scale down to 2 replicas, confirm only `db-0` & `db-1` remain
- Verify: Pod counts and persistent storage checks
- Cleanup: Delete `statefulset-test` namespace
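The pod-count check relies on StatefulSet pods being named `<name>-<ordinal>`, with ordinals starting at 0 by default, so after scaling to 2 replicas the surviving set should be exactly `db-0` and `db-1`. A minimal sketch of that comparison follows; it is not the PR's actual `verify.sh`, the label `app=db` is an assumption, and the helper names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch of the pod-name check for the StatefulSet task (illustrative only).
set -eu

NAMESPACE="statefulset-test"
STATEFULSET="db"
POD_SELECTOR="app=db"   # assumption: label carried by the StatefulSet's pods

# Prints the pod names a StatefulSet named $1 with $2 replicas should own,
# one per line: <name>-0 ... <name>-(N-1).
expected_pod_names() {
  local name="$1" replicas="$2" i=0
  while [ "$i" -lt "$replicas" ]; do
    printf '%s-%s\n' "$name" "$i"
    i=$((i + 1))
  done
}

# Succeeds if the whitespace-separated pod names in $3 are exactly
# the set expected for StatefulSet $1 with $2 replicas (order-insensitive).
pods_match() {
  local expected actual
  expected="$(expected_pod_names "$1" "$2" | sort)"
  # $3 is intentionally unquoted so each name becomes its own line.
  actual="$(printf '%s\n' $3 | sort)"
  [ "$expected" = "$actual" ]
}

# Fetches the current pod names and compares them with the expected set.
verify_pods() {
  local names
  names="$(kubectl get pods -n "$NAMESPACE" -l "$POD_SELECTOR" \
    -o jsonpath='{.items[*].metadata.name}')"
  pods_match "$STATEFULSET" 2 "$names"
}
```

Pods that are still terminating appear in `kubectl get pods`, so a real verifier would likely retry `verify_pods` for a while before failing. The persistent storage check is not covered by this sketch.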
Motivation
- Covers critical Kubernetes workflows: rolling updates, autoscaling, stateful workloads
- Enhances benchmark suite for real-world LLM-driven `kubectl` agents
- Maintains consistency with existing task structure and conventions
Testing
- Executed `./k8s-bench run --task-pattern rollout,hpa,statefulset` against a Kind cluster
- Confirmed a zero exit code on success and a non-zero exit code on failure conditions
- Verified no regressions in existing tasks
No breaking changes introduced. All scripts are idempotent and contained within their own namespaces.
K8s-bench Evaluation Results
Model Performance Summary
| Model | Success | Fail |
|---|---|---|
| gpt-4.1 | 2 | 1 |
| Total | 2 | 1 |
Overall Summary
- Total Runs: 3
- Overall Success: 2 (66%)
- Overall Fail: 1 (33%)
Model: gpt-4.1
| Task | Provider | Result |
|---|---|---|
| horizontal-pod-autoscaler | openai | ✅ success |
| rolling-update-deployment | openai | ✅ success |
| statefulset-lifecycle | openai | ❌ |
gpt-4.1 Summary
- Total: 3
- Success: 2 (66%)
- Fail: 1 (33%)
K8s-bench Evaluation Results
Model Performance Summary
| Model | Success | Fail |
|---|---|---|
| gpt-4.1 | 3 | 0 |
| o4-mini | 0 | 1 |
| Total | 3 | 1 |
Overall Summary
- Total Runs: 4
- Overall Success: 3 (75%)
- Overall Fail: 1 (25%)
Model: gpt-4.1
| Task | Provider | Result |
|---|---|---|
| horizontal-pod-autoscaler | openai | ✅ success |
| rolling-update-deployment | openai | ✅ success |
| statefulset-lifecycle | openai | ✅ success |
gpt-4.1 Summary
- Total: 3
- Success: 3 (100%)
- Fail: 0 (0%)
Model: o4-mini
| Task | Provider | Result |
|---|---|---|
| statefulset-lifecycle | openai | ❌ |
o4-mini Summary
- Total: 1
- Success: 0 (0%)
- Fail: 1 (100%)
Report generated on May 8, 2025 at 9:18 PM