https://github.com/GoogleCloudPlatform/kubectl-ai/issues/145

Overview

This pull request adds three Kubernetes benchmark scenarios under k8s-bench/tasks to extend coverage of everyday cluster operations:

rolling-update-deployment
horizontal-pod-autoscaler
statefulset-lifecycle

Each scenario follows the existing pattern with four files:

task.yaml – script prompt & metadata
setup.sh – resource provisioning & readiness loops
verify.sh – assertions on desired state
cleanup.sh – namespace/resource teardown

1. rolling-update-deployment

Goal: Zero-downtime image rollout for a Deployment.

Setup: Create namespace rollout-test, deploy web-app with image nginx:1.21 (3 replicas), wait for readiness
Script: Update the image to nginx:1.22 using kubectl set image
Verify: kubectl rollout status, ensure all pods use nginx:1.22
Cleanup: Delete rollout-test namespace
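
The rollout and its verification can be sketched in shell roughly as follows. The namespace, Deployment name, and image tags are taken from the description above. The function names, the app=web-app label selector, the wildcard container match, and the 120s timeout are assumptions for illustration, not the PR's actual scripts.

```shell
#!/usr/bin/env bash
# Sketch only: function names, label selector, and timeout are assumptions.

NS="rollout-test"

# Succeeds only if every whitespace-separated image in $2 equals $1.
all_images_match() {
  local expected="$1" images="$2" image count=0
  for image in $images; do
    [ "$image" = "$expected" ] || return 1
    count=$((count + 1))
  done
  # An empty list means no pods were found, which also counts as a failure.
  [ "$count" -gt 0 ]
}

# The step the agent is expected to perform ('*' matches every container).
roll_out() {
  kubectl set image deployment/web-app '*=nginx:1.22' -n "$NS"
}

# Roughly what verify.sh needs to assert.
verify_rollout() {
  kubectl rollout status deployment/web-app -n "$NS" --timeout=120s || return 1
  local images
  # Assumes the pods carry the label app=web-app. Pods from the old
  # ReplicaSet that are still terminating may be listed briefly, so a real
  # script may need to retry this check.
  images="$(kubectl get pods -n "$NS" -l app=web-app \
    -o jsonpath='{.items[*].spec.containers[*].image}')" || return 1
  all_images_match "nginx:1.22" "$images"
}
```

Calling roll_out and then verify_rollout returns 0 only when the rollout completes and every listed pod reports nginx:1.22.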

2. horizontal-pod-autoscaler

Goal: Exercise HPA targeting 50% CPU utilization.

Setup:
- Create namespace hpa-test
- Deploy a BusyBox CPU-burner with 100m CPU request
- Create HPA (min=1, max=3, target=50% CPU)
Script: Generate sustained CPU load via BusyBox loop
Verify: Wait for HPA to scale above 1 replica
Cleanup: Delete hpa-test namespace
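
The "wait for HPA to scale above 1 replica" check could be written as a bounded polling loop. This is a sketch under assumptions: the HPA name cpu-burner is hypothetical (the description does not name the resource), and the attempt count and interval are placeholders. It also assumes the cluster has a metrics source such as metrics-server, since the HPA cannot compute CPU utilization without one.

```shell
#!/usr/bin/env bash
# Sketch only: HPA_NAME is a hypothetical name, not taken from the PR.

NS="hpa-test"
HPA_NAME="${HPA_NAME:-cpu-burner}"

# Prints the replica count currently reported in the HPA status.
current_replicas() {
  kubectl get hpa "$HPA_NAME" -n "$NS" -o jsonpath='{.status.currentReplicas}'
}

# Usage: wait_for_scale_up ATTEMPTS INTERVAL_SECONDS
# Returns 0 as soon as the HPA reports more than 1 replica, 1 on timeout.
wait_for_scale_up() {
  local attempts="$1" interval="$2" i=0 replicas
  while [ "$i" -lt "$attempts" ]; do
    replicas="$(current_replicas 2>/dev/null)"
    case "$replicas" in
      # Empty or non-numeric output is treated as "not scaled yet".
      '' | *[!0-9]*) ;;
      *) [ "$replicas" -gt 1 ] && return 0 ;;
    esac
    i=$((i + 1))
    sleep "$interval"
  done
  return 1
}
```

A verify script built on this might call, for example, wait_for_scale_up 30 10 and exit with its return code, giving the autoscaler up to five minutes to react to the load.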

3. statefulset-lifecycle

Goal: Validate StatefulSet scaling and data persistence.

Setup:
- Create namespace statefulset-test
- Apply headless Service db
- Deploy StatefulSet db (5 replicas, 1Gi PVC) writing test data
Script:
- Scale down to 2 replicas, confirm only db-0 & db-1 remain
Verify: Pod counts and persistent storage checks
Cleanup: Delete statefulset-test namespace
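
The pod-count part of the verification can be sketched as below. The namespace and StatefulSet name come from the description above; the function names are assumptions, and the sketch assumes the namespace contains no pods other than those of the db StatefulSet. The storage check is left out because the description does not say where the test data is written.

```shell
#!/usr/bin/env bash
# Sketch only: function names are assumptions; the storage check is omitted.

NS="statefulset-test"

# The step the agent is expected to perform.
scale_down() {
  kubectl scale statefulset/db --replicas=2 -n "$NS"
}

# Succeeds only if the whitespace-separated pod names in $1 are exactly
# db-0 and db-1, in any order.
only_first_two_remain() {
  local sorted
  # $1 is deliberately unquoted so that each name becomes its own line.
  sorted="$(printf '%s\n' $1 | sort | tr '\n' ' ')"
  [ "$sorted" = "db-0 db-1 " ]
}

# Lists the pods in the namespace and checks the remaining names.
verify_scale_down() {
  local names
  names="$(kubectl get pods -n "$NS" \
    -o jsonpath='{.items[*].metadata.name}')" || return 1
  only_first_two_remain "$names"
}
```

Pods with higher ordinals are removed one at a time and may still be listed while terminating, so a real verify script would likely poll verify_scale_down until it succeeds or a timeout expires.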

Motivation

Covers critical Kubernetes workflows: rolling updates, autoscaling, stateful workloads
Enhances benchmark suite for real-world LLM-driven kubectl agents
Maintains consistency with existing task structure and conventions

Testing

Executed ./k8s-bench run --task-pattern rollout,hpa,statefulset against a Kind cluster
Confirmed a zero exit code on success and a non-zero exit code on failure conditions
Verified no regressions in existing tasks

No breaking changes introduced. All scripts are idempotent and contained within their own namespaces.
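
One common way to keep namespace setup and teardown idempotent is shown below. This is a sketch of a general kubectl idiom, not necessarily what the scripts in this PR do; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: a general idiom, not necessarily the PR's implementation.

# Creates the namespace if it is missing and leaves it unchanged otherwise:
# the manifest is rendered client-side and then applied.
ensure_namespace() {
  kubectl create namespace "$1" --dry-run=client -o yaml | kubectl apply -f -
}

# Deletes the namespace; succeeds even if it no longer exists.
teardown_namespace() {
  kubectl delete namespace "$1" --ignore-not-found
}
```

With these helpers, rerunning setup.sh or cleanup.sh after a partial failure does not error out on an already existing or already deleted namespace.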

May 08 '25 06:05 tuannvm

K8s-bench Evaluation Results

Model Performance Summary

Model	Success	Fail
gpt-4.1	2	1
Total	2	1

Overall Summary

Total Runs: 3
Overall Success: 2 (66%)
Overall Fail: 1 (33%)

Model: gpt-4.1

Task	Provider	Result
horizontal-pod-autoscaler	openai	✅ success
rolling-update-deployment	openai	✅ success
statefulset-lifecycle	openai	❌

gpt-4.1 Summary

Total: 3
Success: 2 (66%)
Fail: 1 (33%)

May 08 '25 06:05 tuannvm

K8s-bench Evaluation Results

Model Performance Summary

Model	Success	Fail
gpt-4.1	3	0
o4-mini	0	1
Total	3	1

Overall Summary

Total Runs: 4
Overall Success: 3 (75%)
Overall Fail: 1 (25%)

Model: gpt-4.1

Task	Provider	Result
horizontal-pod-autoscaler	openai	✅ success
rolling-update-deployment	openai	✅ success
statefulset-lifecycle	openai	✅ success

gpt-4.1 Summary

Total: 3
Success: 3 (100%)
Fail: 0 (0%)

Model: o4-mini

Task	Provider	Result
statefulset-lifecycle	openai	❌

o4-mini Summary

Total: 1
Success: 0 (0%)
Fail: 1 (100%)

Report generated on May 8, 2025 at 9:18 PM

May 09 '25 04:05 tuannvm

feat(k8s-bench): add scripts for HPA, rolling update, StatefulSet tasks
