evals for api review command
This is experimental.
Part of the cluster infra f2f hackathon held last week.
Vibe coded test suite to test our vibe coded commands to review our vibe coded code.
Uses LLM as a judge (claude) to review output of the /api-review command.
Each suite can be configured to use a different model. The default judge is haiku 4.5, golden runs on sonnet, and integration on opus. I'm not sure these are the best choices (other than haiku for the judge), but we should be able to gather some data ๐ค .
Directory Structure
tests/eval/testdata/
โโโ golden/ # Base truth tests - single isolated issues
โ โโโ missing-optional-doc/
โ โ โโโ patch.diff # Triggers ONLY missing-optional-doc
โ โ โโโ expected.txt
โ โโโ undocumented-enum/
โ โ โโโ patch.diff # Triggers ONLY undocumented-enum
โ โ โโโ expected.txt
โ โโโ missing-featuregate/
โ โ โโโ patch.diff # Triggers ONLY missing-featuregate
โ โ โโโ expected.txt
โ โโโ valid-api-change/
โ โโโ patch.diff # Triggers NO issues
โ โโโ expected.txt
โโโ integration/ # Complex scenarios - multiple issues
โโโ new-field-all-issues/
โ โโโ patch.diff # Triggers multiple issues together
โ โโโ expected.txt
โโโ partial-documentation/
โโโ patch.diff
โโโ expected.txt
Test Case Format
patch.diff
Standard git diff format:
diff --git a/config/v1/types.go b/config/v1/types.go
--- a/config/v1/types.go
+++ b/config/v1/types.go
@@ -10,0 +11,5 @@
+// MyField does something
+// +optional
+// +kubebuilder:validation:Enum=Foo;Bar
+MyField string `json:"myField"`
expected.txt
One expected issue per line:
enum values Foo and Bar not documented in comment
optional field does not explain behavior when omitted
Empty file means the API change should pass review with no issues.
Note: Order of issues in expected.txt does not matter. Comparison uses semantic matching, not exact string matching.
Example output where we catch two instances of the command hallucinating new rules:
Look at Judge result and Judge reason for explanations of decisions made.
Details
Running Suite: API Review Eval Suite - /Users/tbarberb/projects/api/tests/eval
==============================================================================
Random Seed: 1764259960
Will run 4 of 4 specs
------------------------------
[BeforeSuite]
/Users/tbarberb/projects/api/tests/eval/eval_test.go:80
STEP: verifying local AGENTS.md exists @ 11/27/25 17:12:40.662
STEP: verifying local .claude/commands/api-review.md exists @ 11/27/25 17:12:40.662
STEP: creating temp directory for clone @ 11/27/25 17:12:40.662
STEP: shallow cloning openshift/api @ 11/27/25 17:12:40.662
STEP: copying local AGENTS.md to temp clone @ 11/27/25 17:12:54.327
STEP: copying local .claude directory to temp clone @ 11/27/25 17:12:54.328
[BeforeSuite] PASSED [13.667 seconds]
------------------------------
API Review Evaluation Golden Tests should correctly identify single issues missing-optional-doc
/Users/tbarberb/projects/api/tests/eval/eval_test.go:199
STEP: verifying origin remote exists @ 11/27/25 17:12:54.329
STEP: resetting repo to clean state @ 11/27/25 17:12:54.341
STEP: copying local AGENTS.md to temp clone @ 11/27/25 17:12:54.832
STEP: copying local .claude directory to temp clone @ 11/27/25 17:12:54.832
STEP: reading and applying patch @ 11/27/25 17:12:54.833
STEP: running API review via Claude (claude-sonnet-4-5@20250929) @ 11/27/25 17:12:54.847
STEP: comparing results with Claude judge (claude-haiku-4-5@20251001) @ 11/27/25 17:14:19.61
Judge result: pass=true, reason=The review correctly identified the core issue: the optional field does not explain behavior when omitted. The output semantic matches the expected issue by identifying that the comment lacks documentation of what happens when `customLogoURL` is not set (i.e., the default logo is used). The review provides appropriate context about the validation pattern allowing empty strings and suggests a fix that documents the omitted behavior.
โข [88.746 seconds]
------------------------------
API Review Evaluation Golden Tests should correctly identify single issues undocumented-enum
/Users/tbarberb/projects/api/tests/eval/eval_test.go:199
STEP: verifying origin remote exists @ 11/27/25 17:14:23.075
STEP: resetting repo to clean state @ 11/27/25 17:14:23.097
STEP: copying local AGENTS.md to temp clone @ 11/27/25 17:14:23.263
STEP: copying local .claude directory to temp clone @ 11/27/25 17:14:23.263
STEP: reading and applying patch @ 11/27/25 17:14:23.264
STEP: running API review via Claude (claude-sonnet-4-5@20250929) @ 11/27/25 17:14:23.278
STEP: comparing results with Claude judge (claude-haiku-4-5@20251001) @ 11/27/25 17:16:31.141
Judge result: pass=true, reason=The review correctly identified that enum values Light and Dark are not documented in the comment. The output expanded on this core issue by also noting incomplete enum documentation, missing specific default behavior (Light vs Dark), and formatting issues, but all of these are semantic expansions of the same fundamental problem: the enum values are not properly documented as required by the API review standards.
โข [131.341 seconds]
------------------------------
API Review Evaluation Golden Tests should correctly identify single issues valid-api-change
/Users/tbarberb/projects/api/tests/eval/eval_test.go:199
STEP: verifying origin remote exists @ 11/27/25 17:16:34.417
STEP: resetting repo to clean state @ 11/27/25 17:16:34.433
STEP: copying local AGENTS.md to temp clone @ 11/27/25 17:16:34.589
STEP: copying local .claude directory to temp clone @ 11/27/25 17:16:34.59
STEP: reading and applying patch @ 11/27/25 17:16:34.59
STEP: running API review via Claude (claude-sonnet-4-5@20250929) @ 11/27/25 17:16:34.603
STEP: comparing results with Claude judge (claude-haiku-4-5@20251001) @ 11/27/25 17:17:26.436
Judge result: pass=false, reason=The expected issues list is empty, but the API review output identifies a real issue: field documentation mismatch between 'bannerText' description and its enum validation. Since there are no expected issues to match against, and the review found an unrelated issue, this fails the comparison criteria.
[FAILED] in [It] - /Users/tbarberb/projects/api/tests/eval/eval_test.go:304 @ 11/27/25 17:17:29.727
โข [FAILED] [55.310 seconds]
API Review Evaluation Golden Tests should correctly identify single issues [It] valid-api-change
/Users/tbarberb/projects/api/tests/eval/eval_test.go:199
[FAILED] API review did not match expected issues.
Judge reason: The expected issues list is empty, but the API review output identifies a real issue: field documentation mismatch between 'bannerText' description and its enum validation. Since there are no expected issues to match against, and the review found an unrelated issue, this fails the comparison criteria.
Review output:
โ
Linting checks passed.
Now analyzing the changed file for documentation compliance:
## API Review Report
### ๐ File: `config/v1/types_console.go`
**Status: โ FAILED**
---
**Line 49: Field name does not match documentation**
**Current (problematic) code:**
```go
// bannerText is an optional field that specifies a custom banner message
// to display at the top of the console. Valid values are "Info", "Warning",
// and "Error" which control the banner styling. When omitted, no banner
// is displayed.
// +optional
// +kubebuilder:validation:Enum=Info;Warning;Error
BannerText string `json:"bannerText,omitempty"`
```
**Suggested change:**
```diff
- // bannerText is an optional field that specifies a custom banner message
+ // bannerText is an optional field that specifies the type of banner message
// to display at the top of the console. Valid values are "Info", "Warning",
// and "Error" which control the banner styling. When omitted, no banner
// is displayed.
```
**Explanation:** The field name is `bannerText` but the documentation says it "specifies a custom banner message". However, this field only accepts enum values "Info", "Warning", or "Error" - it doesn't accept arbitrary custom messages. The documentation should clarify that this field specifies the **type** of banner, not the message content itself. If this truly is meant to hold custom banner text, then the `Enum` validation is incorrect and should be removed.
---
## Summary
โ **Review Failed**: 1 issue found
The new `BannerText` field has a mismatch between its documentation and validation constraints. The comment suggests it holds "a custom banner message" but the enum validation only allows three specific values ("Info", "Warning", "Error"). Please clarify:
1. If this field should accept arbitrary custom text, remove the `Enum` validation
2. If this field should only accept enum values, update the documentation to reflect that it specifies the banner **type**, not custom message content
Expected issues:
Expected
<bool>: false
to be true
In [It] at: /Users/tbarberb/projects/api/tests/eval/eval_test.go:304 @ 11/27/25 17:17:29.727
------------------------------
API Review Evaluation Integration Tests should correctly identify multiple issues new-field-multiple-issues
/Users/tbarberb/projects/api/tests/eval/eval_test.go:216
STEP: verifying origin remote exists @ 11/27/25 17:17:29.742
STEP: resetting repo to clean state @ 11/27/25 17:17:29.757
STEP: copying local AGENTS.md to temp clone @ 11/27/25 17:17:29.915
STEP: copying local .claude directory to temp clone @ 11/27/25 17:17:29.915
STEP: reading and applying patch @ 11/27/25 17:17:29.916
STEP: running API review via Claude (claude-opus-4-5@20251101) @ 11/27/25 17:17:29.931
STEP: comparing results with Claude judge (claude-haiku-4-5@20251001) @ 11/27/25 17:18:31.13
Judge result: pass=false, reason=The review identified an extra issue not in the expected list: 'missing URL validation pattern for customLogoURL' was reported as a separate finding beyond the expected 'missing length validation for customLogoURL'. While the review correctly identified all 4 expected issues (optional field behavior, URL validation, FeatureGate, and length validation), it reported validation as a separate concern than length validation, which appears to be treating them as distinct problems rather than recognizing they may be the same underlying validation gap.
[FAILED] in [It] - /Users/tbarberb/projects/api/tests/eval/eval_test.go:304 @ 11/27/25 17:18:34.472
โข [FAILED] [64.730 seconds]
API Review Evaluation Integration Tests should correctly identify multiple issues [It] new-field-multiple-issues
/Users/tbarberb/projects/api/tests/eval/eval_test.go:216
[FAILED] API review did not match expected issues.
Judge reason: The review identified an extra issue not in the expected list: 'missing URL validation pattern for customLogoURL' was reported as a separate finding beyond the expected 'missing length validation for customLogoURL'. While the review correctly identified all 4 expected issues (optional field behavior, URL validation, FeatureGate, and length validation), it reported validation as a separate concern than length validation, which appears to be treating them as distinct problems rather than recognizing they may be the same underlying validation gap.
Review output:
## API Review Results
**File reviewed:** `config/v1/types_console.go`
**Linting:** โ
Passed (0 issues)
---
### โ Review FAILED - Documentation Issues Found
**+44: Missing optional field behavior documentation**
**Current (problematic) code:**
```go
// customLogoURL specifies a URL for a custom logo image.
// +optional
CustomLogoURL string `json:"customLogoURL,omitempty"`
```
**Suggested change:**
```diff
- // customLogoURL specifies a URL for a custom logo image.
+ // customLogoURL specifies a URL for a custom logo image.
+ // When omitted, the default OpenShift logo will be displayed.
```
**Explanation:** Optional fields MUST explain the behavior when the field is not present.
---
**+44: Missing URL validation**
**Current (problematic) code:**
```go
// customLogoURL specifies a URL for a custom logo image.
// +optional
CustomLogoURL string `json:"customLogoURL,omitempty"`
```
**Suggested change:**
```diff
// customLogoURL specifies a URL for a custom logo image.
+ // The URL must be a valid HTTPS URL.
+ // When omitted, the default OpenShift logo will be displayed.
// +optional
+ // +kubebuilder:validation:Pattern=`^$|^https://.*$`
CustomLogoURL string `json:"customLogoURL,omitempty"`
```
**Explanation:** URL fields should have validation to ensure proper format. Compare with `LogoutRedirect` field at line 83 which validates HTTPS URLs with a pattern.
---
**+44: New stable API field should be feature-gated**
**Current (problematic) code:**
```go
CustomLogoURL string `json:"customLogoURL,omitempty"`
```
**Suggested change:**
```diff
+ // +openshift:enable:FeatureGate=ConsoleCustomLogo
CustomLogoURL string `json:"customLogoURL,omitempty"`
```
**Explanation:** Per CLAUDE.md guidelines, new fields on stable APIs should be introduced behind a feature gate.
---
### Summary
| Check | Status |
|-------|--------|
| Linting | โ
Pass |
| Field Documentation | โ
Pass |
| Optional Behavior Documented | โ Fail |
| Validation Markers Present | โ Fail |
| Feature Gate for New Field | โ Fail |
Expected issues:
optional field customLogoURL does not explain behavior when omitted
missing URL validation pattern for customLogoURL
missing FeatureGate for new field on stable API
missing length validation for customLogoURL
Expected
<bool>: false
to be true
In [It] at: /Users/tbarberb/projects/api/tests/eval/eval_test.go:304 @ 11/27/25 17:18:34.472
------------------------------
[AfterSuite]
/Users/tbarberb/projects/api/tests/eval/eval_test.go:114
STEP: cleaning up temp directory @ 11/27/25 17:18:34.49
[AfterSuite] PASSED [0.853 seconds]
------------------------------
Summarizing 2 Failures:
[FAIL] API Review Evaluation Golden Tests should correctly identify single issues [It] valid-api-change
/Users/tbarberb/projects/api/tests/eval/eval_test.go:304
[FAIL] API Review Evaluation Integration Tests should correctly identify multiple issues [It] new-field-multiple-issues
/Users/tbarberb/projects/api/tests/eval/eval_test.go:304
Ran 4 of 4 Specs in 354.680 seconds
FAIL! -- 2 Passed | 2 Failed | 0 Pending | 0 Skipped
--- FAIL: TestEval (354.68s)
FAIL
Ginkgo ran 1 suite in 5m55.161278125s
Test Suite Failed
Pipeline controller notification
This repository is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.
For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.
This repository is configured in: LGTM mode
Hello @theobarberbany! Some important instructions when contributing to openshift/api: API design plays an important part in the user experience of OpenShift and as such API PRs are subject to a high level of scrutiny to ensure they follow our best practices. If you haven't already done so, please review the OpenShift API Conventions and ensure that your proposed changes are compliant. Following these conventions will help expedite the api review process for your PR.
[!IMPORTANT]
Review skipped
Auto reviews are limited based on label configuration.
:no_entry_sign: Excluded labels (none allowed) (1)
- do-not-merge/work-in-progress
Please check the settings in the CodeRabbit UI or the
.coderabbit.yamlfile in this repository. To trigger a single review, invoke the@coderabbitai reviewcommand.You can disable this status message by setting the
reviews.review_statustofalsein the CodeRabbit configuration file.
โจ Finishing touches
๐งช Generate unit tests (beta)
- [ ] Create PR with unit tests
- [ ] Post copyable unit tests in a comment
Comment @coderabbitai help to get the list of available commands and usage tips.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: Once this PR has been reviewed and has the lgtm label, please assign everettraven for approval. For more information see the Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
@theobarberbany: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| ci/prow/integration | a999d82f40e1c46f1ee946c5b3c891e5614007ab | link | true | /test integration |
Full PR test history. Your PR dashboard.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.