ci(max): Only run evals with `evals-ready` label
## Problem
Running AI evals on every commit is somewhat expensive, even when limited to Max AI PRs. (Slack thread.)
## Changes
As discussed in the Slack thread, evals now run on PRs only when the `evals-ready` label is present. Upside: you can now also request evals on draft PRs.
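For context, this kind of gate is just a label check in the workflow's `if` condition. Below is a minimal sketch of how such a gate looks in GitHub Actions; the job name `ai-evals` and the steps are illustrative assumptions, not the repo's actual workflow:

```yaml
# Hypothetical sketch; real workflow file, job names, and steps may differ.
on:
  pull_request:
    # `labeled`/`unlabeled` make adding or removing the label (re)trigger the run
    types: [opened, synchronize, reopened, labeled, unlabeled]

jobs:
  ai-evals:
    # Run evals only when the PR carries the `evals-ready` label.
    # There is deliberately no draft check, so labeled draft PRs get evals too.
    if: contains(github.event.pull_request.labels.*.name, 'evals-ready')
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # ...eval steps...
```

To request evals on a PR (draft or otherwise), add the label, e.g. via the GitHub CLI: `gh pr edit <pr-number> --add-label evals-ready`.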
## How did you test this code?
Using here.
## 🧠 AI eval results
Evaluated 6 experiments, comprising 19 metrics.
| Experiment | Metric | Score |
|---|---|---|
| **funnel** | 🆕 QueryKindSelection | 100.00% |
| | 🆕 plan_correctness | 90.50% |
| | 🆕 query_and_plan_alignment | 88.67% |
| | 🆕 time_range_relevancy | 94.67% |
| **memory** | 🆕 ToolRelevance | 97.89% |
| | 🆕 memory_content_relevance | 91.43% |
| **retention** | 🆕 QueryKindSelection | 100.00% |
| | 🆕 plan_correctness | 65.00% |
| | 🆕 query_and_plan_alignment | 53.57% |
| | 🆕 time_range_relevancy | 94.64% |
| **root** | 🆕 ToolRelevance | 65.02% |
| **sql** | 🆕 QueryKindSelection | 0.00% |
| | 🆕 plan_correctness | 100.00% |
| | 🆕 query_and_plan_alignment | 50.00% |
| | 🆕 time_range_relevancy | 100.00% |
| **trends** | 🆕 QueryKindSelection | 100.00% |
| | 🆕 plan_correctness | 78.57% |
| | 🆕 query_and_plan_alignment | 85.00% |
| | 🆕 time_range_relevancy | 96.25% |

Avg. case performance:

| Experiment | ⏱️ Avg. time | 🔢 Avg. tokens | 💵 Avg. token cost |
|---|---|---|---|
| funnel | 110.92 s | 6369 | $0.0167 |
| memory | 7.00 s | 1217 | $0.0034 |
| retention | 32.24 s | 5759 | $0.0161 |
| root | 5.85 s | 0 | n/a |
| sql | 14.73 s | 16263 | $0.0416 |
| trends | 52.89 s | 9937 | $0.0268 |
Triggered by this commit.