ci(max): Only run evals with `evals-ready` label
## Problem
Running AI evals on every commit is somewhat expensive, even when limited to Max AI PRs. (Slack thread.)
## Changes
As discussed in the Slack thread, evals now run on PRs only when the `evals-ready` label is present. Upside: you can now also request evals on draft PRs.
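For context, this kind of gate is just a label check in the workflow's `if` condition. Below is a minimal sketch of how such a gate looks in GitHub Actions; the job name `ai-evals` and the steps are illustrative assumptions, not the repo's actual workflow:

```yaml
# Hypothetical sketch; real workflow file, job names, and steps may differ.
on:
  pull_request:
    # `labeled`/`unlabeled` make adding or removing the label (re)trigger the run
    types: [opened, synchronize, reopened, labeled, unlabeled]

jobs:
  ai-evals:
    # Run evals only when the PR carries the `evals-ready` label.
    # There is deliberately no draft check, so labeled draft PRs get evals too.
    if: contains(github.event.pull_request.labels.*.name, 'evals-ready')
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # ...eval steps...
```

To request evals on a PR (draft or otherwise), add the label, e.g. via the GitHub CLI: `gh pr edit <pr-number> --add-label evals-ready`.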
## How did you test this code?
Using here.
## 🧠 AI eval results
Evaluated 6 experiments, comprising 19 metrics.
| Experiment | Metric | Score |
|---|---|---|
| **funnel** | 🆕 QueryKindSelection | 100.00% |
| | 🆕 plan_correctness | 90.50% |
| | 🆕 query_and_plan_alignment | 88.67% |
| | 🆕 time_range_relevancy | 94.67% |
| **memory** | 🆕 ToolRelevance | 97.89% |
| | 🆕 memory_content_relevance | 91.43% |
| **retention** | 🆕 QueryKindSelection | 100.00% |
| | 🆕 plan_correctness | 65.00% |
| | 🆕 query_and_plan_alignment | 53.57% |
| | 🆕 time_range_relevancy | 94.64% |
| **root** | 🆕 ToolRelevance | 65.02% |
| **sql** | 🆕 QueryKindSelection | 0.00% |
| | 🆕 plan_correctness | 100.00% |
| | 🆕 query_and_plan_alignment | 50.00% |
| | 🆕 time_range_relevancy | 100.00% |
| **trends** | 🆕 QueryKindSelection | 100.00% |
| | 🆕 plan_correctness | 78.57% |
| | 🆕 query_and_plan_alignment | 85.00% |
| | 🆕 time_range_relevancy | 96.25% |

Avg. case performance:

| Experiment | ⏱️ Avg. time | 🔢 Avg. tokens | 💵 Avg. token cost |
|---|---|---|---|
| funnel | 110.92 s | 6369 | $0.0167 |
| memory | 7.00 s | 1217 | $0.0034 |
| retention | 32.24 s | 5759 | $0.0161 |
| root | 5.85 s | 0 | n/a |
| sql | 14.73 s | 16263 | $0.0416 |
| trends | 52.89 s | 9937 | $0.0268 |
Triggered by this commit.