feat(max): Query planning using Claude interleaved thinking
Problem
Currently, Max chooses the kind of query to generate based only on a general query description, before it gets to consider the actual logic of the query.
This means Max trips up on questions that aren't answerable with trends/funnels/retention due to the limitations of those insights. What it should be doing is falling back to SQL when necessary.
Changes
Re-architecture.
- Instead of a separate planner node for each insight type: a shared QueryPlannerNode.
- Instead of insight type decided by RootNode: QueryPlannerNode decides, knowing the full JSON schemas of each insight type other than SQL.
- Instead of ReAct: interleaved thinking, along with an upgrade to Claude Sonnet 4.
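The core of the change, deciding the query kind after planning and falling back to SQL when the structured insight types can't express the plan, can be sketched roughly as below. This is a minimal illustration, not the PR's implementation: `QueryKind`, `QueryPlan`, `finalize_plan`, and the `expressible_in_schema` flag are hypothetical names introduced here.

```python
from dataclasses import dataclass
from enum import Enum


class QueryKind(str, Enum):
    TRENDS = "trends"
    FUNNEL = "funnel"
    RETENTION = "retention"
    SQL = "sql"


@dataclass
class QueryPlan:
    kind: QueryKind
    plan: str


def finalize_plan(plan: str, proposed_kind: str, expressible_in_schema: bool) -> QueryPlan:
    """Pick the query kind once the plan is known (hypothetical helper)."""
    try:
        kind = QueryKind(proposed_kind)
    except ValueError:
        # Unknown kind proposed by the model: fall back to SQL.
        kind = QueryKind.SQL
    if kind is not QueryKind.SQL and not expressible_in_schema:
        # The plan needs logic that the insight schema cannot express.
        kind = QueryKind.SQL
    return QueryPlan(kind=kind, plan=plan)
```

The point of the sketch is only the ordering: the kind is chosen with the plan in hand, rather than from a general description up front.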
Did you write or update any docs for this change?
- [ ] I've added or updated the docs
- [ ] I've reached out for help from the docs team
- [x] No docs needed for this change
How did you test this code?
Since #32751, evals also score the choice of query kind, so we should see no regression on the existing dataset.
UI snapshots have been updated
1 snapshot change in total. 0 added, 1 modified, 0 deleted:
- chromium: 0 added, 1 modified, 0 deleted (diff for shard 3)
- webkit: 0 added, 0 modified, 0 deleted
Triggered by this commit.
UI snapshots have been updated
1 snapshot change in total. 0 added, 1 modified, 0 deleted:
[!CAUTION]
Detected flapping snapshots
These snapshots have auto-updated more than once since the last human commit:
- scenes-app-experiments--experiment-with-funnel-metric--light.png (chromium, shard 3)
The flippy-flappies are deadly and must be fixed ASAP. They're productivity killers. Run `pnpm storybook` locally and make the fix now. (Often, the cause is `ResizeObserver` being used instead of the better CSS container queries.)
- chromium: 0 added, 1 modified, 0 deleted (diff for shard 3)
- webkit: 0 added, 0 modified, 0 deleted
Triggered by this commit.
Size Change: 0 B
Total Size: 2.61 MB
View Unchanged
| Filename | Size |
|---|---|
| frontend/dist/toolbar.js | 2.61 MB |
@Twixes The new approach clearly works better with local data. Below are some critical thoughts and suggestions for improvement. I'm definitely not opposed to shipping it, so feel free to ignore them.
My main concerns are the limits, generation price, and latency:
- We're still on 200k TPM. Let's reach out to AWS first to see if we can get more. My attempts with Anthropic were unsuccessful.
- The cost per turn is now around 13 cents (with cache), not including DWH, vs 3 cents before. We can shave tokens by transforming JSON schemas to a prompt (or reusing previous prompts) and querying/searching DWH on demand if we know the DWH schema is large.
- The latency is ~14-20s vs 4 seconds before. We should evaluate whether extended thinking on every turn significantly improves quality.
Potentially, the last two points are optimizable by parallel function calling, as you mentioned, but runs still get pretty expensive.
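To put the cost and latency figures above in relative terms, here is a quick back-of-the-envelope calculation. The inputs are the approximate numbers quoted in this comment (13 vs 3 cents per turn, 14-20 s vs 4 s); the variable names are just for illustration.

```python
def ratio(new: float, old: float) -> float:
    """How many times larger `new` is than `old`."""
    return new / old


# Approximate figures quoted in the comment above.
cost_ratio = ratio(0.13, 0.03)
latency_ratio_low = ratio(14.0, 4.0)
latency_ratio_high = ratio(20.0, 4.0)
extra_cost_per_1000_turns = (0.13 - 0.03) * 1000

print(f"cost per turn: {cost_ratio:.1f}x")
print(f"latency: {latency_ratio_low:.1f}x to {latency_ratio_high:.1f}x")
print(f"extra cost per 1,000 turns: ${extra_cost_per_1000_turns:.0f}")
```

That works out to roughly 4.3x the cost per turn, 3.5x to 5x the latency, and about $100 extra per 1,000 turns, before any DWH tokens.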
Some broader thoughts
Lack of evals
This change is very difficult to evaluate without a baseline.
Schemas in the prompt
The schemas in the prompt feel suboptimal. This approach is hard to scale to more insight types, the event/action/cohort taxonomy, and the DWH schema. Additionally, since we include the schemas in the prompt, we could just use them as tools and remove the generator step. I guess it would work roughly the same. The schema generator was separated out to shave tokens in the first place.
I think we can do a better job by writing prompts for insight types. I'm not sure that the schemas provide more value than the prompt.
Router
We should try to achieve better selection through a fast router model (for example, Gemini Flash). We have never really tried to do this, but the insight selection logic is possible to describe:
- PostHog-native/Data Warehouse data, supported aggregation, and supported visualizations -> Insights.
- PostHog-native/Data Warehouse data, unsupported aggregation, and supported visualizations -> SQL.
- Otherwise, ask the user for guidance.
Existing eval cases don't seem uncrackable for the router. In case the taxonomy agent fails, we can fall back to the router, but the only case I can now see is a wrong selection of the insight type (for example, trends instead of SQL). This is worth trying.
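The selection logic described above fits in a small decision function. The sketch below only encodes the three rules; in practice the traits would have to be produced by the fast router model. `Route`, `QueryTraits`, and `route` are hypothetical names, not existing code.

```python
from dataclasses import dataclass
from enum import Enum


class Route(str, Enum):
    INSIGHTS = "insights"
    SQL = "sql"
    ASK_USER = "ask_user"


@dataclass(frozen=True)
class QueryTraits:
    data_available: bool  # PostHog-native or Data Warehouse data covers the question
    aggregation_supported: bool  # aggregation is expressible by insights
    visualization_supported: bool  # requested visualization is supported


def route(traits: QueryTraits) -> Route:
    """Apply the three routing rules from the list above."""
    if traits.data_available and traits.visualization_supported:
        # Supported aggregation -> insights; unsupported aggregation -> SQL.
        return Route.INSIGHTS if traits.aggregation_supported else Route.SQL
    # Anything else: ask the user for guidance.
    return Route.ASK_USER
```

For example, `route(QueryTraits(True, False, True))` returns `Route.SQL`, which corresponds to the "trends instead of SQL" mis-selection mentioned above.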
Lightweight updates
In traces, I frequently find queries that don't require taxonomy lookups but rather aggregation/cosmetic/logic changes, which are easy to do using plans. This makes me think that we may need to include the taxonomy in the prompt for the initial generation but move it out to a tool in subsequent steps. The same applies to the DWH schema. I haven't explored this in detail yet, and the new planner has not solved it.
Suggestions
- I would drop the `final_answer` tool in favor of the last message. It should generally reduce latency and increase accuracy, and we don't need a structured output for the plan.
- Include the DWH schema in the query planner prompt. The prompt mentions that it is included, but it wasn't for me.
- "ULTRA THINK".
- Maybe a "think" tool instead of extended thinking on every turn.
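As a rough illustration of the "think" tool idea: the model gets a no-op tool it can call only when it wants to reason, instead of paying for extended thinking on every turn. The definition below assumes the `name`/`description`/`input_schema` shape of Anthropic-style tool definitions; `THINK_TOOL`, `handle_think`, and the wording of the descriptions are hypothetical.

```python
from typing import Any

# Hypothetical tool definition; the description text is an assumption.
THINK_TOOL: dict[str, Any] = {
    "name": "think",
    "description": "Use this tool to reason about the query plan before acting. It has no side effects.",
    "input_schema": {
        "type": "object",
        "properties": {
            "thought": {"type": "string", "description": "Free-form reasoning."},
        },
        "required": ["thought"],
    },
}


def handle_think(tool_input: dict[str, Any], scratchpad: list[str]) -> str:
    """Record the thought and return a short acknowledgement to the model."""
    scratchpad.append(tool_input["thought"])
    return "Thought recorded."
```

Whether this matches the quality of extended thinking would need to be checked against the evals.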
AI eval results
Evaluated 6 experiments, comprising 3 metrics.
funnel
Avg. case performance: 30.28 s, 0 tokens
memory
- ToolRelevance: 97.87%
- memory_content_relevance: 88.57%
Avg. case performance: 6.58 s, 1212 tokens, $0.0033 in tokens
retention
Avg. case performance: 9.09 s, 0 tokens
root
- ToolRelevance: 61.63%
Avg. case performance: 5.20 s, 0 tokens
sql
Avg. case performance: 4.35 s, 0 tokens
trends
Avg. case performance: 9.88 s, 0 tokens
Triggered by this commit.
AI eval results
Evaluated 6 experiments, comprising 3 metrics.
funnel
Avg. case performance: 29.06 s, 0 tokens
memory
- ToolRelevance: 98.96%
- memory_content_relevance: 95.71%
Avg. case performance: 5.77 s, 1214 tokens, $0.0034 in tokens
retention
Avg. case performance: 9.14 s, 0 tokens
root
- ToolRelevance: 66.08%
Avg. case performance: 3.87 s, 0 tokens
sql
Avg. case performance: 3.16 s, 0 tokens
trends
Avg. case performance: 11.86 s, 0 tokens
Triggered by this commit.
Update after various tweaks:
- Tried generating insights straight with the query planner, but found the results to be lower quality than with the plan as an intermediary step.
- Went for o4-mini for speed, and added extraction of reasoning headlines for the Max UI. The results seem quite good: slower than ReAct, but better.
For comparison, the ReAct version doesn't "consider" whether that sidebar activation event is what the user meant, resulting in low-quality output.
UI snapshots have been updated
18 snapshot changes in total. 0 added, 18 modified, 0 deleted:
- chromium: 0 added, 18 modified, 0 deleted (diff for shard 12)
- webkit: 0 added, 0 modified, 0 deleted
Triggered by this commit.
UI snapshots have been updated
104 snapshot changes in total. 0 added, 104 modified, 0 deleted:
- chromium: 0 added, 104 modified, 0 deleted (diff for shard 11, diff for shard 2, diff for shard 1)
- webkit: 0 added, 0 modified, 0 deleted
Triggered by this commit.
AI eval results
Evaluated 9 experiments, comprising 9 metrics.
funnel
- plan_correctness: 0.00%, -90.58% versus baseline (master) (improvements: 0, regressions: 2)
Avg. case performance: 33.08 s, 0 tokens
memory
- ToolRelevance: 98.17%, -0.05% versus baseline (master) (improvements: 1, regressions: 2)
- memory_content_relevance: 90.00%, -1.43% versus baseline (master) (improvements: 0, regressions: 1)
Avg. case performance: 6.01 s, 1217 tokens, $0.0034 in tokens
retention
Avg. case performance: 9.47 s, 0 tokens
root
- ToolRelevance: 61.74%, +6.15% versus baseline (master) (improvements: 2, regressions: 1)
Avg. case performance: 5.37 s, 0 tokens
sql
- plan_correctness: 0.00%, -46.19% versus baseline (master) (improvements: 0, regressions: 2)
Avg. case performance: 14.70 s, 0 tokens
tool_generate_hogql_query
- EmbeddingSimilarity: 75.75%, -5.68% versus baseline (master) (improvements: 0, regressions: 1)
- sql_syntax_correctness: 100.00%, ±0.00% versus baseline (master) (improvements: 0, regressions: 0)
Avg. case performance: 1.55 s, 1365 tokens, $0.0030 in tokens
trends
Avg. case performance: 12.30 s, 0 tokens
ui_context_actions
- ToolRelevance: 68.35%, -25.90% versus baseline (master) (improvements: 0, regressions: 2)
Avg. case performance: 3.13 s, 0 tokens
ui_context_events
- ToolRelevance: 67.31%, -15.36% versus baseline (master) (improvements: 1, regressions: 2)
Avg. case performance: 2.95 s, 0 tokens
Triggered by this commit.
UI snapshots have been updated
36 snapshot changes in total. 0 added, 36 modified, 0 deleted:
- chromium: 0 added, 36 modified, 0 deleted (diff for shard 5, diff for shard 12, diff for shard 9, diff for shard 11, diff for shard 8, diff for shard 16, diff for shard 2, diff for shard 1)
- webkit: 0 added, 0 modified, 0 deleted
Triggered by this commit.
AI eval results
Evaluated 9 experiments, comprising 9 metrics.
funnel
- plan_correctness: 0.00%
Avg. case performance: 30.51 s, 0 tokens
memory
- ToolRelevance: 98.22%, +0.00% versus baseline (master) (improvements: 1, regressions: 2)
- memory_content_relevance: 88.57%, -2.86% versus baseline (master) (improvements: 0, regressions: 1)
Avg. case performance: 6.93 s, 1218 tokens, $0.0034 in tokens
retention
Avg. case performance: 9.38 s, 0 tokens
root
- ToolRelevance: 61.90%, +6.31% versus baseline (master) (improvements: 2, regressions: 1)
Avg. case performance: 6.03 s, 0 tokens
sql
- plan_correctness: 0.00%, -46.19% versus baseline (master) (improvements: 0, regressions: 2)
Avg. case performance: 13.50 s, 0 tokens
tool_generate_hogql_query
- EmbeddingSimilarity: 81.42%, ±0.00% versus baseline (master) (improvements: 0, regressions: 0)
- sql_syntax_correctness: 100.00%, ±0.00% versus baseline (master) (improvements: 0, regressions: 0)
Avg. case performance: 1.42 s, 1349 tokens, $0.0029 in tokens
trends
Avg. case performance: 13.69 s, 0 tokens
ui_context_actions
- ToolRelevance: 68.09%, -26.16% versus baseline (master) (improvements: 0, regressions: 2)
Avg. case performance: 3.48 s, 0 tokens
ui_context_events
- ToolRelevance: 67.40%, -15.27% versus baseline (master) (improvements: 1, regressions: 2)
Avg. case performance: 3.21 s, 0 tokens
Triggered by this commit.
AI eval results
Evaluated 9 experiments, comprising 25 metrics.
funnel
- QueryKindSelection: 73.33%
- plan_correctness: 60.58%
- query_and_plan_alignment: 81.11%
- time_range_relevancy: 95.22%
Avg. case performance: 188.89 s, 4941 tokens, $0.0137 in tokens
memory
- ToolRelevance: 98.13%
- memory_content_relevance: 91.43%
Avg. case performance: 5.69 s, 1214 tokens, $0.0034 in tokens
retention
- QueryKindSelection: 66.67%
- plan_correctness: 33.33%
- query_and_plan_alignment: 75.83%
- time_range_relevancy: 100.00%
Avg. case performance: 46.41 s, 3421 tokens, $0.0188 in tokens
root
- ToolRelevance: 61.93%
Avg. case performance: 5.29 s, 0 tokens
sql
- QueryKindSelection: 100.00%, +16.67% versus baseline (master) (improvements: 1, regressions: 0)
- plan_correctness: 42.41%, -4.26% versus baseline (master) (improvements: 4, regressions: 3)
- query_and_plan_alignment: 89.67%, +24.67% versus baseline (master) (improvements: 3, regressions: 0)
- retry_efficiency: 95.06%, +2.47% versus baseline (master) (improvements: 2, regressions: 1)
- sql_syntax_correctness: 100.00%, ±0.00% versus baseline (master) (improvements: 0, regressions: 0)
- time_range_relevancy: 96.67%, -3.33% versus baseline (master) (improvements: 0, regressions: 0)
Avg. case performance: 51.40 s, 962 tokens, $0.0041 in tokens
tool_generate_hogql_query
- EmbeddingSimilarity: 78.25%
- sql_syntax_correctness: 100.00%
Avg. case performance: 1.65 s, 1411 tokens, $0.0031 in tokens
trends
- QueryKindSelection: 80.00%
- plan_correctness: 77.71%
- query_and_plan_alignment: 81.00%
- time_range_relevancy: 96.25%
Avg. case performance: 67.84 s, 10240 tokens, $0.0220 in tokens
ui_context_actions
- ToolRelevance: 68.01%
Avg. case performance: 4.36 s, 0 tokens
ui_context_events
- ToolRelevance: 67.33%
Avg. case performance: 3.07 s, 0 tokens
Triggered by this commit.
AI eval results
Evaluated 9 experiments, comprising 25 metrics.
funnel
- QueryKindSelection: 76.00%
- plan_correctness: 69.92%
- query_and_plan_alignment: 82.90%
- time_range_relevancy: 96.50%
Avg. case performance: 60.54 s, 6239 tokens, $0.0156 in tokens
memory
- ToolRelevance: 88.20%
- memory_content_relevance: 80.48%
Avg. case performance: 5.07 s, 1302 tokens, $0.0036 in tokens
retention
- QueryKindSelection: 75.00%
- plan_correctness: 11.67%
- query_and_plan_alignment: 100.00%
- time_range_relevancy: 100.00%
Avg. case performance: 47.47 s, 2841 tokens, $0.0234 in tokens
root
- ToolRelevance: 60.58%
Avg. case performance: 6.60 s, 0 tokens
sql
- QueryKindSelection: 100.00%
- plan_correctness: 41.48%
- query_and_plan_alignment: 82.00%
- retry_efficiency: 98.77%
- sql_syntax_correctness: 100.00%
- time_range_relevancy: 94.33%
Avg. case performance: 39.94 s, 797 tokens, $0.0034 in tokens
tool_generate_hogql_query
- EmbeddingSimilarity: 72.54%
- sql_syntax_correctness: 100.00%
Avg. case performance: 7.55 s, 1438 tokens, $0.0033 in tokens
trends
- QueryKindSelection: 43.48%
- plan_correctness: 90.83%
- query_and_plan_alignment: 64.57%
- time_range_relevancy: 96.74%
Avg. case performance: 45.41 s, 4945 tokens, $0.0108 in tokens
ui_context_actions
- ToolRelevance: 56.84%
Avg. case performance: 8.67 s, 0 tokens
ui_context_events
- ToolRelevance: 67.38%
Avg. case performance: 7.45 s, 0 tokens
Triggered by this commit.
This issue has 2052 words at 13 comments. Issues this long are hard to read or contribute to, and tend to take very long to reach a conclusion. Instead, why not:
- Write some code and submit a pull request! Code wins arguments
- Have a sync meeting to reach a conclusion
- Create a Request for Comments and submit a PR with it to the meta repo or product internal repo
Is this issue intended to be sprawling? Consider adding label epic or sprint to indicate this.
AI eval results
Evaluated 9 experiments, comprising 25 metrics.
funnel
- QueryKindSelection: 68.00%
- plan_correctness: 71.58%
- query_and_plan_alignment: 77.50%
- time_range_relevancy: 95.00%
Avg. case performance: 59.62 s, 5837 tokens, $0.0151 in tokens
memory
- ToolRelevance: 91.56%
- memory_content_relevance: 85.24%
Avg. case performance: 12.04 s, 1303 tokens, $0.0036 in tokens
retention
- QueryKindSelection: 85.71%
- plan_correctness: 37.67%
- query_and_plan_alignment: 77.14%
- time_range_relevancy: 100.00%
Avg. case performance: 73.54 s, 4658 tokens, $0.0194 in tokens
root
- ToolRelevance: 60.33%
Avg. case performance: 4.70 s, 0 tokens
sql
- QueryKindSelection: 100.00%
- plan_correctness: 39.44%
- query_and_plan_alignment: 82.00%
- retry_efficiency: 96.30%
- sql_syntax_correctness: 96.67%
- time_range_relevancy: 92.00%
Avg. case performance: 41.41 s, 985 tokens, $0.0044 in tokens
tool_generate_hogql_query
- EmbeddingSimilarity: 77.83%
- sql_syntax_correctness: 100.00%
Avg. case performance: 2.79 s, 1414 tokens, $0.0031 in tokens
trends
- QueryKindSelection: 47.83%
- plan_correctness: 90.42%
- query_and_plan_alignment: 65.65%
- time_range_relevancy: 96.74%
Avg. case performance: 51.45 s, 5743 tokens, $0.0124 in tokens
ui_context_actions
- ToolRelevance: 68.53%
Avg. case performance: 2.84 s, 0 tokens
ui_context_events
- ToolRelevance: 67.52%
Avg. case performance: 2.53 s, 0 tokens
Triggered by this commit.
AI eval results
Evaluated 9 experiments, comprising 25 metrics.
funnel
- QueryKindSelection: 74.47%, -25.53% versus baseline (master) (improvements: 0, regressions: 7)
- plan_correctness: 66.00%, -22.25% versus baseline (master) (improvements: 3, regressions: 13)
- query_and_plan_alignment: 82.45%, -12.55% versus baseline (master) (improvements: 6, regressions: 7)
- time_range_relevancy: 95.85%, +5.02% versus baseline (master) (improvements: 4, regressions: 2)
Avg. case performance: 75.84 s, 6225 tokens, $0.0159 in tokens
memory
- ToolRelevance: 96.17%, +4.55% versus baseline (master) (improvements: 1, regressions: 2)
- memory_content_relevance: 88.57%, +3.33% versus baseline (master) (improvements: 1, regressions: 1)
Avg. case performance: 4.92 s, 1301 tokens, $0.0036 in tokens
retention
- QueryKindSelection: 50.00%, -50.00% versus baseline (master) (improvements: 0, regressions: 3)
- plan_correctness: 24.33%, -46.67% versus baseline (master) (improvements: 0, regressions: 5)
- query_and_plan_alignment: 90.83%, +31.90% versus baseline (master) (improvements: 3, regressions: 0)
- time_range_relevancy: 100.00%, +3.57% versus baseline (master) (improvements: 0, regressions: 0)
Avg. case performance: 56.36 s, 2738 tokens, $0.0158 in tokens
root
- ToolRelevance: 60.78%, -14.82% versus baseline (master) (improvements: 0, regressions: 3)
Avg. case performance: 10.04 s, 0 tokens
sql
- QueryKindSelection: 100.00%, +10.53% versus baseline (master) (improvements: 1, regressions: 0)
- plan_correctness: 35.37%, -16.23% versus baseline (master) (improvements: 2, regressions: 3)
- query_and_plan_alignment: 78.57%, +13.31% versus baseline (master) (improvements: 5, regressions: 0)
- retry_efficiency: 93.83%, +3.16% versus baseline (master) (improvements: 2, regressions: 1)
- sql_syntax_correctness: 100.00%, +2.94% versus baseline (master) (improvements: 0, regressions: 0)
- time_range_relevancy: 93.93%, +4.45% versus baseline (master) (improvements: 3, regressions: 0)
Avg. case performance: 47.01 s, 1124 tokens, $0.0050 in tokens
tool_generate_hogql_query
- EmbeddingSimilarity: 83.40%, +11.23% versus baseline (master) (improvements: 2, regressions: 0)
- sql_syntax_correctness: 100.00%, ±0.00% versus baseline (master) (improvements: 0, regressions: 0)
Avg. case performance: 4.47 s, 1679 tokens, $0.0038 in tokens
trends
- QueryKindSelection: 66.67%, -24.64% versus baseline (master) (improvements: 1, regressions: 5)
- plan_correctness: 94.38%, +8.96% versus baseline (master) (improvements: 3, regressions: 1)
- query_and_plan_alignment: 77.08%, +0.56% versus baseline (master) (improvements: 4, regressions: 3)
- time_range_relevancy: 94.79%, -1.95% versus baseline (master) (improvements: 1, regressions: 1)
Avg. case performance: 51.22 s, 7708 tokens, $0.0159 in tokens
ui_context_actions
- ToolRelevance: 57.48%, +10.57% versus baseline (master) (improvements: 1, regressions: 1)
Avg. case performance: 3.28 s, 0 tokens
ui_context_events
- ToolRelevance: 67.57%, -4.27% versus baseline (master) (improvements: 1, regressions: 2)
Avg. case performance: 3.69 s, 0 tokens
Triggered by this commit.