posthog feat(max): Query planning using Claude interleaved thinking

Problem

Currently, Max chooses the kind of query to generate based only on a general query description, before it gets to consider the actual logic of the query:

This means Max trips up on questions that aren't answerable with trends/funnels/retention due to the limitations of those insights. What it should be doing is using SQL as a fallback when necessary.

Changes

Re-architecture.

Instead of a separate planner node for each insight type: a shared QueryPlannerNode.
Instead of insight type decided by RootNode: QueryPlannerNode decides, knowing the full JSON schemas of each insight type other than SQL.
Instead of ReAct: interleaved thinking, along with an upgrade to Claude Sonnet 4.
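
As a rough illustration of the flow described above (not the code in this PR), a single planner step could receive the JSON schemas of the non-SQL insight types and return both the chosen query kind and a plan. The names `PlannerDecision` and `build_planner_prompt`, the prompt wording, and the stub schemas are all hypothetical.

```python
import json
from dataclasses import dataclass
from typing import Literal

QueryKind = Literal["trends", "funnel", "retention", "sql"]

# Stub schemas for illustration only; the real planner would see the full JSON schemas.
INSIGHT_SCHEMAS: dict[str, dict] = {
    "trends": {"type": "object", "title": "trends insight"},
    "funnel": {"type": "object", "title": "funnel insight"},
    "retention": {"type": "object", "title": "retention insight"},
}


@dataclass
class PlannerDecision:
    """Hypothetical planner output: the query kind is decided together with the plan."""

    query_kind: QueryKind
    plan: str


def build_planner_prompt(question: str, schemas: dict[str, dict]) -> str:
    """Assemble one prompt that lists every insight schema and names SQL as the fallback."""
    sections = [
        f"<{kind}_schema>\n{json.dumps(schema, indent=2)}\n</{kind}_schema>"
        for kind, schema in schemas.items()
    ]
    return "\n\n".join(
        [
            "Pick one query kind: " + ", ".join([*schemas, "sql"]) + ".",
            "Use sql only when no insight schema can express the question.",
            *sections,
            f"Question: {question}",
        ]
    )
```

The point of the sketch is only that the kind is chosen with the schemas in view, rather than upstream from a short description.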

Did you write or update any docs for this change?

[ ] I've added or updated the docs
[ ] I've reached out for help from the docs team
[x] No docs needed for this change

How did you test this code?

Since #32751, evals also score choice of query kind – we should have no regression in the existing dataset.

Jun 06 '25 08:06 Twixes

📸 UI snapshots have been updated

1 snapshot changes in total. 0 added, 1 modified, 0 deleted:

chromium: 0 added, 1 modified, 0 deleted (diff for shard 3)
webkit: 0 added, 0 modified, 0 deleted

Triggered by this commit.

👉 Review this PR's diff of snapshots.

Jun 06 '25 13:06 posthog-bot

📸 UI snapshots have been updated

1 snapshot changes in total. 0 added, 1 modified, 0 deleted:

[!CAUTION]

Detected flapping snapshots

These snapshots have auto-updated more than once since the last human commit:

scenes-app-experiments--experiment-with-funnel-metric--light.png (chromium, shard 3)

The flippy-flappies are deadly and must be fixed ASAP. They're productivity killers. Run pnpm storybook locally and make the fix now. (Often, the cause is ResizeObserver being used instead of the better CSS container queries.)

chromium: 0 added, 1 modified, 0 deleted (diff for shard 3)
webkit: 0 added, 0 modified, 0 deleted

Triggered by this commit.

👉 Review this PR's diff of snapshots.

Jun 06 '25 13:06 posthog-bot

Size Change: 0 B

Total Size: 2.61 MB

Filename	Size
`frontend/dist/toolbar.js`	2.61 MB

Jun 11 '25 08:06 github-actions[bot]

@Twixes The new approach clearly works better with local data. Below are some critical thoughts and suggestions for improvements. Definitely not opposed to shipping it, so feel free to ignore them.

My main concerns are the limits, generation price, and latency:

We're still on 200k TPM. Let's reach out to AWS first to see if we can get more. My attempts with Anthropic were unsuccessful.
The cost per turn is now around 13 cents (with cache), not including DWH, vs 3 cents before. We can shave tokens by transforming JSON schemas to a prompt (or reusing previous prompts) and querying/searching DWH on demand if we know the DWH schema is large.
The latency is ~14-20s vs 4 seconds before. We should evaluate whether extended thinking on every turn significantly improves quality.

Potentially, the last two points are optimizable by parallel function calling, as you mentioned, but runs still get pretty expensive.
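
To put the cost and latency figures above in proportion, here is the arithmetic on the quoted per-turn numbers only (13 vs 3 cents, 14-20 s vs 4 s). The 1,000-turn volume is an arbitrary illustration, not a usage figure.

```python
# Per-turn figures quoted in the comment above.
OLD_COST_CENTS, NEW_COST_CENTS = 3, 13
OLD_LATENCY_S = 4
NEW_LATENCY_RANGE_S = (14, 20)

cost_ratio = NEW_COST_CENTS / OLD_COST_CENTS
latency_ratios = tuple(s / OLD_LATENCY_S for s in NEW_LATENCY_RANGE_S)
# Hypothetical volume, used only to show how the difference scales.
extra_dollars_per_1000_turns = (NEW_COST_CENTS - OLD_COST_CENTS) * 1000 / 100

print(f"cost: {cost_ratio:.1f}x")  # cost: 4.3x
print(f"latency: {latency_ratios[0]:.1f}x to {latency_ratios[1]:.1f}x")  # latency: 3.5x to 5.0x
print(f"extra per 1000 turns: ${extra_dollars_per_1000_turns:.0f}")  # extra per 1000 turns: $100
```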

Some broader thoughts

Lack of evals

This change is very difficult to evaluate without a baseline.

Schemas in the prompt

The schemas in the prompt feel suboptimal. It is hard to scale to more insight types, along with the event/action/cohort taxonomy and the DWH schema. Additionally, since we included schemas in the prompt, we could just use them as tools, removing the generator step. I guess it should work roughly the same. The schema generator was separated to shave tokens in the first place.

I think we can do a better job by writing prompts for insight types. I'm not sure that the schemas provide more value than the prompt.

Router

We should try to achieve better selection through a fast router model (for example, Gemini Flash). We have never really tried to do this, but the insight selection logic can be described:

PostHog-native/Data Warehouse data, supported aggregation, and supported visualizations -> Insights.
PostHog-native/Data Warehouse data, unsupported aggregation, and supported visualizations -> SQL.
Otherwise, ask the user for guidance.
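
The three rules above can be written down as a small decision function. This is a minimal sketch: in practice the three booleans would be produced by the router model, and the names `Route` and `route` are hypothetical.

```python
from enum import Enum


class Route(Enum):
    INSIGHTS = "insights"
    SQL = "sql"
    ASK_USER = "ask_user"


def route(known_data_source: bool, supported_aggregation: bool, supported_visualization: bool) -> Route:
    """Apply the selection rules: insights first, SQL as fallback, otherwise ask."""
    if known_data_source and supported_visualization:
        # PostHog-native or Data Warehouse data with a supported visualization:
        # the aggregation decides between an insight and SQL.
        return Route.INSIGHTS if supported_aggregation else Route.SQL
    # Unknown data source or unsupported visualization: ask the user for guidance.
    return Route.ASK_USER
```

For example, `route(True, False, True)` returns `Route.SQL`, which is the trends-instead-of-SQL case a router would need to get right.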

Existing eval cases don't seem uncrackable for the router. In case the taxonomy agent fails, we can fall back to the router, but the only case I can now see is a wrong selection of the insight type (for example, trends instead of SQL). This is worth trying.

Lightweight updates

In traces, I frequently find queries that don't require taxonomy lookups but rather aggregation/cosmetic/logic changes, which are easy to do using plans. This makes me think that maybe we need to include taxonomy in the prompt for the initial generation but move it out to a tool in the subsequent steps. The same applies to the DWH schema. I haven't explored this in detail yet, and the new planner has not solved it.
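
One way to picture the idea is a per-turn configuration that inlines the taxonomy only for the first generation and exposes it as a tool afterwards. This is an unexplored sketch under that assumption; `TurnConfig`, `build_turn_config`, `BASE_PROMPT`, and the `retrieve_taxonomy` tool name are hypothetical.

```python
from dataclasses import dataclass, field

BASE_PROMPT = "You are a query planner."  # placeholder prompt text


@dataclass
class TurnConfig:
    system_prompt: str
    tool_names: list[str] = field(default_factory=list)


def build_turn_config(is_initial_generation: bool, taxonomy: str) -> TurnConfig:
    if is_initial_generation:
        # First generation: the taxonomy is almost always needed, so inline it.
        return TurnConfig(system_prompt=f"{BASE_PROMPT}\n\n<taxonomy>\n{taxonomy}\n</taxonomy>")
    # Follow-up edits (aggregation/cosmetic/logic changes): keep the prompt small
    # and let the model fetch taxonomy on demand through a tool.
    return TurnConfig(system_prompt=BASE_PROMPT, tool_names=["retrieve_taxonomy"])
```

The same split would apply to the DWH schema.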

Suggestions

I would drop the final_answer tool in favor of the last message. It should generally reduce latency and increase accuracy, and we don't need a structured output for the plan.
Include the DWH schema in the query planner prompt. The prompt mentions that it is included but wasn't for me.
"ULTRA THINK".
Maybe a "think" tool instead of extended thinking on every turn.
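
For reference, a "think" tool is usually just a no-op tool whose only argument is the thought itself, so the model can choose when to reason instead of paying for it on every turn. A minimal sketch, using the `name`/`description`/`input_schema` shape of Anthropic tool definitions; the description text and the `handle_think` handler are assumptions, not code from this PR.

```python
THINK_TOOL = {
    "name": "think",
    "description": "Use this to reason about a hard step before acting. It has no side effects.",
    "input_schema": {
        "type": "object",
        "properties": {
            "thought": {"type": "string", "description": "Free-form reasoning."},
        },
        "required": ["thought"],
    },
}


def handle_think(tool_input: dict) -> str:
    """No-op handler: the value is in the model writing the thought, not in the result."""
    if not isinstance(tool_input.get("thought"), str):
        raise ValueError("think tool requires a string 'thought'")
    return "OK"
```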

Jun 13 '25 16:06 skoob13

🧠 AI eval results

Evaluated 6 experiments, comprising 3 metrics.

funnel

Avg. case performance: ⏱️ 30.28 s, 🔢 0 tokens

memory

🆕 ToolRelevance: 97.87%
🆕 memory_content_relevance: 88.57%

Avg. case performance: ⏱️ 6.58 s, 🔢 1212 tokens, 💵 $0.0033 in tokens

retention

Avg. case performance: ⏱️ 9.09 s, 🔢 0 tokens

root

🆕 ToolRelevance: 61.63%

Avg. case performance: ⏱️ 5.20 s, 🔢 0 tokens

sql

Avg. case performance: ⏱️ 4.35 s, 🔢 0 tokens

trends

Avg. case performance: ⏱️ 9.88 s, 🔢 0 tokens

Triggered by this commit.

Jun 14 '25 17:06 posthog-bot

🧠 AI eval results

Evaluated 6 experiments, comprising 3 metrics.

funnel

Avg. case performance: ⏱️ 29.06 s, 🔢 0 tokens

memory

🆕 ToolRelevance: 98.96%
🆕 memory_content_relevance: 95.71%

Avg. case performance: ⏱️ 5.77 s, 🔢 1214 tokens, 💵 $0.0034 in tokens

retention

Avg. case performance: ⏱️ 9.14 s, 🔢 0 tokens

root

🆕 ToolRelevance: 66.08%

Avg. case performance: ⏱️ 3.87 s, 🔢 0 tokens

sql

Avg. case performance: ⏱️ 3.16 s, 🔢 0 tokens

trends

Avg. case performance: ⏱️ 11.86 s, 🔢 0 tokens

Triggered by this commit.

Jun 16 '25 18:06 posthog-bot

Update after various tweaks:

Tried generating insights straight with the query planner, but found the results to be lower quality than with the plan as an intermediary step. ❌
Went for o4-mini for speed, and added extraction of reasoning headlines for the Max UI. The results seem quite good. Slower than ReAct, but better. ✅ For comparison, the ReAct version doesn't "consider" whether that sidebar activation event is what the user meant, resulting in low-quality output:

Jun 20 '25 15:06 Twixes

📸 UI snapshots have been updated

18 snapshot changes in total. 0 added, 18 modified, 0 deleted:

chromium: 0 added, 18 modified, 0 deleted (diff for shard 12)
webkit: 0 added, 0 modified, 0 deleted

Triggered by this commit.

👉 Review this PR's diff of snapshots.

Jun 25 '25 18:06 posthog-bot

📸 UI snapshots have been updated

104 snapshot changes in total. 0 added, 104 modified, 0 deleted:

chromium: 0 added, 104 modified, 0 deleted (diff for shard 11, diff for shard 2, diff for shard 1)
webkit: 0 added, 0 modified, 0 deleted

Triggered by this commit.

👉 Review this PR's diff of snapshots.

Jun 25 '25 21:06 posthog-bot

🧠 AI eval results

Evaluated 9 experiments, comprising 9 metrics.

funnel

🔴 plan_correctness: 0.00%, -90.58% versus baseline (master) (improvements: 0, regressions: 2)

Avg. case performance: ⏱️ 33.08 s, 🔢 0 tokens

memory

🔵 ToolRelevance: 98.17%, -0.05% versus baseline (master) (improvements: 1, regressions: 2)
🔴 memory_content_relevance: 90.00%, -1.43% versus baseline (master) (improvements: 0, regressions: 1)

Avg. case performance: ⏱️ 6.01 s, 🔢 1217 tokens, 💵 $0.0034 in tokens

retention

Avg. case performance: ⏱️ 9.47 s, 🔢 0 tokens

root

🟢 ToolRelevance: 61.74%, +6.15% versus baseline (master) (improvements: 2, regressions: 1)

Avg. case performance: ⏱️ 5.37 s, 🔢 0 tokens

sql

🔴 plan_correctness: 0.00%, -46.19% versus baseline (master) (improvements: 0, regressions: 2)

Avg. case performance: ⏱️ 14.70 s, 🔢 0 tokens

tool_generate_hogql_query

🔴 EmbeddingSimilarity: 75.75%, -5.68% versus baseline (master) (improvements: 0, regressions: 1)
🔵 sql_syntax_correctness: 100.00%, ±0.00% versus baseline (master) (improvements: 0, regressions: 0)

Avg. case performance: ⏱️ 1.55 s, 🔢 1365 tokens, 💵 $0.0030 in tokens

trends

Avg. case performance: ⏱️ 12.30 s, 🔢 0 tokens

ui_context_actions

🔴 ToolRelevance: 68.35%, -25.90% versus baseline (master) (improvements: 0, regressions: 2)

Avg. case performance: ⏱️ 3.13 s, 🔢 0 tokens

ui_context_events

🔴 ToolRelevance: 67.31%, -15.36% versus baseline (master) (improvements: 1, regressions: 2)

Avg. case performance: ⏱️ 2.95 s, 🔢 0 tokens

Triggered by this commit.

Jun 26 '25 23:06 posthog-bot

📸 UI snapshots have been updated

36 snapshot changes in total. 0 added, 36 modified, 0 deleted:

chromium: 0 added, 36 modified, 0 deleted (diff for shard 5, diff for shard 12, diff for shard 9, diff for shard 11, diff for shard 8, diff for shard 16, diff for shard 2, diff for shard 1)
webkit: 0 added, 0 modified, 0 deleted

Triggered by this commit.

👉 Review this PR's diff of snapshots.

Jun 27 '25 10:06 posthog-bot

🧠 AI eval results

Evaluated 9 experiments, comprising 9 metrics.

funnel

🆕 plan_correctness: 0.00%

Avg. case performance: ⏱️ 30.51 s, 🔢 0 tokens

memory

🔵 ToolRelevance: 98.22%, +0.00% versus baseline (master) (improvements: 1, regressions: 2)
🔴 memory_content_relevance: 88.57%, -2.86% versus baseline (master) (improvements: 0, regressions: 1)

Avg. case performance: ⏱️ 6.93 s, 🔢 1218 tokens, 💵 $0.0034 in tokens

retention

Avg. case performance: ⏱️ 9.38 s, 🔢 0 tokens

root

🟢 ToolRelevance: 61.90%, +6.31% versus baseline (master) (improvements: 2, regressions: 1)

Avg. case performance: ⏱️ 6.03 s, 🔢 0 tokens

sql

🔴 plan_correctness: 0.00%, -46.19% versus baseline (master) (improvements: 0, regressions: 2)

Avg. case performance: ⏱️ 13.50 s, 🔢 0 tokens

tool_generate_hogql_query

🔵 EmbeddingSimilarity: 81.42%, ±0.00% versus baseline (master) (improvements: 0, regressions: 0)
🔵 sql_syntax_correctness: 100.00%, ±0.00% versus baseline (master) (improvements: 0, regressions: 0)

Avg. case performance: ⏱️ 1.42 s, 🔢 1349 tokens, 💵 $0.0029 in tokens

trends

Avg. case performance: ⏱️ 13.69 s, 🔢 0 tokens

ui_context_actions

🔴 ToolRelevance: 68.09%, -26.16% versus baseline (master) (improvements: 0, regressions: 2)

Avg. case performance: ⏱️ 3.48 s, 🔢 0 tokens

ui_context_events

🔴 ToolRelevance: 67.40%, -15.27% versus baseline (master) (improvements: 1, regressions: 2)

Avg. case performance: ⏱️ 3.21 s, 🔢 0 tokens

Triggered by this commit.

Jun 27 '25 10:06 posthog-bot

🧠 AI eval results

Evaluated 9 experiments, comprising 25 metrics.

funnel

🆕 QueryKindSelection: 73.33%
🆕 plan_correctness: 60.58%
🆕 query_and_plan_alignment: 81.11%
🆕 time_range_relevancy: 95.22%

Avg. case performance: ⏱️ 188.89 s, 🔢 4941 tokens, 💵 $0.0137 in tokens

memory

🆕 ToolRelevance: 98.13%
🆕 memory_content_relevance: 91.43%

Avg. case performance: ⏱️ 5.69 s, 🔢 1214 tokens, 💵 $0.0034 in tokens

retention

🆕 QueryKindSelection: 66.67%
🆕 plan_correctness: 33.33%
🆕 query_and_plan_alignment: 75.83%
🆕 time_range_relevancy: 100.00%

Avg. case performance: ⏱️ 46.41 s, 🔢 3421 tokens, 💵 $0.0188 in tokens

root

🆕 ToolRelevance: 61.93%

Avg. case performance: ⏱️ 5.29 s, 🔢 0 tokens

sql

🟢 QueryKindSelection: 100.00%, +16.67% versus baseline (master) (improvements: 1, regressions: 0)
🔴 plan_correctness: 42.41%, -4.26% versus baseline (master) (improvements: 4, regressions: 3)
🟢 query_and_plan_alignment: 89.67%, +24.67% versus baseline (master) (improvements: 3, regressions: 0)
🟢 retry_efficiency: 95.06%, +2.47% versus baseline (master) (improvements: 2, regressions: 1)
🔵 sql_syntax_correctness: 100.00%, ±0.00% versus baseline (master) (improvements: 0, regressions: 0)
🔴 time_range_relevancy: 96.67%, -3.33% versus baseline (master) (improvements: 0, regressions: 0)

Avg. case performance: ⏱️ 51.40 s, 🔢 962 tokens, 💵 $0.0041 in tokens

tool_generate_hogql_query

🆕 EmbeddingSimilarity: 78.25%
🆕 sql_syntax_correctness: 100.00%

Avg. case performance: ⏱️ 1.65 s, 🔢 1411 tokens, 💵 $0.0031 in tokens

trends

🆕 QueryKindSelection: 80.00%
🆕 plan_correctness: 77.71%
🆕 query_and_plan_alignment: 81.00%
🆕 time_range_relevancy: 96.25%

Avg. case performance: ⏱️ 67.84 s, 🔢 10240 tokens, 💵 $0.0220 in tokens

ui_context_actions

🆕 ToolRelevance: 68.01%

Avg. case performance: ⏱️ 4.36 s, 🔢 0 tokens

ui_context_events

🆕 ToolRelevance: 67.33%

Avg. case performance: ⏱️ 3.07 s, 🔢 0 tokens

Triggered by this commit.

Jul 08 '25 19:07 posthog-bot

🧠 AI eval results

Evaluated 9 experiments, comprising 25 metrics.

funnel

🆕 QueryKindSelection: 76.00%
🆕 plan_correctness: 69.92%
🆕 query_and_plan_alignment: 82.90%
🆕 time_range_relevancy: 96.50%

Avg. case performance: ⏱️ 60.54 s, 🔢 6239 tokens, 💵 $0.0156 in tokens

memory

🆕 ToolRelevance: 88.20%
🆕 memory_content_relevance: 80.48%

Avg. case performance: ⏱️ 5.07 s, 🔢 1302 tokens, 💵 $0.0036 in tokens

retention

🆕 QueryKindSelection: 75.00%
🆕 plan_correctness: 11.67%
🆕 query_and_plan_alignment: 100.00%
🆕 time_range_relevancy: 100.00%

Avg. case performance: ⏱️ 47.47 s, 🔢 2841 tokens, 💵 $0.0234 in tokens

root

🆕 ToolRelevance: 60.58%

Avg. case performance: ⏱️ 6.60 s, 🔢 0 tokens

sql

🆕 QueryKindSelection: 100.00%
🆕 plan_correctness: 41.48%
🆕 query_and_plan_alignment: 82.00%
🆕 retry_efficiency: 98.77%
🆕 sql_syntax_correctness: 100.00%
🆕 time_range_relevancy: 94.33%

Avg. case performance: ⏱️ 39.94 s, 🔢 797 tokens, 💵 $0.0034 in tokens

tool_generate_hogql_query

🆕 EmbeddingSimilarity: 72.54%
🆕 sql_syntax_correctness: 100.00%

Avg. case performance: ⏱️ 7.55 s, 🔢 1438 tokens, 💵 $0.0033 in tokens

trends

🆕 QueryKindSelection: 43.48%
🆕 plan_correctness: 90.83%
🆕 query_and_plan_alignment: 64.57%
🆕 time_range_relevancy: 96.74%

Avg. case performance: ⏱️ 45.41 s, 🔢 4945 tokens, 💵 $0.0108 in tokens

ui_context_actions

🆕 ToolRelevance: 56.84%

Avg. case performance: ⏱️ 8.67 s, 🔢 0 tokens

ui_context_events

🆕 ToolRelevance: 67.38%

Avg. case performance: ⏱️ 7.45 s, 🔢 0 tokens

Triggered by this commit.

Jul 15 '25 15:07 posthog-bot

This issue has 2052 words at 13 comments. Issues this long are hard to read or contribute to, and tend to take very long to reach a conclusion. Instead, why not:

Write some code and submit a pull request! Code wins arguments
Have a sync meeting to reach a conclusion
Create a Request for Comments and submit a PR with it to the meta repo or product internal repo

Is this issue intended to be sprawling? Consider adding label epic or sprint to indicate this.

Jul 15 '25 15:07 posthog-contributions-bot[bot]

🧠 AI eval results

Evaluated 9 experiments, comprising 25 metrics.

funnel

🆕 QueryKindSelection: 68.00%
🆕 plan_correctness: 71.58%
🆕 query_and_plan_alignment: 77.50%
🆕 time_range_relevancy: 95.00%

Avg. case performance: ⏱️ 59.62 s, 🔢 5837 tokens, 💵 $0.0151 in tokens

memory

🆕 ToolRelevance: 91.56%
🆕 memory_content_relevance: 85.24%

Avg. case performance: ⏱️ 12.04 s, 🔢 1303 tokens, 💵 $0.0036 in tokens

retention

🆕 QueryKindSelection: 85.71%
🆕 plan_correctness: 37.67%
🆕 query_and_plan_alignment: 77.14%
🆕 time_range_relevancy: 100.00%

Avg. case performance: ⏱️ 73.54 s, 🔢 4658 tokens, 💵 $0.0194 in tokens

root

🆕 ToolRelevance: 60.33%

Avg. case performance: ⏱️ 4.70 s, 🔢 0 tokens

sql

🆕 QueryKindSelection: 100.00%
🆕 plan_correctness: 39.44%
🆕 query_and_plan_alignment: 82.00%
🆕 retry_efficiency: 96.30%
🆕 sql_syntax_correctness: 96.67%
🆕 time_range_relevancy: 92.00%

Avg. case performance: ⏱️ 41.41 s, 🔢 985 tokens, 💵 $0.0044 in tokens

tool_generate_hogql_query

🆕 EmbeddingSimilarity: 77.83%
🆕 sql_syntax_correctness: 100.00%

Avg. case performance: ⏱️ 2.79 s, 🔢 1414 tokens, 💵 $0.0031 in tokens

trends

🆕 QueryKindSelection: 47.83%
🆕 plan_correctness: 90.42%
🆕 query_and_plan_alignment: 65.65%
🆕 time_range_relevancy: 96.74%

Avg. case performance: ⏱️ 51.45 s, 🔢 5743 tokens, 💵 $0.0124 in tokens

ui_context_actions

🆕 ToolRelevance: 68.53%

Avg. case performance: ⏱️ 2.84 s, 🔢 0 tokens

ui_context_events

🆕 ToolRelevance: 67.52%

Avg. case performance: ⏱️ 2.53 s, 🔢 0 tokens

Triggered by this commit.

Jul 15 '25 16:07 posthog-bot

🧠 AI eval results

Evaluated 9 experiments, comprising 25 metrics.

funnel

🔴 QueryKindSelection: 74.47%, -25.53% versus baseline (master) (improvements: 0, regressions: 7)
🔴 plan_correctness: 66.00%, -22.25% versus baseline (master) (improvements: 3, regressions: 13)
🔴 query_and_plan_alignment: 82.45%, -12.55% versus baseline (master) (improvements: 6, regressions: 7)
🟢 time_range_relevancy: 95.85%, +5.02% versus baseline (master) (improvements: 4, regressions: 2)

Avg. case performance: ⏱️ 75.84 s, 🔢 6225 tokens, 💵 $0.0159 in tokens

memory

🟢 ToolRelevance: 96.17%, +4.55% versus baseline (master) (improvements: 1, regressions: 2)
🟢 memory_content_relevance: 88.57%, +3.33% versus baseline (master) (improvements: 1, regressions: 1)

Avg. case performance: ⏱️ 4.92 s, 🔢 1301 tokens, 💵 $0.0036 in tokens

retention

🔴 QueryKindSelection: 50.00%, -50.00% versus baseline (master) (improvements: 0, regressions: 3)
🔴 plan_correctness: 24.33%, -46.67% versus baseline (master) (improvements: 0, regressions: 5)
🟢 query_and_plan_alignment: 90.83%, +31.90% versus baseline (master) (improvements: 3, regressions: 0)
🟢 time_range_relevancy: 100.00%, +3.57% versus baseline (master) (improvements: 0, regressions: 0)

Avg. case performance: ⏱️ 56.36 s, 🔢 2738 tokens, 💵 $0.0158 in tokens

root

🔴 ToolRelevance: 60.78%, -14.82% versus baseline (master) (improvements: 0, regressions: 3)

Avg. case performance: ⏱️ 10.04 s, 🔢 0 tokens

sql

🟢 QueryKindSelection: 100.00%, +10.53% versus baseline (master) (improvements: 1, regressions: 0)
🔴 plan_correctness: 35.37%, -16.23% versus baseline (master) (improvements: 2, regressions: 3)
🟢 query_and_plan_alignment: 78.57%, +13.31% versus baseline (master) (improvements: 5, regressions: 0)
🟢 retry_efficiency: 93.83%, +3.16% versus baseline (master) (improvements: 2, regressions: 1)
🟢 sql_syntax_correctness: 100.00%, +2.94% versus baseline (master) (improvements: 0, regressions: 0)
🟢 time_range_relevancy: 93.93%, +4.45% versus baseline (master) (improvements: 3, regressions: 0)

Avg. case performance: ⏱️ 47.01 s, 🔢 1124 tokens, 💵 $0.0050 in tokens

tool_generate_hogql_query

🟢 EmbeddingSimilarity: 83.40%, +11.23% versus baseline (master) (improvements: 2, regressions: 0)
🔵 sql_syntax_correctness: 100.00%, ±0.00% versus baseline (master) (improvements: 0, regressions: 0)

Avg. case performance: ⏱️ 4.47 s, 🔢 1679 tokens, 💵 $0.0038 in tokens

trends

🔴 QueryKindSelection: 66.67%, -24.64% versus baseline (master) (improvements: 1, regressions: 5)
🟢 plan_correctness: 94.38%, +8.96% versus baseline (master) (improvements: 3, regressions: 1)
🔵 query_and_plan_alignment: 77.08%, +0.56% versus baseline (master) (improvements: 4, regressions: 3)
🔴 time_range_relevancy: 94.79%, -1.95% versus baseline (master) (improvements: 1, regressions: 1)

Avg. case performance: ⏱️ 51.22 s, 🔢 7708 tokens, 💵 $0.0159 in tokens

ui_context_actions

🟢 ToolRelevance: 57.48%, +10.57% versus baseline (master) (improvements: 1, regressions: 1)

Avg. case performance: ⏱️ 3.28 s, 🔢 0 tokens

ui_context_events

🔴 ToolRelevance: 67.57%, -4.27% versus baseline (master) (improvements: 1, regressions: 2)

Avg. case performance: ⏱️ 3.69 s, 🔢 0 tokens

Triggered by this commit.

Jul 16 '25 16:07 posthog-bot