posthog icon indicating copy to clipboard operation
posthog copied to clipboard

feat(hogql): support CTEs in UNION ALL

Open nikitaevg opened this issue 1 year ago • 5 comments

Problem

https://github.com/PostHog/posthog/issues/20663

JFYI, ClickHouse doesn't allow one to use CTEs from the first statement if a statement has its own CTE.

This doesn't work in CH

WITH cte1 AS (SELECT 1 AS a)
SELECT 1 AS a
UNION ALL
WITH cte2 AS (SELECT 1 AS a)
SELECT * FROM cte1

Changes

It adds CTEs from the initial statement to all statements in UNION ALL

Does this work well for both Cloud and self-hosted?

Yes

How did you test this code?

Automated tests

nikitaevg avatar Apr 27 '24 16:04 nikitaevg

🔍 Existing Issues For Review

Your pull request is modifying functions with the following pre-existing issues:

📄 File: posthog/hogql/resolver.py

Function Unhandled Issue
visit TypeError: sequence item 0: expected str instance, list found posthog.tasks.tasks.proce...
Event Count: 7

Did you find this useful? React with a 👍 or 👎

sentry-io[bot] avatar Apr 27 '24 16:04 sentry-io[bot]

Hi @thmsobrmlr! PTAL

nikitaevg avatar Apr 27 '24 17:04 nikitaevg

Thanks for the review @mariusandra

Hey thanks for this! I added one comment inline with regards to some optimizations. However the bigger question is: do we even need to do this cloning? Would it be better to update the lookup functions visit_field in resolver to also "go higher" and loop through all SelectQueryType-s in self.scopes, looking for CTEs that could match?

That won't work because selects from union all are processed sequentially and not recursively. So every select is processed only with its own scope, because the scope is popped after processing. I confirmed that by debugging as well.

nikitaevg avatar Apr 30 '24 17:04 nikitaevg

That won't work because selects from union all are processed sequentially and not recursively. So every select is processed only with its own scope, because the scope is popped after processing. I confirmed that by debugging as well.

We could add the SelectUnionQueryType also onto the scope, and visit the other queries that way. I checked briefly, and it seemed most (all?) things that access the last scope will only do it in a context where the last thing will be a SelectQueryType anyway, so this probably won't require a lot of rewiring.

mariusandra avatar May 01 '24 00:05 mariusandra

We could add the SelectUnionQueryType also onto the scope, and visit the other queries that way. I checked briefly, and it seemed most (all?) things that access the last scope will only do it in a context where the last thing will be a SelectQueryType anyway, so this probably won't require a lot of rewiring.

IIUC your main concern is that I'm rewriting the query while resolving types, which is not the best way to propagate context to the downstream expressions. I agree with that.

I'm not sure adding SelectUnionQueryType to the scopes is the right approach though, IMO it complicates the scopes semantics, mostly because now we have two types in the scopes. What I could do is to add an ephemeral SelectQueryType onto the scopes. This type would only have the CTEs of the first Select, it should work out of the box (see this commit what I mean, the tests do pass). There's one downside though, the following query will work in HogQL, but it will not work in ClickHouse

WITH first_cte AS (SELECT 1 AS a)
SELECT * FROM first_cte
UNION ALL
WITH overriding_cte AS (SELECT 1 AS a)
SELECT * FROM first_cte -- note - one can't use first_cte here, because overriding_cte overrides the first_cte

That will happen because resolver searches for CTEs in all scopes. I think it's not that big of an issue, WDYT?

nikitaevg avatar May 01 '24 17:05 nikitaevg

Your approach makes sense. We're actually doing something similar with lambdas as well.

It actually hints at a refactor we haven't taken on yet: I'd argue SelectQueryType has taken on too many responsibilities, and should split part of it to a new Scope class. Then self.scopes would actually be a list[Scope].

In that future world, you'd just create a new scope for the SelectUnionQueryType, and add it to the list with the rest of them. However for now, we're forced to use SelectQueryType, as a sort of scope. A refactor for this is definitely out of scope (bad pun), so what you're proposing makes total sense.

mariusandra avatar May 02 '24 09:05 mariusandra

bad pun

Nah, made me giggle:)

I'd argue SelectQueryType has taken on too many responsibilities, and should split part of it to a new Scope class

Yeah, I agree, this class looks much more like a Scope rather than SelectQueryType atm.

Anyway, I updated the PR, please take a look!

nikitaevg avatar May 02 '24 17:05 nikitaevg