feat: Support nullable materialized columns using native types
Problem
Properties that are materialized as columns with the current implementation cannot distinguish between null values and the string value "null" and require processing when reading to lossily convert these values to nulls: https://github.com/PostHog/posthog/blob/cf39f4b37c0a5bf4734a7942ffdf7549f019f847/posthog/hogql/printer.py#L1350-L1354
Needing to transform these values when reading makes using data skipping indexes more difficult as the analyzer will not use the indexes for these expressions.
Additionally, property values are materialized as columns using the output of JSONExtractRaw with leading and trailing quotes stripped which can result in unnecessarily escaped values being contained within the stored.
Changes
Adds the ability to materialize properties into Nullable(String) types.
This uses the JSONExtract(column, key, 'Nullable(String)') form to avoid escaping errors:
SELECT JSONExtract('{"foo": "\\"hello\\""}', 'foo', 'Nullable(String)')
ββJSONExtract('{"foo": "\\"hello\\""}', 'foo', 'Nullable(String)')ββ
1. β "hello" β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Compare to:
SELECT replaceRegexpAll(JSONExtractRaw('{"foo": "\\"hello\\""}', 'foo'), '^"|"$', '')
ββreplaceRegexpAll(JSONExtractRaw('{"foo": "\\"hello\\""}', 'foo'), '^"|"$', '')ββ
1. β \"hello\" β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Nullable columns are only used within HogQL queries as the change could affect existing legacy queries in undefined ways due to the format change as well as the introduction as null values.
Does this work well for both Cloud and self-hosted?
Yes.
How did you test this code?
Added tests, updated existing tests, ran materialize_columns task locally and verified result.
This PR hasn't seen activity in a week! Should it be merged, closed, or further worked on? If you want to keep it open, post a comment or remove the stale label β otherwise this will be closed in another week. If you want to permanentely keep it open, use the waiting label.
This PR hasn't seen activity in a week! Should it be merged, closed, or further worked on? If you want to keep it open, post a comment or remove the stale label β otherwise this will be closed in another week. If you want to permanentely keep it open, use the waiting label.
This is likely going to conflict pretty badly with #26068 - that change should take priority over this one.
Not including these extra changes for now, see https://github.com/PostHog/posthog/pull/26158#issuecomment-2474505713
Planned changes:
- [ ] Rebase on master once #26068 has been merged.
- [ ] Update property groups to use a shared alias column containing a map of all properties to avoid JSON parsing multiple times: https://github.com/PostHog/posthog/pull/26158
- [ ] Materialize (new, nullable) properties out of the ephemeral map rather than individual
JSONExtractexpressions.
This PR hasn't seen activity in a week! Should it be merged, closed, or further worked on? If you want to keep it open, post a comment or remove the stale label β otherwise this will be closed in another week. If you want to permanentely keep it open, use the waiting label.
Rewriting this on top of #26414 which includes many improvements to managing materialization.