posthog icon indicating copy to clipboard operation
posthog copied to clipboard

feat: Support nullable materialized columns using native types

Open tkaemming opened this issue 1 year ago β€’ 1 comments

Problem

Properties that are materialized as columns with the current implementation cannot distinguish between null values and the string value "null" and require processing when reading to lossily convert these values to nulls: https://github.com/PostHog/posthog/blob/cf39f4b37c0a5bf4734a7942ffdf7549f019f847/posthog/hogql/printer.py#L1350-L1354

Needing to transform these values when reading makes using data skipping indexes more difficult as the analyzer will not use the indexes for these expressions.

Additionally, property values are materialized as columns using the output of JSONExtractRaw with leading and trailing quotes stripped which can result in unnecessarily escaped values being contained within the stored.

Changes

Adds the ability to materialize properties into Nullable(String) types.

This uses the JSONExtract(column, key, 'Nullable(String)') form to avoid escaping errors:

SELECT JSONExtract('{"foo": "\\"hello\\""}', 'foo', 'Nullable(String)')

   β”Œβ”€JSONExtract('{"foo": "\\"hello\\""}', 'foo', 'Nullable(String)')─┐
1. β”‚ "hello"                                                          β”‚
   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Compare to:

SELECT replaceRegexpAll(JSONExtractRaw('{"foo": "\\"hello\\""}', 'foo'), '^"|"$', '')

   β”Œβ”€replaceRegexpAll(JSONExtractRaw('{"foo": "\\"hello\\""}', 'foo'), '^"|"$', '')─┐
1. β”‚ \"hello\"                                                                      β”‚
   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Nullable columns are only used within HogQL queries as the change could affect existing legacy queries in undefined ways due to the format change as well as the introduction as null values.

Does this work well for both Cloud and self-hosted?

Yes.

How did you test this code?

Added tests, updated existing tests, ran materialize_columns task locally and verified result.

tkaemming avatar Sep 30 '24 22:09 tkaemming

This PR hasn't seen activity in a week! Should it be merged, closed, or further worked on? If you want to keep it open, post a comment or remove the stale label – otherwise this will be closed in another week. If you want to permanentely keep it open, use the waiting label.

posthog-bot avatar Oct 17 '24 07:10 posthog-bot

This PR hasn't seen activity in a week! Should it be merged, closed, or further worked on? If you want to keep it open, post a comment or remove the stale label – otherwise this will be closed in another week. If you want to permanentely keep it open, use the waiting label.

posthog-bot avatar Oct 31 '24 07:10 posthog-bot

This is likely going to conflict pretty badly with #26068 - that change should take priority over this one.

tkaemming avatar Nov 08 '24 18:11 tkaemming

Not including these extra changes for now, see https://github.com/PostHog/posthog/pull/26158#issuecomment-2474505713


Planned changes:

  • [ ] Rebase on master once #26068 has been merged.
  • [ ] Update property groups to use a shared alias column containing a map of all properties to avoid JSON parsing multiple times: https://github.com/PostHog/posthog/pull/26158
  • [ ] Materialize (new, nullable) properties out of the ephemeral map rather than individual JSONExtract expressions.

tkaemming avatar Nov 12 '24 18:11 tkaemming

This PR hasn't seen activity in a week! Should it be merged, closed, or further worked on? If you want to keep it open, post a comment or remove the stale label – otherwise this will be closed in another week. If you want to permanentely keep it open, use the waiting label.

posthog-bot avatar Nov 21 '24 07:11 posthog-bot

Rewriting this on top of #26414 which includes many improvements to managing materialization.

tkaemming avatar Nov 26 '24 23:11 tkaemming