
refactor(active-users): improve performance of WAUs & MAUs query by reading less data

Open · yakkomajuri opened this issue Jan 09 '23 · 6 comments

Getting things rolling here. This is mostly implementing, testing, and benchmarking a suggestion from @macobo.

The WAUs / MAUs query (ACTIVE_USERS_SQL) seems to be a bit suboptimal in a few areas. One of those is that we're not aggregating directly in the relevant subquery (I'll look into that next). Another is that we're selecting data we don't need (and streaming it across nodes).

That was because the subquery was pulling in all distinct timestamps rather than just the day, which is all we care about. So we can make a simple change: select with day granularity immediately. This also required tweaking the intervals: before, we could have a timestamp like 2023-01-01 23:59:00 and would need to look up e.g. 6 days into the future; now that timestamp equates to 2023-01-01 00:00:00, so we need to read 7 days into the future (and none into the past).
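
Roughly, the change to the inner select looks like this (a simplified sketch with illustrative table/column names and date bounds, not the full ACTIVE_USERS_SQL):

```sql
-- before: ships every distinct (timestamp, person_id) pair across nodes
SELECT DISTINCT
    toTimeZone(toDateTime(timestamp, 'UTC'), 'US/Pacific') AS timestamp,
    person_id
FROM events
WHERE timestamp >= '2023-01-01' AND timestamp < '2023-02-01'

-- after: truncating to the day collapses each person to at most one row per day
SELECT DISTINCT
    toStartOfDay(toTimeZone(toDateTime(timestamp, 'UTC'), 'US/Pacific')) AS day,
    person_id
FROM events
WHERE timestamp >= '2023-01-01' AND timestamp < '2023-02-01'
```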

I suspect the tests should catch it if the new query doesn't behave correctly?

Nevertheless, here are my bits of investigation into why this change is worth making:

  1. I ran the query for both our team and our largest team and got identical results before and after.
  2. There was a ~30% reduction in memory usage and query duration on average:

[Screenshot 2023-01-09 at 16:15:28: before/after memory usage and query duration]

  3. The EXPLAIN output shows us the obvious: we're reading less data from other nodes:
```
# before: all unique timestamps in a day
ReadFromRemote (Read from remote replica)
    Header: toTimeZone(toDateTime(timestamp, 'UTC'), 'US/Pacific') DateTime('US/Pacific')
        person_id UUID

# after: just one timestamp per day
ReadFromRemote (Read from remote replica)
    Header: toStartOfDay(toTimeZone(toDateTime(timestamp, 'UTC'), 'US/Pacific')) DateTime('US/Pacific')
        person_id UUID
```

Submitting this to get feedback and make sure I'm doing things right, @macobo. Let me know if I'm missing anything, whether I should test it more extensively, etc.!

yakkomajuri · Jan 09 '23 19:01

Hmm, seems I may have broken the hourly view with this. I guess the middle ground would be toStartOfHour? That's if we want to keep things simple. Could also have this be dynamic...
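
If we went dynamic, a minimal sketch using ClickHouse's toStartOfInterval might look like this (the interval unit would come from the insight's granularity; names are illustrative):

```sql
SELECT DISTINCT
    toStartOfInterval(
        toTimeZone(toDateTime(timestamp, 'UTC'), 'US/Pacific'),
        INTERVAL 1 HOUR  -- or INTERVAL 1 DAY etc., per the requested granularity
    ) AS period,
    person_id
FROM events
```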

yakkomajuri · Jan 09 '23 19:01

Actually, looking more closely, I think we need to keep the intervals as they are but shift the interval range by one, and that should be what we need. The static shift I wrote in worked for the specific query I was working with but doesn't apply generally.
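
To illustrate the off-by-one with made-up values (a self-contained check, not the real query):

```sql
WITH
    toDateTime('2023-01-01 23:59:00', 'US/Pacific') AS raw_ts,
    toStartOfDay(raw_ts) AS day_ts
SELECT
    toDateTime('2023-01-07 12:00:00', 'US/Pacific') AS bucket,
    bucket <= raw_ts + INTERVAL 6 DAY AS reachable_before,    -- 1: the raw timestamp reaches this bucket
    bucket <= day_ts + INTERVAL 6 DAY AS reachable_truncated, -- 0: truncation shortens the reach
    bucket <= day_ts + INTERVAL 7 DAY AS reachable_shifted    -- 1: one extra interval restores it
```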

yakkomajuri · Jan 09 '23 19:01

Yeah, the above should be it. Need to consider how to implement it correctly, though.

yakkomajuri · Jan 09 '23 19:01

On benchmarking:

  • The result is IMO a bit inconclusive: benchmark results under 1s in ClickHouse can be unreliable.
  • It would be great if you could show that result_bytes from the other shard decreased. You can do this by setting a custom log_comment to filter on (see the sketch below).
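
Something along these lines should work (the log_comment value and cluster name are placeholders, and the count() stands in for the actual WAUs query):

```sql
-- tag the query under test
SELECT count() FROM events
SETTINGS log_comment = 'waus-less-data-test';

-- then pull its stats from every node's query log
SELECT
    hostName() AS host,
    is_initial_query,
    query_duration_ms,
    memory_usage,
    result_rows,
    result_bytes
FROM clusterAllReplicas('posthog', system.query_log)
WHERE log_comment = 'waus-less-data-test'
  AND type = 'QueryFinish'
```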

On implementation/correctness: We have a pairing slot coming up, let's look at this then!

macobo · Jan 10 '23 11:01

Yup, let's look at it then and discuss the best way to shift the interval. I have a solution, but it's a bit messy.

As for "benchmarking", I'll take a look at the other shard explicitly, although I suspect querying on all replicas should have given me that? Nevertheless, I'm less looking at the query execution speed improvement but more the memory usage which should be less flaky I believe?

yakkomajuri · Jan 10 '23 12:01

As for "benchmarking", I'll take a look at the other shard explicitly, although I suspect querying on all replicas should have given me that?

No, because the whole query doesn't get sent to the other shard, meaning your comment gets stripped out. You can check this by adding an is_initial_query column. I expect result_bytes would be in the gigabytes for the previous query.

Using log_comment bypasses this problem, as it does get sent to the other shard (see the sketch below).
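
For reference, a sketch of that check (same placeholder names as above); is_initial_query = 0 marks the fan-out queries the initiator sent to the other nodes:

```sql
SELECT
    hostName() AS host,
    is_initial_query,
    formatReadableSize(result_bytes) AS result_size
FROM clusterAllReplicas('posthog', system.query_log)
WHERE log_comment = 'waus-less-data-test'
  AND type = 'QueryFinish'
ORDER BY is_initial_query DESC
```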

macobo · Jan 10 '23 12:01