
refactor(active-users): wip refactor to active users query to remove CROSS JOIN


Putting this up to show what I've been working on, although to be fair it feels like I may have hit a dead end.

I broke down the active users query and saw that the intuition behind it was: grab all relevant events, grab all relevant buckets, cross join them, and then check each (event, bucket) pair to determine whether the event falls within that bucket's window.

@macobo and I also determined in a pairing session that the bulk of the time spent in this query is not in transferring data across shards, so the JOIN is the likely suspect.

As such, I flipped it on its head, thinking it might be a worthwhile approach to get the relevant events and derive the buckets from them, rather than generating the buckets separately.

This seemed promising: I managed to get correct results on a query pulled from <our_biggest_customer>, and the performance was significantly better:

[screenshot: before vs. after query performance, 2023-01-10]

(almost 2x faster, at the expense of a bit more memory usage)

Here are the queries:

Before
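-- Old shape: generate every hourly bucket in the range independently, CROSS JOIN
-- them with the deduplicated (event timestamp, actor) pairs, then keep only the
-- pairs whose event falls inside each bucket's trailing 7-day window.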
SELECT groupArray(timestamp), groupArray(counts) FROM (
    SELECT 
        d.timestamp,
        COUNT(DISTINCT actor_id) AS counts
    FROM (
            SELECT toStartOfHour(
                    toDateTime('2023-01-10 12:54:58', 'UTC') - toIntervalHour(number)
                ) AS timestamp
            FROM numbers(
                    dateDiff(
                        'hour',
                        toStartOfHour(toDateTime('2022-12-27 00:00:00', 'UTC')),
                        toDateTime('2023-01-10 12:54:58', 'UTC')
                    )
                )
        ) d
        CROSS JOIN (
            SELECT toTimeZone(toDateTime(timestamp, 'UTC'), 'UTC') AS timestamp,
                e.distinct_id AS actor_id
            FROM events e
            WHERE team_id = <largest_customer_id>
                AND event = '$pageview'
                AND toDateTime(timestamp, 'UTC') >= toDateTime('2022-12-27 00:00:00', 'UTC')
                AND toDateTime(timestamp, 'UTC') <= toDateTime('2023-01-10 12:54:58', 'UTC')
                AND notEmpty(e.person_id)
            GROUP BY timestamp,
                actor_id
        ) e
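    -- an actor counts for bucket d if they have an event in (d - 6 days, d + 1 day]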
    WHERE e.timestamp <= d.timestamp + INTERVAL 1 DAY
        AND e.timestamp > d.timestamp - INTERVAL 6 DAY
    GROUP BY d.timestamp
    ORDER BY d.timestamp
)
After
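-- New shape: skip the separate bucket generation and the CROSS JOIN; expand each
-- event directly into the hourly buckets it contributes to and aggregate per bucket.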
SELECT groupArray(bucket), groupArray(counts) FROM (
    SELECT 
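        -- expand each event into every hourly bucket it can count towards:
        -- from one day before the event's hour through six days after it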
        arrayJoin(timeSlots(
            toDateTime(ts - INTERVAL 1 DAY), 
            toUInt32(3600 * 24 * 7), 
            3600
        )) as bucket, 
        count(DISTINCT actor_id) AS counts
    FROM (
        SELECT 
            distinct_id AS actor_id,
            toStartOfHour(toTimeZone(toDateTime(timestamp, 'UTC'), 'UTC')) as ts
        FROM events e
        WHERE 
            team_id = <largest_customer_id>
            AND event = '$pageview'
            AND toDateTime(timestamp, 'UTC') >= (toDateTime('2022-12-27 00:00:00', 'UTC'))
            AND toDateTime(timestamp, 'UTC') <= toDateTime('2023-01-10 12:54:58', 'UTC')
            AND notEmpty(e.person_id)
    )
    WHERE bucket >= toDateTime('2022-12-27 00:00:00', 'UTC') AND bucket < toDateTime('2023-01-10 12:54:58', 'UTC')
    GROUP BY bucket
    ORDER BY bucket
)

I'm struggling to make this work in the codebase though, so I'll revisit with fresh eyes tomorrow and see whether I just got lucky and subsequently rabbit-holed.

yakkomajuri avatar Jan 10 '23 19:01 yakkomajuri

A lot of progress! Query is simpler now and most tests pass (compared to none earlier!)

Seems something is off for WAU but all MAU tests are passing. Will dig into it.

yakkomajuri avatar Jan 11 '23 16:01 yakkomajuri

Ok I know what the problem is - just need to find a way to fix it

yakkomajuri avatar Jan 11 '23 17:01 yakkomajuri

So there are a few tricky things here which I'll be working on today.

Mostly it's about rounding to the nearest week/month. Deriving buckets from the events rather than generating them independently introduces a few complications in this rounding step.

One particularly tricky one is that "by hour" actually looks at data at hour granularity, whereas "by week"/"by month" simply pick a day and then work on a day basis. This is probably confusing to users when grouping by month. Either way, working on it.
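As a toy illustration of the rounding involved (my own example, not the code in this branch): the hourly interval buckets an event by its hour, whereas the week/month intervals effectively bucket by a representative day.

SELECT
    toStartOfHour(ts)  AS hour_bucket,   -- hour granularity
    toStartOfDay(ts)   AS day_bucket,    -- the day granularity week/month work on
    toStartOfWeek(ts)  AS week_bucket,   -- representative day for the week
    toStartOfMonth(ts) AS month_bucket   -- representative day for the month
FROM (SELECT toDateTime('2023-01-10 14:23:00', 'UTC') AS ts)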

yakkomajuri avatar Jan 12 '23 13:01 yakkomajuri

Good and bad news about the query exactly as is

The good news is that this is potentially up to 10x faster for most cases (while using less memory). Here's WAUs over 30 days with daily interval for our largest customer:

[screenshot: before vs. after timings, WAU over 30 days with daily interval, 2023-01-12]

The bad news is that memory usage grows much more steeply, so we exceed memory limits when there are a lot of buckets (I tried with 720 buckets, i.e. 30 days at hourly granularity) for our largest customer:

[screenshot: memory limit exceeded at 720 hourly buckets, 2023-01-12]

There are some clear optimizations to use a lot less memory, which I had already noted in code comments, at the expense of a bit more query complexity. Will try those out and see the impact. Failing that, this might be prohibitively expensive.
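For context on where the extra memory plausibly comes from (my reading of the query, not something I've profiled): the arrayJoin fans every event row out into one row per hourly bucket in its window before the per-bucket distinct count runs, roughly 7 * 24 rows per event for hourly WAU:

-- number of rows each event expands into for a 7-day window at hourly granularity
SELECT length(timeSlots(toStartOfHour(now()) - INTERVAL 1 DAY, toUInt32(3600 * 24 * 7), 3600)) AS rows_per_event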

yakkomajuri avatar Jan 12 '23 18:01 yakkomajuri

Although, to be fair, the 720-bucket case sometimes doesn't even complete with the old query:

[screenshot: old query failing to complete at 720 buckets, 2023-01-12]

yakkomajuri avatar Jan 12 '23 19:01 yakkomajuri

Did some benchmarking and not sure this query will cut it.

Its speed and memory usage depend both on the number of buckets (which we have more control over, see e.g. #13736) and on event volume. Given that, it's quite fast for "average" cases, but doesn't improve things much at the margin.

I ran the query with different numbers of buckets for our biggest customer using the $pageview event (high volume, but not the highest they have). Lastly, I ran with an average number of buckets for their highest-volume event.

The results were roughly as follows (I didn't do anything super formal as I kind of lost faith halfway through):

| Insight description | Memory usage diff | Query speed diff |
| --- | --- | --- |
| $pageview, 30 buckets | ~2x more | ~8-10x faster |
| $pageview, 60 buckets | ~1.4x more | ~10x faster |
| $pageview, 180 buckets | ~3.5x more | ~3x faster |
| $pageview, 360 buckets | ~7.5x more | ~7.5x faster |
| $pageview, 720 buckets | ~30x more | ~1.2x faster |
| custom high-volume event, 60 buckets | ~7x more | Same |

Quick reference on buckets:

  • 30 buckets = 30 days, grouped by day
  • 720 buckets = 30 days, grouped by hour ...

Things are promising, but given that this introduces different failure modes and doesn't improve things where we need it most, I'd hold off for now. It has also become fairly complex in its own right, whereas the initial goal was to simplify the original query.

There are further optimizations that could be done here, like auditing all the timestamp conversions, and maybe using a UInt16 in the array and only converting to DateTime later. There might also be a simpler way to infer buckets, which I want to think on.
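An untested sketch of what the UInt16 idea could look like (timestamps hard-coded for illustration, not how it would land in the branch): carry the bucket through the arrayJoin as a 2-byte hour offset from the range start and rebuild the 4-byte DateTime only once per output row.

WITH
    toDateTime('2022-12-27 00:00:00', 'UTC') AS range_start,
    toDateTime('2023-01-10 14:00:00', 'UTC') AS ts  -- stands in for one event's hour
SELECT
    range_start + toIntervalHour(
        arrayJoin(
            arrayMap(
                slot -> toUInt16(dateDiff('hour', range_start, slot)),
                -- same 7-day window as the "after" query, clipped to the range start
                arrayFilter(
                    slot -> slot >= range_start,
                    timeSlots(toDateTime(ts - INTERVAL 1 DAY), toUInt32(3600 * 24 * 7), 3600)
                )
            )
        )
    ) AS bucket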

For now I think I'll shelve this though, get a few quick wins elsewhere, get myself motivated :D and maybe revisit in the near future.

[screenshots: benchmark runs, 2023-01-16]

yakkomajuri avatar Jan 16 '23 19:01 yakkomajuri

Given that the memory trade-off is not that bad and this is a win in many cases, I'd encourage getting this merged rather than leaving an open PR.

Note you didn't check how this behaves with page cache vs without.

macobo avatar Jan 17 '23 07:01 macobo

Final benchmark clearing page cache each time.

[screenshot: final benchmark comparison, 2023-01-17]

Above, green means the new query is better, red means it's worse, and yellow means it's better but wouldn't actually complete due to memory limits.

yakkomajuri avatar Jan 17 '23 19:01 yakkomajuri

Should I merge, @macobo?

yakkomajuri avatar Jan 18 '23 11:01 yakkomajuri