
Clarify the semantics of "query timeouts"

Open macobo opened this issue 3 years ago • 7 comments

Background context

When talking about query performance, a topic that frequently comes up is "timeouts". However, I argue that the current system of timeouts is haphazard and ill-specified.

We should clarify what we want out of "timeouts" in our product and how to action them.

Notes on the existing system

  • We've hard-coded max_execution_time=180 into our settings on US and EU cloud: 1, 2.
    • This setting means that as soon as ClickHouse estimates that a query will take longer than 180s, it raises an error. Note that this estimate can misfire (and historically has).
  • On self-hosted instances this same setting is unset.
  • The same setting is applied to both cache refreshes in celery and user interactions.
  • In the backend we have a 15-second timer for every ClickHouse SQL query (not the same as an API endpoint load time) that emits a metric when a query takes longer than 15 seconds (but not if it's killed earlier).
  • We've historically had issues with load balancer/gateway configurations: if a user's query takes longer than 60 seconds, the API call fails and users see a "something went wrong" error. This also happens on self-hosted instances.
  • In the frontend, 15 seconds after an API call to insights starts, we show a "things are a little slow" animation.
  • In the frontend, if the query fails after 15 seconds, we sometimes show "There was an error completing this query". In theory there's also a "Your query took too long to complete" response if the response code was not 500, but I didn't manage to hit this case locally - I believe it depends on the gateway behavior described above.
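To make the first bullet concrete, here is a minimal sketch (not PostHog's actual code - the helper name and defaults are hypothetical) of how a per-query `max_execution_time` setting is typically attached to a ClickHouse query via the `clickhouse-driver` `settings` argument:

```python
# Hypothetical helper: build the per-query settings dict that would be
# passed to clickhouse-driver's Client.execute(..., settings=...).
def clickhouse_settings(is_cloud: bool, max_execution_time: int = 180) -> dict:
    """With max_execution_time set, ClickHouse aborts the query once it
    estimates the total runtime will exceed the limit - which is why the
    estimate can misfire and kill queries that would have finished in time.
    """
    settings = {}
    if is_cloud:
        settings["max_execution_time"] = max_execution_time
    return settings

# Shape of the call site (requires a running ClickHouse server):
# client.execute(sql, params, settings=clickhouse_settings(is_cloud=True))
```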

Which of these is a "timeout"? Should we throw all of this out and rethink it all from scratch? I propose so.

Additional context

Not proposing a solution now - this is enough of a mess.

cc @lharries and @pauldambra - we need to clarify all of these semantics before we settle on any OKRs.

macobo avatar Dec 06 '22 09:12 macobo

Good question - I think we should define a timeout from the user side as this gives us maximum flexibility for the solution

So explicitly: a timeout is when a user starts a query but doesn't get a result back, either because the query is too slow or because ClickHouse cancels it believing it will be too slow. The user very likely ends up seeing "There was an error completing this query"

If we don't already differentiate whether a query failed because it was too slow vs. other potential causes of failure, we'd likely want to track both - but for this OKR, only focus on the failures due to things being too slow
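One way to differentiate slowness failures from other errors is to bucket them by the ClickHouse server error code attached to the exception (e.g. `ServerException.code` in `clickhouse-driver`). The classifier function below is a hypothetical sketch; the two error codes are from ClickHouse's error-code list:

```python
# ClickHouse server error codes relevant to timeouts:
TIMEOUT_EXCEEDED = 159  # elapsed time exceeded max_execution_time
TOO_SLOW = 160          # estimated total time exceeded max_execution_time

def failure_kind(clickhouse_error_code: int) -> str:
    """Bucket a failed query for metrics: 'timeout' vs 'other'."""
    if clickhouse_error_code in (TIMEOUT_EXCEEDED, TOO_SLOW):
        return "timeout"
    return "other"
```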

lharries avatar Dec 06 '22 15:12 lharries

On a side note: a quick win I've seen is that if a user sees "There was an error completing this query", we should likely guide them on how to reduce the query's complexity to speed it up

In reproducing the timeout, I was able to "fix it" by shortening the time range from "All time" to "past 180 days" - it's likely many users wouldn't consider this, e.g. because they aren't familiar with SQL or aren't thinking about the mechanics under the hood

lharries avatar Dec 06 '22 15:12 lharries

@macobo how are we doing with deciding on this OKR? Would be great to get this finalized by EOD tomorrow (Thursday) if feasible

lharries avatar Dec 07 '22 22:12 lharries

Discussed sync and the conclusion: "Aligned that timeouts are very important and a cause of frustration - but we don't need to have this at an OKR level as will take a sprint to properly set up all the analytics for it"

lharries avatar Dec 08 '22 10:12 lharries

@macobo up to you if you want to keep this issue open or close it

lharries avatar Dec 08 '22 10:12 lharries

I'll keep this open - we should definitely fix the problems listed here in the upcoming sprints. I'd love to get help on it from someone more UI/UX-focused than I am, though.

macobo avatar Dec 08 '22 10:12 macobo

Raw notes from chat with Paul:

Proposed query timeout semantics (cloud):
- We return a special HTTP error code to the frontend if a ClickHouse timeout occurs:
    - `524 A Timeout Occurred`, similar to Cloudflare
- We set `timeout_before_checking_execution_speed` on all queries, since ClickHouse can mis-estimate query execution times
- For user-initiated queries, we:
    - timeout at 180s
    - run them with the default (max) number of cores
    - set `timeout_before_checking_execution_speed` to 30 seconds
- For cache refreshes, we:
    - timeout at 9 minutes
    - run the queries using 1/3rd of the max number of cores
    - set `timeout_before_checking_execution_speed` to 90s
- On the frontend:
    - React to 524, 408, and 504 errors with timeout screens. Track the count of each
    - Show the slow-query warning after 15 seconds (if a timeout screen hasn't been shown yet)

On self-hosted:
- Make all of this configurable.
- Default to no automatic timeouts, max thread configurations, etc.
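The cloud semantics above could be sketched as per-workload ClickHouse settings. This is a hypothetical illustration, not PostHog's implementation - the function name and the thread arithmetic are assumptions; the numbers come straight from the notes:

```python
def query_settings(workload: str, max_threads_available: int = 16) -> dict:
    """Build ClickHouse settings for the two workloads in the proposal:
    user-initiated queries vs. background cache refreshes."""
    if workload == "user":
        return {
            "max_execution_time": 180,  # 3 minutes
            # Let the query run 30s before ClickHouse estimates total
            # runtime, since its estimates can misfire early on.
            "timeout_before_checking_execution_speed": 30,
            # Threads left at the default (max) number of cores.
        }
    if workload == "cache_refresh":
        return {
            "max_execution_time": 9 * 60,  # 9 minutes
            "timeout_before_checking_execution_speed": 90,
            "max_threads": max_threads_available // 3,  # 1/3rd of cores
        }
    raise ValueError(f"unknown workload: {workload}")
```

On self-hosted, per the notes, all of these values would be configurable and default to no automatic timeouts.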

macobo avatar Jan 09 '23 13:01 macobo