charts-clickhouse
Telemetry improvements
Proposed change
We currently generate some telemetry from Helm install/update operations to help us better understand trends and app/k8s/chart release versions in use.
I’m opening this issue to improve the current setup and raise both the quality and the quantity of the signals we get from those events.
Here is a list of some improvements I have in mind:
- leverage Helm pre- and post- hooks to track
  - when an installation/update starts but doesn’t end successfully (KR: update/install success/failure rate)
  - how long does the operation take (KR: p50/p75/p99 duration time)
  - ...
- make this event telemetry collection optional (and add a note to README.md)
- ….
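To make the two proposed KRs concrete, here is a minimal Python sketch of how they could be computed once the pre- and post- hooks emit paired events. The `HookEvent` shape, the `operation_id` field and the `"start"`/`"end"` phase names are assumptions for illustration only; they are not the events this chart sends today.

```python
import math
from dataclasses import dataclass


@dataclass
class HookEvent:
    operation_id: str  # hypothetical id shared by the pre- and post- hook events
    phase: str         # "start" (pre- hook) or "end" (post- hook); assumed names
    timestamp: float   # seconds since epoch


def summarize(events):
    """Return (success_rate, sorted_durations) for a list of HookEvent."""
    starts, ends = {}, {}
    for event in events:
        target = starts if event.phase == "start" else ends
        target[event.operation_id] = event.timestamp
    # An operation counts as successful only if its start has a matching end.
    durations = sorted(ends[op] - starts[op] for op in starts if op in ends)
    success_rate = len(durations) / len(starts) if starts else None
    return success_rate, durations


def percentile(sorted_values, p):
    """Nearest-rank percentile of an already sorted list (p in 0..100)."""
    if not sorted_values:
        return None
    rank = max(1, math.ceil(p / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


events = [
    HookEvent("a", "start", 0), HookEvent("a", "end", 10),
    HookEvent("b", "start", 0), HookEvent("b", "end", 30),
    HookEvent("c", "start", 5),  # never finished
    HookEvent("d", "start", 0), HookEvent("d", "end", 20),
]
rate, durations = summarize(events)
p50, p75, p99 = (percentile(durations, p) for p in (50, 75, 99))
```

Note that a start without an end is counted as a failure here, although it may simply still be running; a real query would probably need to ignore very recent starts.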
Alternative options
Do nothing
Additional context
See helm_install events in PostHog & HELM_INSTALL_INFO in this repo
@tiina303 you shared an idea for an interim metric (using existing telemetry) around the success rate of installs... I wonder if you could share it here too?
It was during the product exercise "How many people fail to deploy a self-hosted instance?" https://app.posthog.com/insights/rqjfOxEj
Idea: look at a funnel from the helm install (tied to a hostname) to the organization status report for that hostname. That way we can use unique instances, yay for group analytics. Problems:
- the organization report is sent daily, so I need some way to limit the first query to only show installs older than 1 or 2 days, while keeping the second one able to show later events too. This is actually a problem for all conversions (e.g. the install might have started just 2 seconds ago and hasn't ended yet), though shorter time intervals are much less impacted.
- people might have uninstalled in the middle & might have needed multiple installs. On the other hand that would be good to measure as well, so we could have a funnel:
- start install
- end install
- org status report
- org status report on day 5 (did they keep it running)?
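The funnel above could be sketched roughly as follows in Python, counting unique instances by hostname. The event names (`install_start`, `install_end`, `org_status_report`), the tuple layout and the `min_age_days` cutoff are assumptions for illustration; the cutoff is one possible way to handle the "older than 1 or 2 days" problem, and only the first install per hostname is considered.

```python
from collections import defaultdict

DAY = 86400  # seconds


def funnel(events, now, min_age_days=2):
    """Count instances reaching each funnel step.

    events: iterable of (hostname, event_name, timestamp) tuples.
    Returns [started, ended, reported, reported_on_day_5_or_later].
    """
    by_host = defaultdict(list)
    for host, name, ts in events:
        by_host[host].append((ts, name))

    counts = [0, 0, 0, 0]
    for evs in by_host.values():
        evs.sort()
        start = next((ts for ts, n in evs if n == "install_start"), None)
        # Skip installs too recent to have produced a daily report yet.
        if start is None or now - start < min_age_days * DAY:
            continue
        counts[0] += 1
        end = next((ts for ts, n in evs
                    if n == "install_end" and ts >= start), None)
        if end is None:
            continue
        counts[1] += 1
        reports = [ts for ts, n in evs
                   if n == "org_status_report" and ts >= end]
        if not reports:
            continue
        counts[2] += 1
        # "Day 5" is interpreted as a report 5 or more days after the start.
        if any(ts - start >= 5 * DAY for ts in reports):
            counts[3] += 1
    return counts


now = 10 * DAY
events = [
    ("a", "install_start", 0), ("a", "install_end", 60),
    ("a", "org_status_report", DAY), ("a", "org_status_report", 6 * DAY),
    ("b", "install_start", DAY), ("b", "install_end", DAY + 60),
    ("b", "org_status_report", 2 * DAY),
    ("c", "install_start", 2 * DAY),
    ("d", "install_start", 9.5 * DAY),  # too recent, excluded
]
result = funnel(events, now)
```

With a 2-day cutoff, instances younger than 5 days cannot reach the last step yet, so that step would likely need its own, larger cutoff.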
Thanks @tiina303.
I was wondering if this retention view is potentially another good way to measure it (in the interim), given that it enforces the time between helm install and org status report.
Maybe we should just start from the org status report & check the retention https://app.posthog.com/insights/wct4ybhr & keep an eye on it. It looks pretty good at the moment, but if we see bigger drops in the first weeks or anything else odd we might want to jump in and address it. Looks like "relative to previous period" might be broken (https://github.com/PostHog/posthog/issues/8366), and that would potentially be a better one to use.
One small tweak is to aggregate by instance (since the original query will also include cloud organizations). Yep, I agree it looks pretty good (especially on a monthly basis).