charts-clickhouse

Telemetry improvements

Open guidoiaquinti opened this issue 3 years ago • 5 comments

Proposed change

We currently generate some telemetry from Helm install/update operations to help us better understand trends and app/k8s/chart release versions in use.

I’m opening this issue to improve the current setup in order to raise quality and quantity of the signals we get from those events.

Here is a list of some improvements I have in mind:

  1. leverage Helm pre- and post-hooks to track:

    1. when an installation/update starts but doesn’t end successfully (KR: update/install success/failure rate)
    2. how long the operation takes (KR: p50/p75/p99 duration)
    3. ...
  2. make this event telemetry collection optional (and add a note to README.md)

  3. ….
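As a rough sketch of how the two KRs in item 1 could be derived, assume a pre-hook emits a "start" event and a post-hook emits an "end" event that share an operation ID. The `HookEvent` shape, the `operation_id` field, and the phase names are hypothetical and are not the current telemetry schema.

```python
import math
from dataclasses import dataclass


@dataclass
class HookEvent:
    operation_id: str  # hypothetical ID shared by the pre- and post-hook of one run
    phase: str         # "start" (pre-hook) or "end" (post-hook)
    timestamp: float   # seconds since epoch


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(events):
    """Success rate and duration percentiles from paired start/end events."""
    starts = {e.operation_id: e.timestamp for e in events if e.phase == "start"}
    ends = {e.operation_id: e.timestamp for e in events if e.phase == "end"}
    # An operation counts as completed only if both events were received.
    durations = [ends[op] - starts[op] for op in starts if op in ends]
    summary = {
        "started": len(starts),
        "completed": len(durations),
        "success_rate": len(durations) / len(starts) if starts else None,
    }
    for pct in (50, 75, 99):
        summary[f"p{pct}"] = percentile(durations, pct) if durations else None
    return summary
```

A start without a matching end is counted as not completed here, which also covers operations that are still running, so very recent starts would need to be excluded before reading the rate as a failure rate.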

Alternative options

Do nothing

Additional context

See helm_install events in PostHog & HELM_INSTALL_INFO in this repo

guidoiaquinti avatar Jan 27 '22 13:01 guidoiaquinti

@tiina303 you shared an idea for an interim metric (using existing telemetry) around the success rate of installs. Could you share it here too?

marcushyett-ph avatar Jan 27 '22 13:01 marcushyett-ph

It came up during the product exercise "How many people fail to deploy a self-hosted instance?": https://app.posthog.com/insights/rqjfOxEj

Idea: look at a funnel from helm install (tied to a hostname) to the organization status report for that hostname. That way we can count unique instances, yay for group analytics. Problems:

  • the organization report is sent daily, so I need some way to limit the first query to only show installs older than 1 or 2 days, but keep the second one able to show later events too. This is actually a problem for all conversions (e.g. the install might have started just 2 seconds ago and not ended yet), although shorter time intervals are much less impacted.
  • people might have uninstalled in the middle & might have needed multiple installs. On the other hand that would be good to measure as well, so we could have a funnel:
  1. start install
  2. end install
  3. org status report
  4. org status report on day 5 (did they keep it running)?
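The four-step funnel above could be computed per instance roughly as follows. This is only a sketch: the event names (`install_start`, `install_end`, `org_status_report`), the tuple layout, and the reading of "day 5" as "a status report at least 5 days after the install started" are assumptions, not the existing PostHog events.

```python
from collections import defaultdict

DAY = 86_400  # seconds


def funnel(events, now, min_age_days=2, retained_after_days=5):
    """Count instances reaching each funnel step.

    events: iterable of (hostname, event_name, timestamp) tuples.
    Returns [started, ended, reported, still_reporting_on_day_5].
    """
    by_host = defaultdict(list)
    for host, name, ts in events:
        by_host[host].append((ts, name))

    counts = [0, 0, 0, 0]
    for host_events in by_host.values():
        host_events.sort()
        start = next((ts for ts, n in host_events if n == "install_start"), None)
        # Skip instances that started too recently to have sent a daily report.
        if start is None or now - start < min_age_days * DAY:
            continue
        counts[0] += 1
        end = next(
            (ts for ts, n in host_events if n == "install_end" and ts >= start),
            None,
        )
        if end is None:
            continue
        counts[1] += 1
        reports = [
            ts for ts, n in host_events if n == "org_status_report" and ts >= end
        ]
        if not reports:
            continue
        counts[2] += 1
        if any(ts - start >= retained_after_days * DAY for ts in reports):
            counts[3] += 1
    return counts
```

The `min_age_days` cutoff addresses the daily-report delay only for the third step; instances younger than `retained_after_days` cannot reach the last step yet, so that step would need its own, longer cutoff to be read as retention.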

tiina303 avatar Jan 31 '22 14:01 tiina303

Thanks @tiina303.

I was wondering if this retention view is potentially another good way to measure it (in the interim), given it enforces the time between helm install and org status report.

marcushyett-ph avatar Jan 31 '22 15:01 marcushyett-ph

Maybe we should just start from the org status report, check the retention (https://app.posthog.com/insights/wct4ybhr) and keep an eye on it. It looks pretty good at the moment, but if we see bigger drops in the first weeks or anything else odd we might want to jump in and address it. It looks like "relative to previous period" might be broken (https://github.com/PostHog/posthog/issues/8366), and that would potentially be a better one to use.

tiina303 avatar Jan 31 '22 18:01 tiina303

One small tweak is to aggregate by instance (since the original query will also include cloud organizations). Yep I agree it looks pretty good (especially on a monthly basis):

marcushyett-ph avatar Feb 02 '22 09:02 marcushyett-ph