[jb] GW connections observability

Open akosyakov opened this issue 2 years ago • 2 comments

We should instrument the JB GW plugin to observe connection throughput and errors. To do that, we should add a gitpod_jb_gw_connections counter with the following labels:

  • status - exception class name as an indicator of the failure mode, i.e. e.getClass().getName(); the concrete error should be logged or pushed to analytics
  • outcome - a more coarse-grained error category that separates success, user-caused errors, and Gitpod-caused errors, i.e. success|user-failure|gitpod-failure
  • product - the kind of product, e.g. 'intellij' or 'goland'
  • qualifier - stable or latest

This measures not only the SSH connection but the whole connection operation and all of its exceptions, whether they come from bogus params passed to the connect function or from a version compatibility mismatch between client and server.
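A minimal sketch of such a counter, assuming a plain in-memory map rather than the real Prometheus client library. The class name, the "none" status for successful connections, and the rule that maps IllegalArgumentException to user-failure are all hypothetical and only illustrate how the labels fit together:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// In-memory stand-in for a labeled counter; NOT the Prometheus client API.
final class GwConnectionsCounter {
    // One series per label tuple, in the order: status, outcome, product, qualifier.
    private final Map<List<String>, LongAdder> series = new ConcurrentHashMap<>();

    void inc(String status, String outcome, String product, String qualifier) {
        series.computeIfAbsent(List.of(status, outcome, product, qualifier), k -> new LongAdder())
              .increment();
    }

    long get(String status, String outcome, String product, String qualifier) {
        LongAdder adder = series.get(List.of(status, outcome, product, qualifier));
        return adder == null ? 0 : adder.sum();
    }

    // Assumption: which exceptions count as user-caused is not defined in the issue.
    static String outcomeOf(Throwable error) {
        if (error == null) return "success";
        if (error instanceof IllegalArgumentException) return "user-failure";
        return "gitpod-failure";
    }

    // Wraps the whole connection operation so that every failure is counted,
    // not only SSH errors. The exception is rethrown unchanged.
    void observe(String product, String qualifier, Runnable connect) {
        Throwable error = null;
        try {
            connect.run();
        } catch (RuntimeException | Error e) {
            error = e;
            throw e;
        } finally {
            // Assumption: "none" is used as the status of a successful connection.
            String status = error == null ? "none" : error.getClass().getName();
            inc(status, outcomeOf(error), product, qualifier);
        }
    }

    public static void main(String[] args) {
        GwConnectionsCounter counter = new GwConnectionsCounter();
        counter.observe("intellij", "stable", () -> { });
        System.out.println(counter.get("none", "success", "intellij", "stable"));
    }
}
```

Counting in a finally block keeps the increment on a single code path, so success and every failure mode are recorded exactly once per connection attempt.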

This metric should be pushed to supervisor in production. During development we should provide an HTTP endpoint to fetch current metrics; we should not push metrics to supervisor during development!
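A rough sketch of what the development-only endpoint could look like, using the JDK's built-in com.sun.net.httpserver server. The class name, the /metrics path, and the shape of the samples map (pre-rendered label sets mapped to counts) are assumptions made for illustration, not something the plugin defines:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Supplier;

// Hypothetical development-only endpoint; names and path are assumptions.
final class DevMetricsEndpoint {
    // Renders samples in the Prometheus text exposition format.
    // Keys are pre-rendered label sets, e.g. outcome="success",product="intellij".
    static String render(Map<String, Long> samples) {
        StringBuilder out = new StringBuilder("# TYPE gitpod_jb_gw_connections_total counter\n");
        // TreeMap gives a stable, sorted output order.
        new TreeMap<>(samples).forEach((labels, value) ->
            out.append("gitpod_jb_gw_connections_total{")
               .append(labels).append("} ").append(value).append('\n'));
        return out.toString();
    }

    // Starts a loopback-only HTTP server; pass port 0 to let the OS pick a free port.
    static HttpServer start(int port, Supplier<Map<String, Long>> samples) throws IOException {
        HttpServer server = HttpServer.create(
            new InetSocketAddress(InetAddress.getLoopbackAddress(), port), 0);
        server.createContext("/metrics", exchange -> {
            byte[] body = render(samples.get()).getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().add("Content-Type", "text/plain; version=0.0.4");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
        return server;
    }
}
```

Binding to the loopback address keeps the endpoint local to the developer's machine, and the returned HttpServer can be shut down with stop(0).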

We should then add 2 graphs to Grafana:

  • throughput to understand usage: rate(gitpod_jb_gw_connections_total[2m])
  • error ratio to understand failures: sum(rate(gitpod_jb_gw_connections_total{outcome!="success"}[2m]))/sum(rate(gitpod_jb_gw_connections_total[2m]))
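As a worked illustration of the error-ratio query, the same arithmetic can be sketched over per-outcome counts for one time window. The class and method names are hypothetical, and unlike PromQL (where 0/0 yields NaN) this sketch returns 0.0 for an empty window:

```java
import java.util.Map;

// Hypothetical helper mirroring the error-ratio query on plain counts.
final class ErrorRatio {
    // Fraction of connections whose outcome is anything other than "success".
    static double of(Map<String, Long> countsByOutcome) {
        long total = 0;
        long failures = 0;
        for (Map.Entry<String, Long> entry : countsByOutcome.entrySet()) {
            total += entry.getValue();
            if (!"success".equals(entry.getKey())) {
                failures += entry.getValue();
            }
        }
        // Assumption: an empty window is reported as 0.0 rather than NaN.
        return total == 0 ? 0.0 : (double) failures / total;
    }

    public static void main(String[] args) {
        // 90 successes, 6 user failures, 4 Gitpod failures: 10 of 100 failed.
        System.out.println(of(Map.of("success", 90L, "user-failure", 6L, "gitpod-failure", 4L)));
    }
}
```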

After that we should establish a baseline of failures and build an alert based on it. Additionally, we should investigate existing failures with logs and analytics and bring them down.

akosyakov · Jun 14 '22 12:06

We discussed it again, and using a Prometheus pushgateway on supervisor is not an option, since it is itself a point of failure in this scenario. Instead we could add a pushgateway to the IDE proxy, but since we don't have one yet, let's just use Mixpanel.

So we need to add a jb_gw_connection event to Mixpanel with the following properties:

  • we should use anonymousId for identity, and it should be a stable machineId that does not change between restarts of GW
    • we don't use userId because we should start tracking when ConnectionProvider is called, even before auth happens
  • status - exception class name as an indicator of the failure mode, i.e. e.getClass().getName()
  • reason - the concrete exception message
  • outcome - a more coarse-grained error category that separates success, user-caused errors, and Gitpod-caused errors, i.e. success|user-failure|gitpod-failure
  • duration - how long it takes to connect or fail
  • gitpodHost, workspaceId and instanceId to collect info about the connection request
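A minimal sketch of how the event payload could be assembled, assuming a plain map of properties that is handed to whatever analytics client the plugin uses. No Mixpanel SDK call is shown; the class name, the millisecond unit for duration, and the choice to omit status and reason on success are assumptions:

```java
import java.time.Duration;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical builder for the jb_gw_connection event properties.
final class JbGwConnectionEvent {
    static final String NAME = "jb_gw_connection";

    static Map<String, Object> properties(String anonymousId, String gitpodHost,
                                          String workspaceId, String instanceId,
                                          Duration duration, String outcome,
                                          Throwable error) {
        Map<String, Object> props = new LinkedHashMap<>();
        // Stable machine id, available even before authentication happens.
        props.put("anonymousId", anonymousId);
        props.put("gitpodHost", gitpodHost);
        props.put("workspaceId", workspaceId);
        props.put("instanceId", instanceId);
        // Assumption: duration is reported in milliseconds.
        props.put("duration", duration.toMillis());
        props.put("outcome", outcome);
        if (error != null) {
            // Assumption: status and reason are only sent for failed connections.
            props.put("status", error.getClass().getName());
            props.put("reason", String.valueOf(error.getMessage()));
        }
        return props;
    }

    public static void main(String[] args) {
        // All identifiers below are placeholder values.
        System.out.println(properties("machine-123", "gitpod.io", "ws-1", "inst-1",
            Duration.ofMillis(1500), "gitpod-failure",
            new IllegalStateException("version mismatch")));
    }
}
```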

akosyakov · Jun 16 '22 08:06

We discussed with @felladrin to start very simple, with one jb_gw_connection event that carries no status, reason, outcome, or duration, but only connection metadata to indicate the start of the connection. We can correlate this event in analytics with jb_session using instance_id.

akosyakov · Jul 05 '22 13:07

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] · Nov 09 '22 07:11