telemetry-analysis-service

Count all the things

robhudson opened this issue 7 years ago · 1 comment

Instrument the code (either with statsd calls that we can send to Datadog, or with log items that will land in ES/Kibana) so we can generate dashboards and possibly notifications.
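As a rough illustration of the statsd-style instrumentation proposed here, the sketch below uses a minimal in-memory stand-in for a statsd client (the real service would use an actual statsd/Datadog client; the call sites and the `Metrics` class are hypothetical, though the metric names come from the list below):

```python
from collections import defaultdict

class Metrics:
    """Minimal in-memory stand-in for a statsd client (incr/timing only).
    A real deployment would use the statsd or datadog package instead."""

    def __init__(self):
        self.counters = defaultdict(int)
        self.timers = defaultdict(list)

    def incr(self, name, value=1):
        # statsd counter: how often something happened
        self.counters[name] += value

    def timing(self, name, ms):
        # statsd timer: how long something took, in milliseconds
        self.timers[name].append(ms)

metrics = Metrics()

# Hypothetical call sites, mirroring the proposed metric names:
metrics.incr("cluster-ready")                     # a cluster spun up successfully
metrics.incr("cluster-extension")                 # its lifetime was extended
metrics.timing("cluster-time-to-ready", 412000)   # ms from creation to readiness
```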

This is a follow-up to #418.

List of possible metrics (area of concern: CLUSTER = on-demand clusters, SPARKJOB = scheduled Spark jobs):

  • [x] cluster-normalized-instance-hours: Normalized instance hours of clusters (time between creation and finish multiplied by cluster size)
  • [x] cluster-ready: Number of on-demand clusters spun up successfully (to see trends in usage)
  • [x] cluster-extension: Number of cluster lifetime extensions
  • [ ] CLUSTER / SPARKJOB Number of AWS API error responses and which kind (e.g. throttling exception)
  • [ ] CLUSTER / SPARKJOB Number of Python errors/exceptions via Sentry (to see code regressions)
  • [ ] CLUSTER / SPARKJOB Number of bootstrapping failures during cluster start up (to track issues with EMR bootstrap script)
  • [x] cluster-time-to-ready / sparkjob-time-to-ready: Time between cluster creation (for both scheduled Spark jobs and on-demand clusters) and its readiness to process the first step (the "bootstrapping time" from the user perspective)
  • [x] cluster-emr-version / sparkjob-emr-version: EMR version used for cluster
  • [x] sparkjob-run-time: the time between the cluster's readiness to process the first step and the time when the cluster is shut down (the "runtime of the notebook code" from the user perspective)
  • [x] sparkjob-normalized-instance-hours: Normalized instance hours of scheduled jobs
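The two normalized-instance-hours metrics above reduce to a small calculation. A sketch, using the issue's own definition (time between creation and finish multiplied by cluster size); rounding partial hours up is an assumption here, mirroring EMR's per-started-hour billing, and note that EMR's own "normalized instance hours" additionally weight each node by instance type, which this ignores:

```python
import math
from datetime import datetime

def normalized_instance_hours(created_at, finished_at, cluster_size):
    """Hours between cluster creation and finish, partial hours rounded
    up (assumption: EMR bills per started hour), times cluster size."""
    elapsed_seconds = (finished_at - created_at).total_seconds()
    return math.ceil(elapsed_seconds / 3600) * cluster_size

# e.g. a 5-node cluster that ran for 90 minutes counts as 2 hours x 5 nodes
hours = normalized_instance_hours(
    datetime(2017, 6, 1, 12, 0),
    datetime(2017, 6, 1, 13, 30),
    cluster_size=5,
)
```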

robhudson · Jun 02 '17 22:06

For the Python errors, we discussed this in IRC:

13:49 <•jezdez> so raven has processors: https://github.com/getsentry/raven-python/blob/master/raven/processors.py
13:49 <•jezdez> which are called whenever an error happens
13:49 <•jezdez> I think we could have one that listens for botocore exceptions and we can record them
13:50 <•jezdez> you can configure the sentry client with the processors to use our custom processor
13:50 <•jezdez> https://docs.sentry.io/clients/python/advanced/#client-arguments
13:52 <•jezdez> that would work for both celery and wsgi
13:52 <•jezdez> since the processors are called for either
13:53 <•jezdez> you'd have to be careful with database transactions during the calls
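The processor idea from the IRC discussion could look roughly like the class below. It is a standalone sketch: in the real service it would subclass `raven.processors.Processor` and be registered via the client's `processors` argument (per the Sentry client docs linked above), and the class name and counting strategy are assumptions, not the project's actual code:

```python
from collections import Counter

class BotoErrorCounter:
    """Sketch of a raven processor that counts botocore exceptions.
    In the real service this would subclass raven.processors.Processor
    and be listed in the Sentry client's `processors` setting."""

    def __init__(self, client=None):
        # raven instantiates processors with the client; unused here
        self.seen = Counter()

    def process(self, data, **kwargs):
        # raven calls process() with the event payload whenever an error
        # happens; exception entries live under data['exception']['values'],
        # each with a 'type' and 'module'. botocore errors typically surface
        # as ClientError from the botocore.exceptions module.
        for exc in data.get("exception", {}).get("values", []):
            if (exc.get("module") or "").startswith("botocore"):
                self.seen[exc.get("type", "Unknown")] += 1
        # processors must return the (possibly modified) payload
        return data
```

A `process()` hook like this runs for both Celery and WSGI errors, as noted above, so one processor covers both paths; the counter here could just as well emit a statsd metric instead of accumulating in memory.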

robhudson · Sep 25 '17 21:09