telemetry-analysis-service
Count all the things
Instrument the code (either with statsd calls that we can send to Datadog, or with log items that will land in ES/Kibana) so we can generate dashboards and possibly notifications.
This is a follow-up to #418.
List of possible metrics (area of concern: CLUSTER = on-demand clusters, SPARKJOB = scheduled Spark jobs):
- [x] `cluster-normalized-instance-hours`: Normalized instance hours of clusters (time between creation and finish multiplied by cluster size)
- [x] `cluster-ready`: Number of on-demand clusters spun up successfully (to see trends in usage)
- [x] `cluster-extension`: Number of cluster lifetime extensions
- [ ] CLUSTER / SPARKJOB Number of AWS API error responses and which kind (e.g. throttling exception)
- [ ] CLUSTER / SPARKJOB Number of Python errors/exceptions via Sentry (to see code regressions)
- [ ] CLUSTER / SPARKJOB Number of bootstrapping failures during cluster start up (to track issues with EMR bootstrap script)
- [x] `cluster-time-to-ready` / `sparkjob-time-to-ready`: Time between cluster creation (for both scheduled Spark jobs and on-demand clusters) and its readiness to process the first step (the "bootstrapping time" from the user perspective)
- [x] `cluster-emr-version` / `sparkjob-emr-version`: EMR version used for the cluster
- [x] `sparkjob-run-time`: Time between the cluster's readiness to process the first step and the time when the cluster is shut down (the "runtime of the notebook code" from the user perspective)
- [x] `sparkjob-normalized-instance-hours`: Normalized instance hours of scheduled jobs
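
The duration-based metrics in the list could be derived from three timestamps per cluster. The following is a minimal sketch, not the project's actual code: the names `created_at`, `ready_at`, `finished_at` and all helper functions are hypothetical, and it assumes the plain "elapsed time multiplied by cluster size" definition given above rather than any instance-type weighting.

```python
from datetime import datetime


def hours_between(start, end):
    """Return the elapsed time between two datetimes in hours."""
    return (end - start).total_seconds() / 3600.0


def normalized_instance_hours(created_at, finished_at, cluster_size):
    """Elapsed hours between creation and finish, multiplied by cluster size."""
    return hours_between(created_at, finished_at) * cluster_size


def time_to_ready_seconds(created_at, ready_at):
    """Bootstrapping time: cluster creation until ready for the first step."""
    return (ready_at - created_at).total_seconds()


def run_time_seconds(ready_at, finished_at):
    """Run time: ready for the first step until the cluster is shut down."""
    return (finished_at - ready_at).total_seconds()


# Hypothetical example values for a 5-node cluster.
created_at = datetime(2017, 1, 1, 12, 0, 0)
ready_at = datetime(2017, 1, 1, 12, 10, 0)
finished_at = datetime(2017, 1, 1, 14, 0, 0)

instance_hours = normalized_instance_hours(created_at, finished_at, 5)
bootstrap_seconds = time_to_ready_seconds(created_at, ready_at)
job_seconds = run_time_seconds(ready_at, finished_at)
```

Keeping the calculations as pure functions of timestamps would let the same code serve both the `cluster-*` and `sparkjob-*` variants, with only the metric name differing at the point where the value is sent to statsd or logged.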
For the Python errors, we discussed this in IRC:
13:49 <•jezdez> so raven has processors: https://github.com/getsentry/raven-python/blob/master/raven/processors.py
13:49 <•jezdez> which are called whenever an error happens
13:49 <•jezdez> I think we could have one that listens for botocore exceptions and we can record them
13:50 <•jezdez> you can configure the sentry client with the processors to use our custom processor
13:50 <•jezdez> https://docs.sentry.io/clients/python/advanced/#client-arguments
13:52 <•jezdez> that would work for both celery and wsgi
13:52 <•jezdez> since the processors are called for either
13:53 <•jezdez> you'd have to be careful with database transactions during the calls
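
A rough sketch of the processor idea from the discussion, under several assumptions: the class `BotocoreErrorCounter` and the `error_counts` counter are hypothetical, the class only imitates the general shape of a raven processor (constructed with the client, `process(data)` returns the event data) without importing raven, and the event layout it reads (`data["exception"]["values"]` with `module` and `type` keys) is assumed rather than confirmed here. An in-memory `Counter` stands in for a statsd client, which also sidesteps the database transaction concern because nothing is written to the database.

```python
from collections import Counter

# Stand-in for a statsd client; a real setup would increment a counter there.
error_counts = Counter()


class BotocoreErrorCounter(object):
    """Hypothetical processor that counts botocore exceptions seen in events."""

    def __init__(self, client=None):
        self.client = client

    def process(self, data, **kwargs):
        # Assumed event layout: data["exception"]["values"] is a list of
        # dicts carrying "module" and "type" keys.
        values = (data.get("exception") or {}).get("values") or []
        for value in values:
            module = value.get("module") or ""
            if module.startswith("botocore"):
                error_counts[value.get("type") or "unknown"] += 1
        # Return the event unchanged so it is still reported as usual.
        return data


# Hypothetical event payloads, for illustration only.
aws_event = {
    "exception": {
        "values": [{"module": "botocore.exceptions", "type": "ClientError"}]
    }
}
other_event = {
    "exception": {"values": [{"module": "builtins", "type": "ValueError"}]}
}

processor = BotocoreErrorCounter()
processor.process(aws_event)
processor.process(other_event)
```

After these two calls only the botocore event has been counted. Following the discussion above, a real version would be registered through the Sentry client's processors configuration so that it runs for both Celery and WSGI errors.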