Add debugging and telemetry stats
Have been getting GH Issues recently that would benefit from some basic stats and system info. It should be easy for someone to dump them to help debug issues.
Imagining a method like `GoodJob.debug` to be run from a console, with the following fields to start:
- GoodJob version
- Ruby Version
- Rails Version
- Rails Env
- Execution mode
- Queues
- Polling Interval
- Notifier listening?
- Processes table (consider: maybe this is all data that should be going into the Processes table)
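Something roughly like this is what I'm picturing (the method and the configuration readers below are just illustrative, not existing GoodJob API):

```ruby
# Hypothetical console helper; the configuration readers are assumed names.
module GoodJob
  def self.debug
    {
      good_job_version: GoodJob::VERSION,
      ruby_version: RUBY_VERSION,
      rails_version: Rails.version,
      rails_env: Rails.env.to_s,
      # Assumed accessors on GoodJob.configuration:
      execution_mode: GoodJob.configuration.execution_mode,
      queues: GoodJob.configuration.queue_string,
      poll_interval: GoodJob.configuration.poll_interval
    }
  end
end

# From a Rails console:
# GoodJob.debug
```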
Seems like it would also be useful to have some counters on a `Scheduler.stats` too:
- Successfully executed jobs
- Failed executed jobs
- Job query misses (e.g. queried for a job but there were no executable jobs)
The ability to configure some logging output here would go a long way. We're running in AWS as an ECS Service: loglines once an hour that showed data like either `successful_jobs_counter=1234` or `successful_jobs_since_last_logline=123` would let us use CloudWatch to build some of the metrics that we'd find handy.
Alternatively, a "log job run metrics" option that let each job spit some metrics out to stdout like `name=Job1, exit_code=1, queue_latency=1234ms, execution_duration=123ms` would be great too.
Thanks for opening up #911. That is really helpful!
Also, I feel bad because now that I'm seeing what I wrote in this issue I'm like "hmm, what was I thinking?" 😓
Here are some thoughts that I want to bounce off of you. Sorry for the stream of consciousness:
- Should these be incrementers or gauges? If they're gauges then there needs to be a mechanism to "flush/reset" the values every interval.
- What should be triggering the stats dump? Should this be invoked from the command line, or should it be logged by GoodJob itself? If it's by GoodJob itself, should that log come from somewhere that already exists (for example, it gets logged every time the Scheduler tries to pick up a job), or would a developer overload an existing mechanism, e.g. an ActiveJob extension, or via cron (which would need to take a Proc; currently it only allows jobs), or some new self-contained, threaded TimerSet object that would wake up to log stats? I don't feel very strongly either way, but each has its tradeoffs.
- Then there is the content of the info being dumped:
- Would this only be intended for processes that have GoodJob and are executing jobs (e.g. the CLI or async web processes), or any process that has the GoodJob constant?
- `Scheduler`s are the smallest unit to track. A process that is executing jobs could have multiple schedulers. Though maybe the recently created `Capsule` is the correct unit to track. Though a capsule can have multiple schedulers, and a process could have multiple capsules if someone was initializing them manually. Maybe the Capsule is the thing that has stats on it, and multiple schedulers increment a shared Stats object owned by the Capsule. (I'm imagining in this case the stats are like "number of jobs finished/errored/discarded/retried since creation".)
- `Scheduler#stats` has some good stuff. I wonder if that would be a thing to extend, and also insert into `Process.current_state`. That would definitely surface a lot more details... though the Process state doesn't get flushed to the database currently except at startup.
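To sketch that last musing (class and method names here are hypothetical, nothing that exists in GoodJob today):

```ruby
require "concurrent"

# Hypothetical shared stats object owned by a Capsule; not part of GoodJob today.
class CapsuleStats
  COUNTERS = %i[finished errored discarded retried].freeze

  def initialize
    @counters = COUNTERS.to_h { |name| [name, Concurrent::AtomicFixnum.new(0)] }
    @created_at = Time.now
  end

  # Each Scheduler would call e.g. stats.increment(:finished) after executing a job.
  def increment(name)
    @counters.fetch(name).increment
  end

  # Point-in-time snapshot, suitable for logging or a metrics endpoint.
  def to_h
    @counters.transform_values(&:value).merge(since: @created_at)
  end
end
```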
Telemetry is tricky and a lot to get right. There is getting the metrics right, and then exposing them. Here's a stream of consciousness back for you.
A lot of good work has been done by other people on other queue tools, and the "export metrics to Prometheus" is a common pattern. https://www.rabbitmq.com/monitoring.html#rabbitmq-metrics is a great example. It's probably not a word-for-word copy, but good_job seems to basically have the "Cluster" and "Node" and "Queue" pieces that rabbitmq does.
From the rabbitmq docs:
Most of the metrics represent point-in-time absolute values. Some represent activity over a recent period of time (for example, GC runs and bytes reclaimed). The latter metrics are most useful when compared to their previous values and historical mean/percentile values.
I'd expect "current" metrics to be available from the processes within GoodJob (including both counters and gauges), but I would want to have "Write metrics all the time to the database" turned off to keep my database traffic down. (I know you have some other open issues about performance in large/high-volume installations of good_job -> I'd avoid making that worse with metrics)
Once you've got the metrics figured out, there are a couple of ways to expose them:
- Exporting to a prometheus compatible endpoint that prometheus can poll would probably make a lot of people happy. When something polls that endpoint, it would calculate what it needs to (e.g. the 'flush' as needed) and return the results.
- Supporting writing the metrics on a schedule, via a cron (maybe as a job itself?), to stdout would make an overlapping set of people happy. This would need to be configurable, and because time is hard, including a timestamp with the metrics would be important; the exact moment the log line is written, less so. It wouldn't be great for these to get bogged down in situations with high load/throughput/queue-pressure, so maybe not "as a job".
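Something like a small timer thread that emits one timestamped line per interval is the shape I'm imagining (a sketch using concurrent-ruby's TimerTask, which GoodJob already depends on; `fetch_metrics` is a placeholder, not a real API):

```ruby
require "concurrent"
require "json"
require "time"

# Placeholder for whatever metrics GoodJob ends up exposing.
def fetch_metrics
  { successful_jobs_counter: 0, queue_depth: 0 }
end

# A small timer thread (concurrent-ruby, already a GoodJob dependency) that emits
# one timestamped log line per interval for CloudWatch (or similar) to parse.
metrics_logger = Concurrent::TimerTask.new(execution_interval: 300) do
  $stdout.puts(fetch_metrics.merge(timestamp: Time.now.utc.iso8601).to_json)
end
metrics_logger.execute
```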
Scheduler#stats and Process.current_state both look interesting. I'm mainly interested in queue depth and latency for "dashboard" level graphs, and some of those more detailed things would be helpful for diagnosing problems.
As I mentioned, we're running CloudWatch off of stdout logs, so if GoodJob goes the prometheus route I'd have to wire up something to get from that endpoint into CloudWatch.
The easiest 'v0' starting point may be to wire up the existing metrics to a Prometheus-compatible HTTP endpoint in the frontend piece. Maybe these should actually be available from GoodJob without the "GUI"?
I'd prefer instead to build the minimum-satisficeable solution that solves your needs. People who need Prometheus can ask for it and be involved in that design.
I'm imagining that the data that you would need for cloudwatch would be the same metrics for a Prometheus endpoint, so I think the steps here would be:
- Identify the metrics
- Figure out the implementation for metrics.
- If they require storage or are heavy to fetch (e.g. making database calls), I might ask whether there is an alternative.
- I'm imagining they would need to be process-local, and the metrics system would need to sum/aggregate them to generate the cluster view.
- Then identify how to do the transport (e.g. logging, emitting, fetching)
Yep, that all sounds perfectly reasonable. The metrics I'm interested in are essentially/currently the cluster-wide metrics that rabbitmq exposes here: https://www.rabbitmq.com/monitoring.html#cluster-wide-metrics We're using a single queue, but once we add more queues the "Individual Queue Metrics" will be interesting.
From poking around the good_job source, most of these could probably each be a single SQL query (maybe with some index changes?) that is run when `.metrics()` is called, and for me: the ability to schedule (every 5 minutes?) the contents of `.metrics()` getting logged to stdout would do the trick.
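For example, a queue-depth metric could plausibly be a single query like this (a sketch only; `.metrics` doesn't exist today, and it assumes the current good_jobs columns `queue_name`/`scheduled_at`/`created_at`/`performed_at`/`finished_at`):

```ruby
# Hypothetical queue-depth metric: jobs waiting to execute, per queue.
sql = <<~SQL
  SELECT queue_name, COUNT(*) AS depth
  FROM good_jobs
  WHERE finished_at IS NULL
    AND performed_at IS NULL
    AND COALESCE(scheduled_at, created_at) <= NOW()
  GROUP BY queue_name
SQL
GoodJob::Job.connection.select_all(sql).to_a
```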
@ckdake could you pick 3 metrics from that list to start with? That would help me scope it down to an end-to-end implementation.
Also, some idle musings on this topic:
- A GoodJob scheduler (a pool of execution threads) can pull from multiple queues: either `*`, or `queue1,queue2`, or negative matching `*,-queue2`. Trying to count "capacity" at the individual queue-level could be hard. (I'm somewhat haunted by trying to explain this in #624)
- I think counting up jobs that are assigned to individual queues should be simple (though maybe slightly more complicated by #928). I think the core ops question here is "are jobs being assigned to a queue that has zero executors?"
To give an example:

Here's one of my apps that has:
- 2 processes.
- Each process runs 2 threadpools:
  - 2 threads to pull from the `whenever` queue (high latency)
  - 2 threads to pull from any queue except `whenever` (low latency)

So that means I have in total:
- 4 `whenever` threads
- 4 everything-else threads?

I think that's the complicated-ness I'm wondering about.
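For reference, that layout roughly corresponds to a queue string like the following (illustrative queue names; double-check the exclusion syntax against the README):

```ruby
# e.g. config/environments/production.rb (or the GOOD_JOB_QUEUES env var).
# Two thread pools per process: 2 threads dedicated to `whenever`,
# and 2 threads for everything except `whenever`.
config.good_job.queues = "whenever:2;-whenever:2"
```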
A great start would be letting me answer "how many items have been added to this queue?" "how many items are in this queue?" and "how many items have been processed in this queue?". A helpful next question would be "how long does it take to process each item in this queue?" and "are there any stuck jobs?" (count of individual jobs "older" than some threshold).
Thanks. I think that's 5 😆 And apologies for being super pedantic here, but want to surface the design questions to work through 😄
- "how many items are in this queue?" This seems simplest to break down. I assume "in this queue" is specifically "waiting to execute", and would exclude future-scheduled jobs. Is that right?
- "how many items have been added to this queue?" What would be the (default) time-period for this query? e.g. "how many items have been added to this in [in the last N minutes; since T]?"
- Would future-scheduled jobs be considered "added" at their scheduled time, or when created?
- In terms of "items" there is a small difference between jobs and retries/executions, because a job might be initially enqueued and execute on one queue, then retry and execute multiple more times on another queue. From the operational/scheduler perspective, it doesn't matter (they're all items to execute) but that might not be the mental model of the person enqueueing them. e.g. `items in queues >= jobs enqueued`. Would you need to distinguish between "jobs enqueued" and "jobs+retries enqueued"?
- "how many items have been processed in this queue?" Similar to the previous query.
- "how long does it take to process each item in this queue?"
- `total_latency = queuing_latency + runtime_latency`. Do you mean `total_latency`? Or the breakdown? (Queuing latency is useful for identifying a lack of capacity; total latency is useful for calculating capacity for meeting SLOs.)
- I assume you mean this as a calculation across multiple items. So an average, or p95, or max?
- Same question as the previous query: "over what time period?"
- Something that has bedeviled me on latency is how to handle null values. e.g. if a job is in the queue but has not started executing, should I exclude the record from the calculation, or should I calculate it as `COALESCE(performed_at, NOW()) - scheduled_at`?
- "are there any stuck jobs?" I think this is interesting/complicated.
- I wonder if the previous question would be equivalently alertable e.g. "What is the MAX runtime latency?" and then you could set an alert threshold in your tool.
It's a FIFO queue, so the first 3 would work :)
- for "how many in the queue". We are not (currently) using good_job to schedule jobs to happen at a particular time. It is the "waiting to execute" that will be most interesting there for starters for me. (gauge metric)
- I'm expecting "how many added" to be a total counter metric, increasing as each job is added. Not coupled to a timeperiod. Given that we're processing 100s of jobs and not 1000000000s of jobs, this is fine for our use case.
- Future scheduled jobs would be entirely separate, and only counted here when they are queued for execution.
- "retries" is a whole additional level of metrics. I would at some point like to know error/retry rates, but thats probably round 2
- I'm expecting "how many processes" to be a total counter too
- "latency"-> good question, I'm interested in all of the pieces, a good first version would be "time from when job is enqueued for execution until job is completed". Most of our jobs are super quick and are instrumented already with logs for job started -> job finished, so the piece good_job knows is how long they were queued before starting. mean,p95,max would all be helpful, and this would need a time range. if that time range approximately matched the polling interval of the monitoring tool, that might be ideal. Many tools use a very short polling interval (prometheus can do continual scraping, and requires special consideration over 2 minutes to avoid gaps in graphs), but I'd be good with a 5min rolling window. I'd expect the lantency to be for "completed jobs", so for you null example, that wouldn't be included in this metric. If the queue is piling up and not executing, I'd expect that to land in the first two metrics as a visible trend.
- Stuck: yep, complicated to get right, but alerting on the max individual latency would do it. I'm not sure of a great design to surface ALL latencies for ALL jobs from good_job to a monitoring tool, but surfacing a max_latency_in_past_X_timeframe from good_job could work.
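Roughly, the query I'm imagining for that 5-minute window would be something like this (a sketch, assuming the good_jobs columns `scheduled_at`/`created_at`/`performed_at`/`finished_at`; not an existing GoodJob query):

```ruby
# Hypothetical: queuing latency (enqueue -> start) for jobs finished in the last 5 minutes.
sql = <<~SQL
  SELECT queue_name,
         AVG(performed_at - COALESCE(scheduled_at, created_at)) AS mean_queue_latency,
         PERCENTILE_CONT(0.95) WITHIN GROUP (
           ORDER BY performed_at - COALESCE(scheduled_at, created_at)
         ) AS p95_queue_latency,
         MAX(performed_at - COALESCE(scheduled_at, created_at)) AS max_queue_latency
  FROM good_jobs
  WHERE finished_at >= NOW() - INTERVAL '5 minutes'
  GROUP BY queue_name
SQL
GoodJob::Job.connection.select_all(sql).to_a
```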
Not sure if this is much help, but I just glued this together from the scopes on `GoodJob::Job` to give a summary count for the jobs on each queue.
WITH cte_statuses AS (
  SELECT
    CASE
      WHEN finished_at IS NOT NULL
        THEN (CASE WHEN error IS NULL THEN 'succeeded' ELSE 'discarded' END)
      WHEN COALESCE(scheduled_at, created_at) <= NOW()
        THEN (CASE WHEN pg_locks.locktype IS NULL THEN 'queued' ELSE 'running' END)
      WHEN (serialized_params ->> 'executions')::integer > 1
        THEN 'retried'
      ELSE
        'scheduled'
    END AS status,
    queue_name,
    priority
  FROM good_jobs
  -- A job counts as "running" when its advisory lock (keyed on active_job_id) is held.
  LEFT JOIN pg_locks ON pg_locks.locktype = 'advisory'
    AND pg_locks.objsubid = 1
    AND pg_locks.classid = ('x' || substr(md5('good_jobs' || '-' || active_job_id::text), 1, 16))::bit(32)::int
    AND pg_locks.objid = (('x' || substr(md5('good_jobs' || '-' || active_job_id::text), 1, 16))::bit(64) << 32)::bit(32)::int
  WHERE retried_good_job_id IS NULL
)
SELECT status_list.status, queues.queue_name, COUNT(cte_statuses.status)
FROM
  (VALUES ('succeeded'), ('discarded'), ('queued'), ('running'), ('retried'), ('scheduled')) AS status_list(status)
  CROSS JOIN (SELECT DISTINCT queue_name FROM good_jobs) AS queues
  -- Join on both status and queue_name so each queue gets its own counts.
  LEFT JOIN cte_statuses ON cte_statuses.status = status_list.status
    AND cte_statuses.queue_name = queues.queue_name
GROUP BY
  status_list.status,
  queues.queue_name;
The results get fed into metrics in our monitoring tool for pattern analysis, reporting, alerting, etc... That, plus the duration information we can parse out of the ActiveJob logs, gets us what we need.
@jtannas fyi, rubygems.org does an interesting thing using the `GoodJob::JobsFilter`'s queries: https://github.com/rubygems/rubygems.org/blob/63f6ba818aaa21138a82c7d57da8380d261c5193/app/jobs/good_job_statsd_job.rb#L4-L10
I want to share it, but also warn that I still consider those private objects, so they're subject to change; write a test 😅
@bensheldon It would be pretty useful to keep track of how many times a given job has been executed and even more, count retries and errors on a per job basis. https://github.com/bensheldon/good_job/pull/984 added overall counts. What do you think about incorporating this into metrics?
I think we can use ActiveJob's instrumentation events for some metrics. For example, I believe execution time can be measured via `perform.active_job`. Because GoodJob is designed for ActiveJob, maybe they can be utilized?
https://guides.rubyonrails.org/active_support_instrumentation.html#active-job.
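For example, a subscriber along these lines could capture per-job execution duration from that event (just a sketch; where the measurement gets reported is up to the app):

```ruby
# Capture execution duration from ActiveJob's perform.active_job event.
ActiveSupport::Notifications.subscribe("perform.active_job") do |*args|
  event = ActiveSupport::Notifications::Event.new(*args)
  job = event.payload[:job]
  # event.duration is in milliseconds; report it wherever metrics go (logs, StatsD, etc.).
  Rails.logger.info("job=#{job.class.name} queue=#{job.queue_name} duration_ms=#{event.duration.round(1)}")
end
```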
By the way, I wanted to expose metrics to Prometheus but realized that it's not so straightforward. Metrics such as execution time should be measured per process, so they can be scraped by Prometheus. But metrics such as queue length are global. It's strange to return those metrics from the individual process. An exporter may be needed for those metrics.
Just dropping in to add we're also investigating options for reporting job latency & queued count, per queue, to CloudWatch in the interest of alerting and autoscaling. For our use case, the latency of the last started job in the queue is fine.
For a little context, we run a few different queues (critical, high, default, deferrable) and we really want to know if our critical queue is getting behind by more than a few minutes, and we'd like for our deferrable queue to just be permanently growing, if that makes sense.
💡 I just noticed https://github.com/discourse/prometheus_exporter#goodjob-metrics.
Just to note, this isn't in the latest release version of prometheus_exporter (2.0.8).