
Provide internal stats for better monitoring/alerting

Open jmprusi opened this issue 8 years ago • 10 comments

While trying to monitor Zync, we can only rely on the "/status/live" endpoint, which doesn't provide internal information...

So, my question:

  • Would it be possible to provide internal stats (retries, requests, failed requests, latencies, ...)?

Some ideas:

  • Push stats to a "statsd" server (see the sketch after this list).
  • Publish metrics in Prometheus format (https://github.com/prometheus/client_ruby).
  • Publish stats on an internal socket ...
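
To illustrate the first idea, a minimal sketch assuming the statsd-ruby gem; the metric names and host are made up and nothing here is wired into Zync:

# Minimal sketch of pushing stats to statsd (statsd-ruby gem; the exact
# require may differ by gem version, and the metric names are made up).
require 'statsd'

statsd = Statsd.new('localhost', 8125)

statsd.increment('zync.jobs.processed')   # counter: completed jobs
statsd.increment('zync.jobs.failed')      # counter: failed jobs
statsd.timing('zync.jobs.duration', 320)  # timer: job duration in ms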

Thanks.

jmprusi avatar Aug 08 '17 12:08 jmprusi

So there are a few stats about jobs that are available and can be exported easily:

Que.job_stats
=> [{"queue"=>"", "job_class"=>"ActiveJob::QueueAdapters::QueAdapter::JobWrapper", "count"=>36, "count_working"=>0, "count_errored"=>36, "highest_error_count"=>11, "oldest_run_at"=>2017-08-08 12:21:52 +0000}]

Que.worker_states # only when some work is being processed
=> [{"priority"=>100,
  "run_at"=>2017-08-08 12:23:02 +0000,
  "job_id"=>227,
  "job_class"=>"ActiveJob::QueueAdapters::QueAdapter::JobWrapper",
  "args"=>[{"job_class"=>"ProcessEntryJob", "job_id"=>"92f6a5eb-ce46-485c-9ff5-fa43e87286d7", "provider_job_id"=>nil, "queue_name"=>"default", "priority"=>nil, "arguments"=>[{"_aj_globalid"=>"gid://zync/Entry/91"}], "executions"=>0, "locale"=>"en"}],
  "error_count"=>0,
  "last_error"=>nil,
  "queue"=>"",
  "pg_backend_pid"=>45435,
  "pg_state"=>"idle",
  "pg_state_changed_at"=>2017-08-08 12:23:02 +0000,
  "pg_last_query"=>
   "SELECT a.attname\n" +
   "  FROM (\n" +
   "         SELECT indrelid, indkey, generate_subscripts(indkey, 1) idx\n" +
   "           FROM pg_index\n" +
   "          WHERE indrelid = '\"integrations\"'::regclass\n" +
   "            AND indisprimary\n" +
   "       ) i\n" +
   "  JOIN pg_attribute a\n" +
   "    ON a.attrelid = i.indrelid\n" +
   "   AND a.attnum = i.indkey[i.idx]\n" +
   " ORDER BY i.idx\n",
  "pg_last_query_started_at"=>2017-08-08 12:23:02 +0000,
  "pg_transaction_started_at"=>nil,
  "pg_waiting_on_lock"=>false}]

Everything that is not in the database and lives only in memory would not be easy to export, because each process would hold only its own copy; for example, it would not work with Puma cluster mode, which starts several processes.

Puma has a control endpoint that can expose some stats, but no latencies; just the number of workers running, etc.
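
For reference, enabling that control endpoint is roughly a one-liner in config/puma.rb; the token below is a placeholder:

# config/puma.rb - enable Puma's built-in control/status server
# ('s3cr3t' is a placeholder token)
activate_control_app 'tcp://127.0.0.1:9293', { auth_token: 's3cr3t' }

# then, for example:
#   curl 'http://127.0.0.1:9293/stats?token=s3cr3t'
# returns JSON with things like running threads, backlog and pool capacity
# (plus per-worker status in cluster mode), but no request latencies.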

What metrics exactly are you interested in?

We have a custom logger that in theory could aggregate some stats, but then there is the issue of running multiple processes in one pod, where this would not work (without exporting to something like statsd).
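
As a sketch of that idea (nothing like this is in Zync; the hash and keys are made up, and it is not thread-safe), ActiveJob's instrumentation events could be counted in-process like this, which also shows the limitation: each process would hold only its own numbers.

# Per-process sketch: count ActiveJob outcomes and durations via
# ActiveSupport::Notifications. Not thread-safe, and every process keeps
# its own copy of JOB_STATS, which is exactly the limitation described above.
JOB_STATS = Hash.new(0)

ActiveSupport::Notifications.subscribe('perform.active_job') do |_name, start, finish, _id, payload|
  job    = payload[:job].class.name
  failed = payload.key?(:exception)

  JOB_STATS["#{job}.#{failed ? 'failed' : 'ok'}"] += 1
  JOB_STATS["#{job}.duration_ms"] += ((finish - start) * 1000).round
end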

mikz avatar Aug 08 '17 12:08 mikz

OpenShift is moving to use Prometheus.

The Innovation week project proved Prometheus to be the best option for us, and so we're moving our monitoring to be based on it too.

I think this is one of those issues where maybe it's not the most optimal choice locally for one project, but it's the best globally across our many projects and different infrastructure pieces (and it's now starting to become part of the base platform all our stuff will run on). So, for standardization, making it "shared knowledge" and easing Operations, I'd like us to enable Prometheus monitoring on as many of our workloads as possible.

TBD "how". I don't really like linking in the monitoring solution into the application code (but that might end up being a necessary evil...).

If there was a way for the export of the stats from the app to be fairly generic (a bunch of text and numbers in flat files, or on STDOUT... or something), independent of any particular monitoring solution but able to be picked up by an exporter on the machine, then that would be ideal...
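
One generic way to do that (a sketch only; the directory, file name and metric name are made up) is to periodically write the numbers in the Prometheus text format to a flat file that node_exporter's textfile collector, or any other agent, can pick up:

# Sketch: dump a few numbers in Prometheus text format to a file that an
# exporter on the machine (e.g. node_exporter's textfile collector) reads.
def write_metrics(dir = '/var/lib/node_exporter/textfile')
  lines = [
    '# TYPE zync_que_jobs_errored gauge',
    "zync_que_jobs_errored #{Que.job_stats.sum { |s| s['count_errored'] }}"
  ]

  # write to a temp file and rename, so the collector never reads a partial file
  tmp = File.join(dir, '.zync.prom.tmp')
  File.write(tmp, lines.join("\n") + "\n")
  File.rename(tmp, File.join(dir, 'zync.prom'))
end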

If the ideal solution can't be done, I think we should live with the necessary evil (maybe wrap the stats-exporting code in a wrapper class to avoid polluting app code directly with Prometheus?) and make progress: enable monitoring while standardizing and easing our Ops lives.

andrewdavidmackenzie avatar Aug 08 '17 14:08 andrewdavidmackenzie

It would be nice to know:

  • Number of jobs running
  • Latencies per job
  • Successful runs per job
  • Retries per job
  • Failures per job

So we can create alerts based on high latency, number of failed jobs, too many retries, etc.

If you have some total counters, I can "try" to derive rates from their increase over time...

jmprusi avatar Aug 08 '17 14:08 jmprusi

@jmprusi looks like those stats can't really be extracted from what I pasted in https://github.com/3scale/zync/issues/42#issuecomment-320942790.

I guess only the "number of jobs running", as that is basically the count of elements in worker_states. Jobs that complete successfully are removed from the database.

The rest of the stats can be collected via our internal log subscriber, which already logs all this info to the standard log.

The remaining issue is how to publish them. Using local memory, as the Prometheus Ruby client does, is not compatible with Puma cluster mode.

We are not running the cluster mode right now, but possibly could in the future.

One issue I see with keeping it in local memory is that if the process crashes or gets killed for whatever reason, the information is lost.

I guess the easiest option for now is to just use local memory and investigate use of statsd in the future.
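
For reference, the local-memory option with client_ruby would look roughly like this; the metric names are illustrative, the exact constructor API differs between gem versions, and the caveats above about cluster mode and crashes still apply:

# Sketch of the local-memory approach with prometheus/client_ruby
# (0.x-style API; newer versions use keyword arguments).
require 'prometheus/client'

registry = Prometheus::Client.registry

JOBS_FAILED  = registry.counter(:zync_jobs_failed_total, 'Total failed jobs')
JOB_DURATION = registry.histogram(:zync_job_duration_seconds, 'Job duration in seconds')

# somewhere in the job error/completion handling:
JOBS_FAILED.increment(job_class: 'ProcessEntryJob')
JOB_DURATION.observe({ job_class: 'ProcessEntryJob' }, 0.42)

# config.ru - expose /metrics for Prometheus to scrape:
#   require 'prometheus/middleware/exporter'
#   use Prometheus::Middleware::Exporter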

mikz avatar Aug 08 '17 15:08 mikz

Any plans to use the Pushgateway? https://github.com/prometheus/client_ruby#pushgateway
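
For context, pushing the default registry with client_ruby would look roughly like this; the constructor shown is the 0.x-style positional API (newer versions take keyword arguments) and the gateway URL is a placeholder:

# Sketch: push the registry to a Pushgateway instead of being scraped.
require 'prometheus/client'
require 'prometheus/client/push'

registry = Prometheus::Client.registry

Prometheus::Client::Push
  .new('zync', nil, 'http://pushgateway.example.com:9091')
  .add(registry)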

mikz avatar Aug 08 '17 15:08 mikz

@mikz it has not been deployed yet as part of the monitoring stack, but we can talk about it.

@orimarti can you evaluate the deployment of the pushgateway?

jmprusi avatar Aug 10 '17 11:08 jmprusi

Deploying the Pushgateway is more or less easy; if you think you'll need it, let me know and I'll deploy it.

orimarti avatar Aug 14 '17 15:08 orimarti

I'd maybe try to keep it simple and start with just the library and a reasonably frequent scrape; if processes aren't dying too often, we could get something useful but simple without the Pushgateway?

If it's needed, then OK... but maybe let's not make it too complicated to start with, and see whether we have problems with that approach or not?

andrewdavidmackenzie avatar Aug 16 '17 09:08 andrewdavidmackenzie

@orimarti I will assign this one to you... so you can look for the best way to monitor Zync with @mikz

jmprusi avatar Aug 23 '17 07:08 jmprusi

#69 exposed job stats in a text format for Prometheus.

mikz avatar Oct 05 '17 09:10 mikz