A prometheus exporter for playbook metrics?
What is the idea?
I haven't yet looked at what this would look like in practice, but we have a lot of metrics in ara, such as:
- status, count and duration of playbooks, plays, tasks and results
- number of hosts and number of tasks per host
- number of tasks per module (e.g. the count from https://api.demo.recordsansible.org/api/v1/tasks?command=debug)
We could explore how to make these metrics useful for monitoring or fancy graphs.
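For instance, the per-module count mentioned above can be pulled straight from the API. A minimal sketch, assuming the tasks list endpoint returns a paginated payload with a `count` field; the filter name follows the example URL above (`command=debug`) and may differ between ara versions, and `task_count` and its injectable `get` parameter are invented here for illustration:

```python
import requests

def task_count(action, base_url="https://api.demo.recordsansible.org",
               get=requests.get):
    """Return the number of recorded tasks for a given module action."""
    response = get(f"{base_url}/api/v1/tasks",
                   params={"command": action, "limit": 1})  # only the count matters
    response.raise_for_status()
    # list endpoints in the ara API return a paginated payload with a "count" field
    return response.json()["count"]
```

The `get` parameter just makes the helper testable without hitting the network.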
Hi !
I think I can take a look at that. I'm not very familiar with ara, but I used it a while ago and I'm also a casual prometheus user. I'll dive into the code this weekend to see if I can be of any help 😉
@IlyesSemlali nice, thanks for looking at it! Feel free to reach out on slack or irc and I can help point you in the right direction.
@IlyesSemlali o/
Wanted to let you know that I've proposed an implementation to query ara in order to return metrics about tasks: https://review.opendev.org/#/c/760736/
For example, metrics from the last 1000 tasks:
> ara task metrics --limit 1000
+-----------------------+----------------+----------------+-------------+-----------+---------+---------+---------+
| action | duration_total | duration_avg | occurrences | completed | running | expired | unknown |
+-----------------------+----------------+----------------+-------------+-----------+---------+---------+---------+
| add_host | 0:00:02.896122 | 0:00:00.160896 | 18 | 18 | 0 | 0 | 0 |
| ara_playbook | 0:00:07.786123 | 0:00:00.432562 | 18 | 18 | 0 | 0 | 0 |
| ara_record | 0:00:44.732937 | 0:00:00.451848 | 99 | 99 | 0 | 0 | 0 |
| assemble | 0:00:03.295141 | 0:00:01.647570 | 2 | 2 | 0 | 0 | 0 |
| assert | 0:00:18.892062 | 0:00:00.174926 | 108 | 108 | 0 | 0 | 0 |
| command | 0:00:33.964852 | 0:00:00.640846 | 53 | 44 | 9 | 0 | 0 |
| copy | 0:00:16.585412 | 0:00:01.658541 | 10 | 10 | 0 | 0 | 0 |
| debug | 0:04:41.404371 | 0:00:01.839244 | 153 | 153 | 0 | 0 | 0 |
| fail | 0:00:01.752986 | 0:00:00.159362 | 11 | 11 | 0 | 0 | 0 |
| file | 0:00:19.609704 | 0:00:01.032090 | 19 | 19 | 0 | 0 | 0 |
| find | 0:00:00.828836 | 0:00:00.414418 | 2 | 2 | 0 | 0 | 0 |
| gather_facts | 0:00:47.211225 | 0:00:01.026331 | 46 | 46 | 0 | 0 | 0 |
| group_by | 0:00:01.583436 | 0:00:00.395859 | 4 | 4 | 0 | 0 | 0 |
| include_role | 0:00:18.616065 | 0:00:00.282062 | 66 | 66 | 0 | 0 | 0 |
| include_tasks | 0:00:29.881530 | 0:00:00.335748 | 89 | 89 | 0 | 0 | 0 |
| kolla_container_facts | 0:00:00.966302 | 0:00:00.966302 | 1 | 1 | 0 | 0 | 0 |
| kolla_docker | 0:01:45.791232 | 0:00:03.111507 | 34 | 34 | 0 | 0 | 0 |
| kolla_toolbox | 0:06:10.222628 | 0:00:04.936302 | 75 | 75 | 0 | 0 | 0 |
| merge_configs | 0:01:17.487438 | 0:00:02.869905 | 27 | 27 | 0 | 0 | 0 |
| modprobe | 0:00:01.730669 | 0:00:00.576890 | 3 | 3 | 0 | 0 | 0 |
| ping | 0:00:04.945961 | 0:00:00.549551 | 9 | 9 | 0 | 0 | 0 |
| set_fact | 0:00:38.433904 | 0:00:00.541323 | 71 | 71 | 0 | 0 | 0 |
| setup | 0:00:09.907015 | 0:00:01.100779 | 9 | 9 | 0 | 0 | 0 |
| shell | 0:00:00.775224 | 0:00:00.387612 | 2 | 2 | 0 | 0 | 0 |
| stat | 0:00:03.386744 | 0:00:00.211672 | 16 | 16 | 0 | 0 | 0 |
| sysctl | 0:00:09.490360 | 0:00:04.745180 | 2 | 2 | 0 | 0 | 0 |
| systemd | 0:00:00.290391 | 0:00:00.290391 | 1 | 1 | 0 | 0 | 0 |
| template | 0:01:16.878769 | 0:00:01.507427 | 51 | 51 | 0 | 0 | 0 |
| wait_for | 0:00:00.704144 | 0:00:00.704144 | 1 | 1 | 0 | 0 | 0 |
+-----------------------+----------------+----------------+-------------+-----------+---------+---------+---------+
The CLI framework lets us return that data in json or csv, which I suppose could then be made available by a prometheus exporter:
> ara task metrics --limit 1000 -f csv
"action","duration_total","duration_avg","occurrences","completed","running","expired","unknown"
"add_host","0:00:02.896122","0:00:00.160896",18,18,0,0,0
"ara_playbook","0:00:07.786123","0:00:00.432562",18,18,0,0,0
"ara_record","0:00:44.732937","0:00:00.451848",99,99,0,0,0
"assemble","0:00:03.295141","0:00:01.647570",2,2,0,0,0
"assert","0:00:18.892062","0:00:00.174926",108,108,0,0,0
"command","0:00:33.964852","0:00:00.640846",53,44,9,0,0
"copy","0:00:16.585412","0:00:01.658541",10,10,0,0,0
"debug","0:04:41.404371","0:00:01.839244",153,153,0,0,0
"fail","0:00:01.752986","0:00:00.159362",11,11,0,0,0
"file","0:00:19.609704","0:00:01.032090",19,19,0,0,0
"find","0:00:00.828836","0:00:00.414418",2,2,0,0,0
"gather_facts","0:00:47.211225","0:00:01.026331",46,46,0,0,0
"group_by","0:00:01.583436","0:00:00.395859",4,4,0,0,0
"include_role","0:00:18.616065","0:00:00.282062",66,66,0,0,0
"include_tasks","0:00:29.881530","0:00:00.335748",89,89,0,0,0
"kolla_container_facts","0:00:00.966302","0:00:00.966302",1,1,0,0,0
"kolla_docker","0:01:45.791232","0:00:03.111507",34,34,0,0,0
"kolla_toolbox","0:06:10.222628","0:00:04.936302",75,75,0,0,0
"merge_configs","0:01:17.487438","0:00:02.869905",27,27,0,0,0
"modprobe","0:00:01.730669","0:00:00.576890",3,3,0,0,0
"ping","0:00:04.945961","0:00:00.549551",9,9,0,0,0
"set_fact","0:00:38.433904","0:00:00.541323",71,71,0,0,0
"setup","0:00:09.907015","0:00:01.100779",9,9,0,0,0
"shell","0:00:00.775224","0:00:00.387612",2,2,0,0,0
"stat","0:00:03.386744","0:00:00.211672",16,16,0,0,0
"sysctl","0:00:09.490360","0:00:04.745180",2,2,0,0,0
"systemd","0:00:00.290391","0:00:00.290391",1,1,0,0,0
"template","0:01:16.878769","0:00:01.507427",51,51,0,0,0
"wait_for","0:00:00.704144","0:00:00.704144",1,1,0,0,0
Let me know what you think? This is for tasks, but we can take a similar approach for more granular host and result metrics too.
There is similar work in progress for ara playbook metrics and ara host metrics.
I don't have a lot of experience with prometheus, but there is a Python client library that we could use: https://pypi.org/project/prometheus-client/
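A minimal sketch (not the proposed implementation) of exposing such metrics with that library; the metric names are invented for illustration, and durations are assumed to already be converted to seconds:

```python
from prometheus_client import Gauge, start_http_server

task_occurrences = Gauge(
    "ara_task_occurrences", "Number of recorded tasks per action", ["action"])
task_duration_total = Gauge(
    "ara_task_duration_total_seconds", "Total task duration per action", ["action"])

def update_metrics(rows):
    """Update gauges from rows like {'action': ..., 'occurrences': ..., 'duration_seconds': ...}."""
    for row in rows:
        task_occurrences.labels(action=row["action"]).set(float(row["occurrences"]))
        task_duration_total.labels(action=row["action"]).set(row["duration_seconds"])

# To serve this, one would start the built-in HTTP server and refresh the
# gauges on an interval, roughly:
#   start_http_server(8000)  # exposes http://localhost:8000/metrics
#   while True:
#       update_metrics(fetch_rows())  # fetch_rows: hypothetical helper
#       time.sleep(30)
```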
@IlyesSemlali @dmsimard any further plans here?
@b-reich I am not using prometheus at this time and so I am not pursuing this right now.
You may want to look at the following CLI commands to find out if there is something that can help:
- https://ara.readthedocs.io/en/latest/cli.html#ara-host-metrics
- https://ara.readthedocs.io/en/latest/cli.html#ara-playbook-metrics
- https://ara.readthedocs.io/en/latest/cli.html#ara-task-metrics
Hello,
At work I use Ansible with ARA, and the possibility of having these metrics in the Prometheus/Grafana stack interests me a lot. I don't know if anyone is working on this, but on my end I'm forking the repo to try adding prometheus-client for a metrics page. If I manage to get a metrics page working, I will set up a first series of metrics and make a Grafana dashboard to go with it.
Hi @TibScript, this may surface on my end in the not too distant future. Did you end up with something that works?
I am experimenting and learning this as I go -- I don't believe I am using the right approach but I'm sharing this in case anyone has suggestions, comments or would like to improve on it: https://gist.github.com/dmsimard/68c149eea34dbff325c9e4e9c39980a0

I've included a sample of the /metrics endpoint in the gist.
Edit: I forgot to mention that something notably missing from this first iteration is playbook and task durations. I've tried to implement them using a Summary (instead of a Gauge) but haven't really figured out how they work yet.
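For reference, a Summary for durations would be along these lines; the metric and label names are invented. A Summary only keeps a running count and sum of observations (exported as `_count` and `_sum` series), and observations are stamped at scrape time rather than with ara's own timestamps:

```python
from prometheus_client import Summary

playbook_duration = Summary(
    "ara_playbook_duration_seconds", "Playbook duration", ["status"])

def observe_playbook(status, duration_seconds):
    """Record one playbook run's duration under its status label."""
    playbook_duration.labels(status=status).observe(duration_seconds)
```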
Here are two sample screenshots showing it works with a local prometheus instance:

Still not sure where this is going (maybe I should put this in a branch and a PR), but a few updates:
- Added support for querying results through pagination
- Query everything at boot via a result limit (i.e. ?limit=1000) and pagination
- Store the latest object timestamp so that the next scrape only picks up objects created after it, using ?created_after=<timestamp> (thanks to built-in support in the API)
I still want to do something about durations and timestamps but haven't gotten to it yet.
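The boot-time crawl plus incremental scrape described above could be sketched like this. The endpoint path and the limit/created_after/next fields follow the ara v1 API as used in this thread; `fetch_all` and its injectable `get` parameter are invented for illustration. A caller would remember the newest `created` value it saw and pass it as `created_after` on the next scrape:

```python
import requests

def fetch_all(endpoint, base_url, limit=1000, created_after=None,
              get=requests.get):
    """Yield every object from a paginated ara v1 list endpoint."""
    params = {"limit": limit}
    if created_after:
        params["created_after"] = created_after
    url = f"{base_url}/api/v1/{endpoint}"
    while url:
        payload = get(url, params=params).json()
        yield from payload["results"]
        url = payload.get("next")  # the API hands back a ready-made next-page URL
        params = None              # "next" already embeds the query string
```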
In the meantime, here's what it looks like to ingest every playbook, task and host from demo.recordsansible.org:
> ./prometheus_exporter.py
2023-02-20T22:45:23.327849: ara Prometheus exporter listening on http://0.0.0.0:8000/metrics
2023-02-20T22:45:23.328241: collecting playbook metrics # <-- with limit=1000
2023-02-20T22:45:59.038076: parsing metrics for 3641 playbooks
2023-02-20T22:45:59.128715: finished updating playbook metrics
2023-02-20T22:45:59.133970: collecting task metrics # <-- with limit=2500
2023-02-20T23:18:35.283720: parsing metrics for 557031 tasks
2023-02-20T23:18:40.065098: finished updating task metrics
2023-02-20T23:18:40.378556: collecting host metrics # <-- with limit=2500
2023-02-20T23:18:42.111483: parsing metrics for 9983 hosts
2023-02-20T23:18:42.238810: finished updating host metrics
2023-02-20T23:19:12.240766: collecting playbook metrics
2023-02-20T23:19:12.399595: finished updating playbook metrics
2023-02-20T23:19:12.399644: collecting task metrics
2023-02-20T23:19:12.694654: finished updating task metrics
2023-02-20T23:19:12.694715: collecting host metrics
2023-02-20T23:19:12.741372: finished updating host metrics
A random screenshot of the data:

It can be slow to boot up the exporter at first because it needs to scrape everything. This is for playbooks, tasks and hosts -- I haven't yet touched results, and there are 827,591 of those on the demo server.
This is multiple years of mostly integration test playbooks, but the performance will largely depend on the scale and volume of data. We should probably have an argument that controls how far back we crawl data -- a default of 90 days, for example?
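That time-window argument could look like the sketch below; the `--days` flag name and the ISO timestamp format are assumptions, not the actual ara CLI:

```python
import argparse
from datetime import datetime, timedelta, timezone

parser = argparse.ArgumentParser(description="ara prometheus exporter (sketch)")
parser.add_argument("--days", type=int, default=90,
                    help="only scrape objects created in the last N days")

def created_after(days):
    """ISO 8601 cutoff timestamp for the API's created_after filter."""
    return (datetime.now(timezone.utc) - timedelta(days=days)).isoformat()
```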
For a third iteration, I decided to move the standalone script into the ara CLI so it's possible to start the exporter by running ara prometheus.
I've opened up a branch and a PR so I will work there instead of the gist.
Hey @dmsimard, my familiarity with Prometheus is through Nautobot, which is also Django based; it leverages the django-prometheus library.
I believe it has a lot of built-in django metrics that you can enable by default, covering API performance, model performance, db queries, and the like, but the common thread seems to be that they can quickly add new things to look at about the data Nautobot holds as well.
For Ansible, I could see things like job/task performance, a host failure hot list (which hosts are failing most often), and the number of times a playbook was run. You already have a lot of great metrics that you show in various ways, but exporting to prometheus lets folks use something like Grafana to graph the things that are important to them with minimal effort or middleware for them to write. So in my opinion, just starting with some of the metrics about plays, hosts, tasks, etc. that you have now would get the ball rolling!
> Hey @dmsimard, my familiarity with Prometheus is through Nautobot, which is also Django based; it leverages the django-prometheus library.
> I believe it has a lot of built-in django metrics that you can enable by default, covering API performance, model performance, db queries, and the like, but the common thread seems to be that they can quickly add new things to look at about the data Nautobot holds as well.
:wave: @netopsengineer
I have not considered the django side of the metrics yet, but it's true that it can be useful and it's good to know, thanks!
If anyone wants to tackle this, they can go ahead as I continue iterating on the playbook metrics.
> For Ansible, I could see things like job/task performance, a host failure hot list (which hosts are failing most often), and the number of times a playbook was run. You already have a lot of great metrics that you show in various ways, but exporting to prometheus lets folks use something like Grafana to graph the things that are important to them with minimal effort or middleware for them to write. So in my opinion, just starting with some of the metrics about plays, hosts, tasks, etc. that you have now would get the ball rolling!
Yes, we can consider that one of the objectives is pretty graphs about playbook metrics in grafana :p
While I have been a user (and operator) of both prometheus and grafana, I have been mostly privileged by the fact that so many exporters and graphs had already been written, so until now I have not needed to truly learn how it all works underneath. There be dragons.
The challenge is parsing data into the right formats (and field types) as a proper time series with the timestamps provided by ara -- not the time the metric sample is taken. Then, we probably need to do some math and find the right arcane grafana or promql query to produce the pretty graphs.
If anyone wants to help or point me in the right direction, head to the PR :pray:
I have come across this insightful mailing list thread about ingesting metrics with supplied timestamps: https://groups.google.com/g/prometheus-users/c/YqFc1MZLCsM
There is a suggestion to try Histograms instead of Gauges so I will look into that next.
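A Histogram version of the playbook duration metric might look like the sketch below; the metric name and bucket boundaries are guesses, not values from the PR. Histograms export cumulative `_bucket` counters plus `_count` and `_sum`, which work well with PromQL's `histogram_quantile`:

```python
from prometheus_client import Histogram

playbook_duration_hist = Histogram(
    "ara_playbooks_duration_seconds",
    "Distribution of playbook durations",
    # guessed bucket edges: 30s, 1m, 5m, 15m, 30m, 1h, and everything above
    buckets=(30, 60, 300, 900, 1800, 3600, float("inf")),
)

playbook_duration_hist.observe(125.0)  # a playbook that took about 2 minutes
```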
Edit: some additional reading I've come across about setting timestamps on metrics:
- https://github.com/prometheus/client_python/issues/594
- https://github.com/prometheus/client_python/issues/588
- https://github.com/prometheus/client_python/issues/725