WIP: First iteration of a prometheus exporter for ara
As discussed on the issue for this topic: https://github.com/ansible-community/ara/issues/177
It's not finished and still very much a WIP but I figured it might be worthwhile to iterate under a branch in a PR instead of the gist: https://gist.github.com/dmsimard/68c149eea34dbff325c9e4e9c39980a0
If prometheus_client is installed, there will be an ara prometheus command to expose prometheus metrics gathered and parsed from an ara instance:
usage: ara prometheus [-h] [--client <client>] [--server <url>] [--timeout <seconds>] [--username <username>] [--password <password>] [--ssl-cert <path/to/certificate>] [--ssl-key <path/to/key>] [--ssl-ca <path/to/cacert>] [--insecure]
[--playbook-limit PLAYBOOK_LIMIT] [--task-limit TASK_LIMIT] [--host-limit HOST_LIMIT] [--poll-frequency POLL_FREQUENCY] [--prometheus-port PROMETHEUS_PORT]
Exposes a prometheus exporter to provide metrics from an instance of ara
options:
-h, --help show this help message and exit
--client <client>
API client to use, defaults to ARA_API_CLIENT or 'offline'
--server <url>
API server endpoint if using http client, defaults to ARA_API_SERVER or 'http://127.0.0.1:8000'
--timeout <seconds>
Timeout for requests to API server, defaults to ARA_API_TIMEOUT or 30
--username <username>
API server username for authentication, defaults to ARA_API_USERNAME or None
--password <password>
API server password for authentication, defaults to ARA_API_PASSWORD or None
--ssl-cert <path/to/certificate>
If a client certificate is required, the path to the certificate to use, defaults to ARA_API_CERT or None
--ssl-key <path/to/key>
If a client certificate is required, the path to the private key to use, defaults to ARA_API_KEY or None
--ssl-ca <path/to/cacert>
Path to a certificate authority for trusting the API server certificate, defaults to ARA_API_CA or None
--insecure Ignore SSL certificate validation, defaults to ARA_API_INSECURE or False
--playbook-limit PLAYBOOK_LIMIT
Max number of playbooks to request at once (default: 1000)
--task-limit TASK_LIMIT
Max number of tasks to request at once (default: 2500)
--host-limit HOST_LIMIT
Max number of hosts to request at once (default: 2500)
--poll-frequency POLL_FREQUENCY
Seconds to wait until querying ara for new metrics (default: 60)
--prometheus-port PROMETHEUS_PORT
Port on which the prometheus exporter will listen (default: 8001)
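For readers unfamiliar with how `--poll-frequency` is meant to behave, here is a minimal sketch of a polling loop, not the code in this PR. The `fetch` and `publish` callables and the `max_iterations` parameter are hypothetical names introduced only for illustration; the actual command relies on prometheus_client to serve the metrics.

```python
import time
from typing import Callable, Dict, Optional


def poll(
    fetch: Callable[[], Dict[str, float]],
    publish: Callable[[Dict[str, float]], None],
    poll_frequency: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    max_iterations: Optional[int] = None,
) -> int:
    """Query metrics, publish them, then wait poll_frequency seconds.

    fetch and publish are hypothetical callables, not ara functions.
    max_iterations only exists so the loop can be stopped in a test;
    with the default of None the loop runs until interrupted.
    Returns the number of completed iterations.
    """
    iterations = 0
    while max_iterations is None or iterations < max_iterations:
        publish(fetch())
        iterations += 1
        sleep(poll_frequency)
    return iterations
```

Injecting `sleep` keeps the loop easy to exercise without actually waiting a minute between iterations.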
Build failed. https://ansible.softwarefactory-project.io/zuul/buildset/f9d8f487b49d447d8f37dc2007613d34
:heavy_check_mark: ara-tox-py3 SUCCESS in 4m 09s :x: ara-tox-linters FAILURE in 3m 32s :heavy_check_mark: ara-basic-ansible-core-devel SUCCESS in 5m 33s (non-voting) :heavy_check_mark: ara-basic-ansible-6 SUCCESS in 5m 09s :heavy_check_mark: ara-basic-ansible-core-2.14 SUCCESS in 5m 35s :heavy_check_mark: ara-basic-ansible-core-2.13 SUCCESS in 5m 03s :heavy_check_mark: ara-basic-ansible-core-2.12 SUCCESS in 5m 04s :heavy_check_mark: ara-basic-ansible-core-2.11 SUCCESS in 5m 20s :heavy_check_mark: ara-basic-ansible-2.9 SUCCESS in 5m 08s :heavy_check_mark: ara-container-images SUCCESS in 11m 19s
I've added a bit more context in the issue (https://github.com/ansible-community/ara/issues/177#issuecomment-1442713388) and got two quick iterations in:
- Added --max-days to limit backfill at boot
- Added a bit of verbosity
- Adjusted hosts to be scanned before tasks (there are way, way more tasks than hosts in terms of volume)
- First try at a playbook histogram containing the timestamp and duration
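To illustrate the idea behind `--max-days`, here is a rough sketch of limiting backfill to recent records. It assumes each record carries an `updated` timestamp shaped like `2023-06-08T02:43:29.665787Z`; the helper names are hypothetical, and the actual change may well filter on the API side instead of in Python.

```python
from datetime import datetime, timedelta, timezone

# Assumed timestamp layout, e.g. "2023-06-08T02:43:29.665787Z"
ARA_TIMESTAMP = "%Y-%m-%dT%H:%M:%S.%fZ"


def parse_updated(value):
    """Parse an 'updated' timestamp into a timezone-aware UTC datetime."""
    return datetime.strptime(value, ARA_TIMESTAMP).replace(tzinfo=timezone.utc)


def within_max_days(records, max_days, now=None):
    """Keep only the records updated within the last max_days days.

    'now' can be passed explicitly to make the cutoff reproducible.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_days)
    return [r for r in records if parse_updated(r["updated"]) >= cutoff]
```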
Edit: I've put up an example /metrics response from a single playbook's metric as a histogram in the gist: https://gist.github.com/dmsimard/68c149eea34dbff325c9e4e9c39980a0#file-playbooks_as_histogram-txt
Prometheus groups metrics based on the uniqueness of their labels. I suppose in our case we want each playbook to be represented individually, so should we include its id? More on that later.
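The grouping behaviour can be sketched without any Prometheus library: observations that share the exact same label values end up in the same series. The playbook records below are made-up values and `group_by_labels` is a hypothetical helper, used only to show what adding the id as a label changes.

```python
from collections import defaultdict


def group_by_labels(observations, label_names):
    """Group durations by the values of the chosen labels.

    Each distinct combination of label values is one series, which is
    roughly how a labelled metric aggregates its observations.
    """
    series = defaultdict(list)
    for obs in observations:
        key = tuple((name, str(obs[name])) for name in label_names)
        series[key].append(obs["duration"])
    return dict(series)


# Made-up playbooks for illustration only
playbooks = [
    {"id": 29, "path": "failed.yaml", "status": "failed", "duration": 1.5},
    {"id": 30, "path": "smoke.yaml", "status": "completed", "duration": 12.0},
    {"id": 31, "path": "smoke.yaml", "status": "completed", "duration": 14.0},
]

without_id = group_by_labels(playbooks, ["path", "status"])
with_id = group_by_labels(playbooks, ["id", "path", "status"])
```

Without the id, the two runs of `smoke.yaml` collapse into a single series. With it, every playbook gets its own series, at the cost of one new series per recorded playbook.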
Build failed. https://ansible.softwarefactory-project.io/zuul/buildset/d069974d12c14515aded43c6df617003
:heavy_check_mark: ara-tox-py3 SUCCESS in 3m 24s :x: ara-tox-linters FAILURE in 3m 15s :heavy_check_mark: ara-basic-ansible-core-devel SUCCESS in 5m 50s (non-voting) :heavy_check_mark: ara-basic-ansible-6 SUCCESS in 5m 09s :heavy_check_mark: ara-basic-ansible-core-2.14 SUCCESS in 5m 26s :heavy_check_mark: ara-basic-ansible-core-2.13 SUCCESS in 5m 15s :heavy_check_mark: ara-basic-ansible-core-2.12 SUCCESS in 5m 16s :heavy_check_mark: ara-basic-ansible-core-2.11 SUCCESS in 6m 29s :heavy_check_mark: ara-basic-ansible-2.9 SUCCESS in 5m 28s :heavy_check_mark: ara-container-images SUCCESS in 11m 56s
I think my brain is starting to understand what is happening.
I've temporarily commented out the current iteration of the playbook metrics until I revisit it with newfound knowledge.
This latest iteration reworks the host and task metrics to have gauges per status, so that we are able to do graphs like these, for example:
[Image: Prometheus task results in grafana]
[Image: Prometheus host results in grafana]
A snippet of what this looks like when querying the prometheus exporter:
# HELP ara_tasks_total Number of tasks recorded by ara in prometheus
# TYPE ara_tasks_total gauge
ara_tasks_total 403.0
# HELP ara_tasks_range Limit metric collection to the N most recent tasks
# TYPE ara_tasks_range gauge
ara_tasks_range 2500.0
# HELP ara_tasks_completed Completed Ansible tasks
# TYPE ara_tasks_completed gauge
ara_tasks_completed{action="command",duration="00:00:00.294820",name="Echo the �abc binary string",path="/home/dmsimard/dev/git/ansible-community/ara/tests/integration/smoke.yaml",playbook="30",results="1",status="completed",updated="2023-06-08T02:43:29.665787Z"} 1.0
ara_tasks_completed{action="debug",duration="00:00:00.155210",name="Task with non-ascii characters - ä, ö, ü",path="/home/dmsimard/dev/git/ansible-community/ara/tests/integration/smoke.yaml",playbook="30",results="1",status="completed",updated="2023-06-08T02:43:29.317583Z"} 1.0
ara_tasks_completed{action="gather_facts",duration="00:00:01.035601",name="Gathering Facts",path="/home/dmsimard/dev/git/ansible-community/ara/tests/integration/smoke.yaml",playbook="30",results="1",status="completed",updated="2023-06-08T02:43:29.098823Z"} 1.0
# HELP ara_tasks_failed Failed Ansible tasks
# TYPE ara_tasks_failed gauge
ara_tasks_failed{action="command",duration="00:00:00.455411",name="smoke-tests : Return false",path="/home/dmsimard/dev/git/ansible-community/ara/tests/integration/roles/smoke-tests/tasks/test-ops.yaml",playbook="30",results="1",status="failed",updated="2023-06-08T02:43:25.190901Z"} 1.0
ara_tasks_failed{action="fail",duration="00:00:00.210469",name="fail",path="/home/dmsimard/dev/git/ansible-community/ara/tests/integration/failed.yaml",playbook="29",results="1",status="failed",updated="2023-06-08T02:43:07.648379Z"} 1.0
ara_tasks_failed{action="fail",duration="00:00:00.219566",name="Generate a failure that will be rescued",path="/home/dmsimard/dev/git/ansible-community/ara/tests/integration/lookups.yaml",playbook="26",results="1",status="failed",updated="2023-06-08T02:32:51.180755Z"} 1.0
# ...
# HELP ara_hosts_total Hosts recorded by ara
# TYPE ara_hosts_total gauge
ara_hosts_total 43.0
# HELP ara_hosts_range Limit metric collection to the N most recent hosts
# TYPE ara_hosts_range gauge
ara_hosts_range 2500.0
# HELP ara_hosts_changed Number of changes on a host
# TYPE ara_hosts_changed gauge
ara_hosts_changed{name="localhost",playbook="30",updated="2023-06-08T02:43:29.848077Z"} 10.0
ara_hosts_changed{name="localhost",playbook="28",updated="2023-06-08T02:33:20.625359Z"} 1.0
ara_hosts_changed{name="localhost",playbook="26",updated="2023-06-08T02:32:54.179356Z"} 1.0
# HELP ara_hosts_failed Number of failures on a host
# TYPE ara_hosts_failed gauge
ara_hosts_failed{name="localhost",playbook="29",updated="2023-06-08T02:43:07.767992Z"} 1.0
ara_hosts_failed{name="localhost",playbook="24",updated="2023-06-08T02:32:18.773096Z"} 1.0
ara_hosts_failed{name="localhost",playbook="23",updated="2023-06-08T02:04:04.810142Z"} 1.0
# ...
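For anyone reading the output above for the first time, this is a small sketch of how one sample line in the text format is put together: metric name, labels in braces, then the value. The exporter itself leaves this to prometheus_client; `render_sample` and `escape_label_value` are hypothetical helpers written only to show the shape of a line.

```python
def escape_label_value(value):
    """Escape backslashes, double quotes and newlines in a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_sample(name, labels, value):
    """Render one sample as 'name{label="value",...} 1.0'."""
    if not labels:
        return f"{name} {float(value)}"
    pairs = ",".join(
        f'{key}="{escape_label_value(str(val))}"'
        for key, val in sorted(labels.items())
    )
    return f"{name}{{{pairs}}} {float(value)}"


line = render_sample(
    "ara_hosts_changed",
    {"name": "localhost", "playbook": "30", "updated": "2023-06-08T02:43:29.848077Z"},
    10,
)
```

Here `line` reproduces the first `ara_hosts_changed` sample from the snippet above.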
Build failed. https://ansible.softwarefactory-project.io/zuul/buildset/75ed0374bc6e4344af27503fe6350e60
:heavy_check_mark: ara-tox-py3 SUCCESS in 9m 57s :x: ara-tox-linters FAILURE in 9m 48s :heavy_check_mark: ara-basic-ansible-core-devel SUCCESS in 4m 59s (non-voting) :heavy_check_mark: ara-basic-ansible-6 SUCCESS in 6m 11s :heavy_check_mark: ara-basic-ansible-core-2.14 SUCCESS in 6m 01s :heavy_check_mark: ara-basic-ansible-core-2.13 SUCCESS in 10m 57s :heavy_check_mark: ara-basic-ansible-core-2.12 SUCCESS in 10m 38s :heavy_check_mark: ara-basic-ansible-core-2.11 SUCCESS in 10m 51s :heavy_check_mark: ara-basic-ansible-2.9 SUCCESS in 10m 50s :heavy_check_mark: ara-container-images SUCCESS in 17m 13s
Build failed. https://ansible.softwarefactory-project.io/zuul/buildset/4c6c9dea87f14d93aa1ec28b71ebc083
:heavy_check_mark: ara-tox-py3 SUCCESS in 4m 14s :x: ara-tox-linters FAILURE in 3m 12s :heavy_check_mark: ara-basic-ansible-core-devel SUCCESS in 6m 20s (non-voting) :heavy_check_mark: ara-basic-ansible-6 SUCCESS in 7m 07s :heavy_check_mark: ara-basic-ansible-core-2.14 SUCCESS in 8m 02s :heavy_check_mark: ara-basic-ansible-core-2.13 SUCCESS in 6m 20s :heavy_check_mark: ara-basic-ansible-core-2.12 SUCCESS in 5m 32s :heavy_check_mark: ara-basic-ansible-core-2.11 SUCCESS in 6m 17s :heavy_check_mark: ara-basic-ansible-2.9 SUCCESS in 5m 40s :heavy_check_mark: ara-container-images SUCCESS in 11m 13s
Build succeeded. https://ansible.softwarefactory-project.io/zuul/buildset/59731f5a132942749960db45ae05a18a
:heavy_check_mark: ara-tox-py3 SUCCESS in 4m 15s :heavy_check_mark: ara-tox-linters SUCCESS in 3m 57s :heavy_check_mark: ara-basic-ansible-core-devel SUCCESS in 7m 09s (non-voting) :heavy_check_mark: ara-basic-ansible-6 SUCCESS in 6m 09s :heavy_check_mark: ara-basic-ansible-core-2.14 SUCCESS in 6m 24s :heavy_check_mark: ara-basic-ansible-core-2.13 SUCCESS in 6m 01s :heavy_check_mark: ara-basic-ansible-core-2.12 SUCCESS in 6m 30s :heavy_check_mark: ara-basic-ansible-core-2.11 SUCCESS in 6m 08s :heavy_check_mark: ara-basic-ansible-2.9 SUCCESS in 6m 31s :heavy_check_mark: ara-container-images SUCCESS in 11m 36s
Lots of cleanup in this last iteration and I've done some tweaking on the grafana dashboard.
It looks like this now:
Build failed. https://ansible.softwarefactory-project.io/zuul/buildset/0eed3702b4444312b85e762bc95e51dc
:heavy_check_mark: ara-tox-py3 SUCCESS in 3m 12s :x: ara-tox-linters FAILURE in 3m 12s :heavy_check_mark: ara-basic-ansible-core-devel SUCCESS in 6m 16s (non-voting) :heavy_check_mark: ara-basic-ansible-6 SUCCESS in 5m 58s :heavy_check_mark: ara-basic-ansible-core-2.14 SUCCESS in 5m 20s :heavy_check_mark: ara-basic-ansible-core-2.13 SUCCESS in 6m 54s :heavy_check_mark: ara-basic-ansible-core-2.12 SUCCESS in 4m 51s :heavy_check_mark: ara-basic-ansible-core-2.11 SUCCESS in 6m 03s :heavy_check_mark: ara-basic-ansible-2.9 SUCCESS in 5m 08s :heavy_check_mark: ara-container-images SUCCESS in 11m 33s
Build failed. https://ansible.softwarefactory-project.io/zuul/buildset/fe23cb058a504bc48f68b007b1d4de91
:heavy_check_mark: ara-tox-py3 SUCCESS in 3m 15s :x: ara-tox-linters FAILURE in 3m 07s :heavy_check_mark: ara-tox-docs SUCCESS in 7m 57s :heavy_check_mark: ara-basic-ansible-core-devel SUCCESS in 5m 09s (non-voting) :heavy_check_mark: ara-basic-ansible-6 SUCCESS in 5m 03s :heavy_check_mark: ara-basic-ansible-core-2.14 SUCCESS in 11m 10s :heavy_check_mark: ara-basic-ansible-core-2.13 SUCCESS in 5m 06s :heavy_check_mark: ara-basic-ansible-core-2.12 SUCCESS in 5m 06s :heavy_check_mark: ara-basic-ansible-core-2.11 SUCCESS in 4m 45s :heavy_check_mark: ara-basic-ansible-2.9 SUCCESS in 5m 08s :heavy_check_mark: ara-container-images SUCCESS in 10m 57s
I feel this is ready for a first look by a wider audience so I've asked around for testing and feedback:
- https://fosstodon.org/@ara/110582123918416479
- https://old.reddit.com/r/ansible/comments/14f65ik/experimental_prometheus_exporter_for_ansible/
The final implementation may change before landing (for example if I screwed up the metric types) but this will be useful to make sure we made the right decisions and can make the necessary changes before merging.
I am narrowing the scope of this first PR to playbooks, tasks and hosts for now. Results and plays can come in a later patch as necessary.
Build failed. https://ansible.softwarefactory-project.io/zuul/buildset/5332cbba06be4ca09a29ccbfe24bb719
:heavy_check_mark: ara-tox-py3 SUCCESS in 3m 50s :x: ara-tox-linters FAILURE in 3m 56s :heavy_check_mark: ara-tox-docs SUCCESS in 3m 58s :heavy_check_mark: ara-basic-ansible-core-devel SUCCESS in 6m 17s (non-voting) :heavy_check_mark: ara-basic-ansible-8 SUCCESS in 6m 03s :heavy_check_mark: ara-basic-ansible-core-2.15 SUCCESS in 6m 53s :heavy_check_mark: ara-basic-ansible-core-2.14 SUCCESS in 5m 23s :heavy_check_mark: ara-basic-ansible-2.9 SUCCESS in 6m 06s :heavy_check_mark: ara-container-images SUCCESS in 12m 00s
Build failed. https://ansible.softwarefactory-project.io/zuul/buildset/51c4f4164d66409bbf48568389543706
:heavy_check_mark: ara-tox-py3 SUCCESS in 3m 49s :x: ara-tox-linters FAILURE in 3m 53s :heavy_check_mark: ara-tox-docs SUCCESS in 3m 11s :heavy_check_mark: ara-basic-ansible-core-devel SUCCESS in 6m 03s (non-voting) :heavy_check_mark: ara-basic-ansible-8 SUCCESS in 6m 01s :heavy_check_mark: ara-basic-ansible-core-2.15 SUCCESS in 7m 29s :heavy_check_mark: ara-basic-ansible-core-2.14 SUCCESS in 7m 20s :heavy_check_mark: ara-basic-ansible-2.9 SUCCESS in 5m 55s :heavy_check_mark: ara-container-images SUCCESS in 11m 19s
Nothing special pushed, just rebased on top of latest master.
Build failed. https://ansible.softwarefactory-project.io/zuul/buildset/7f750024dd7b42b2987983a14fc3a884
:heavy_check_mark: ara-tox-py3 SUCCESS in 4m 05s :x: ara-tox-linters FAILURE in 3m 50s :heavy_check_mark: ara-tox-docs SUCCESS in 3m 15s :heavy_check_mark: ara-basic-ansible-core-devel SUCCESS in 6m 55s (non-voting) :heavy_check_mark: ara-basic-ansible-8 SUCCESS in 7m 00s :heavy_check_mark: ara-basic-ansible-core-2.15 SUCCESS in 6m 58s :heavy_check_mark: ara-basic-ansible-core-2.14 SUCCESS in 6m 21s :heavy_check_mark: ara-container-images SUCCESS in 13m 52s
I will eventually include it in the docs but in the meantime, I've come up with the following graph that explains how one might use the exporter:
                                      ┌──────────────────┐
    ┌────────────┐ promql ┌─────────┐ │ ansible-playbook │
    │ Prometheus │◄───────┤ Grafana │ │    (with ara)    │
    └──────┬─────┘        └─────────┘ └───────┬──────────┘
           │                                  │
           │ scrapes /metrics                 │ collects data
           │ & stores results                 │ & sends it
           │                                  │
┌──────────▼──────────┐               ┌───────▼────────┐
│ Prometheus Exporter ├──────────────►│ ara API server │
│ (prometheus_client) │ query metrics │   (django)  ┌──┴─────────┐
└─────────────────────┘               └─────────────┤ recorded   │
                                                    │ playbooks  │
                                                    └────────────┘
Hi, I was at the Ansible meetup at the OVH building in Montreal; your presentation was really good. In Prometheus, it is bad practice for a label value to change between scrapes for the same metric, because every distinct combination of label values creates a new time series. It is better to turn such a label into a metric of its own.
I think you could, for example, transform this metric: ara_tasks_completed{action="command", duration="00:00:00.294820", name="Echo the �abc binary string", path="/home/dmsimard/dev/git/ansible-community/ara/tests/integration/smoke.yaml", playbook="30", results="1", status="completed", updated="2023-06-08T02:43:29.665787Z"} 1.0
into several metrics: ara_tasks_status{action="command", name="Echo the abc binary string", path="/home/.......", playbook="30"} 1 (mapping each status name to an integer value: 1 for 'completed', 2 for 'running', 3 for 'failed')
ara_tasks_duration{action="command", name="Echo the abc binary string", path="/home/.......", playbook="30"} <number of seconds> (or microseconds if needed)
ara_tasks_results{action="command", name="Echo the abc binary string", path="/home/.......", playbook="30"} 1
We can work together to define the right metrics, and then write the corresponding Python for the exporter.
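As a starting point for that discussion, the split suggested above could be sketched roughly like this. It only covers the data transformation, not the prometheus_client wiring; the function names are hypothetical, the status codes are the ones proposed in this comment, and the duration parser assumes the `HH:MM:SS.ffffff` layout shown in the example (no days component).

```python
from datetime import timedelta

# Mapping proposed above: status name -> integer metric value
STATUS_CODES = {"completed": 1, "running": 2, "failed": 3}

# Labels that stay stable for a given task (no duration, status or timestamp)
STABLE_LABELS = ("action", "name", "path", "playbook")


def duration_to_seconds(duration):
    """Convert 'HH:MM:SS.ffffff' into a number of seconds."""
    hours, minutes, seconds = duration.split(":")
    delta = timedelta(hours=int(hours), minutes=int(minutes), seconds=float(seconds))
    return delta.total_seconds()


def split_task_metrics(task):
    """Turn one task record into three samples sharing the same labels."""
    labels = {key: str(task[key]) for key in STABLE_LABELS}
    return {
        "ara_tasks_status": (labels, float(STATUS_CODES[task["status"]])),
        "ara_tasks_duration": (labels, duration_to_seconds(task["duration"])),
        "ara_tasks_results": (labels, float(task["results"])),
    }


# Values taken from the example above; the path is left elided as written
task = {
    "action": "command",
    "name": "Echo the abc binary string",
    "path": "/home/.......",
    "playbook": "30",
    "status": "completed",
    "duration": "00:00:00.294820",
    "results": "1",
}
metrics = split_task_metrics(task)
```

Each of the three entries could then back its own gauge, with the changing values carried by the sample instead of by a label.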
Hi @voileux and thanks for reaching out!
What you suggest makes sense to me and it's worth looking into.
I don't have bandwidth to look into this /right now/ but I will revisit this in the near future.
Hello,
Depending on your goal here, it might be easier to limit the "exporter part" to what you want to monitor live (i.e. what you want to trigger alerts on).
For the visualization aspects, connect Grafana directly to your database with the matching Grafana data source:
- https://grafana.com/docs/grafana/latest/datasources/mysql/
- https://grafana.com/docs/grafana/latest/datasources/postgres/
Something like:
flowchart TD
G[Grafana] -->|promql <br/> visualize <b>alerts</b><br/> and correlate current metrics| P(Prometheus )
G -->|db datasource <br/> visualize <b>metrics</b> <br/>current and historical| D
W(alertmanager) -->|promql<br/>trigger alerts| P
P-->|scrapes /metrics<br/> stores data| E(Prometheus Exporter<br/>prometheus_client)
E --> |query metrics| D(ara API server <br/> django <br/>fa:fa-database recorded playbooks)
A(ansible playbook) -->|collects data<br/>& sends it| D
instead of (from your previous schema here)
flowchart TD
G[Grafana] -->|promql| P(Prometheus)
P-->|scrapes /metrics<br/> stores data| E(Prometheus Exporter<br/>prometheus_client)
E --> |query metrics| D(ara API server <br/> django <br/>fa:fa-database recorded playbooks)
A(ansible playbook) -->|collects data<br/>& sends it| D
(edit: I forgot to put the mermaid keyword, and took this opportunity to add alertmanager & clarify the schema equivalent to the one you presented before)
This does require you to rewrite your Grafana panels to use the proper SQL, and you will need to open a connection between Grafana and your DB.
It also avoids transforming the whole content of the DB into opentelemetry format and scraping it each time, so it will scale better :-D
Hi, I haven't revisited this in a little while but I wanted to say it was still on my radar and I plan to work on this some more in the near future.