Feature Request: Opsgenie/JSM Response logging
Feature Request:
- More verbose logging of responses from JSM, and log when there is no response from Opsgenie/JSM at all
- Expose metrics for alerts successfully delivered to JSM
Background:
- JSM recently had an outage. During that downtime I could see warnings about truncated messages, but the Alertmanager pod's logs showed no HTTP error codes and gave no indication of what the responses from JSM looked like.
Hey, thanks for raising an issue. I did some local testing to see how Alertmanager behaves right now, and I think the current implementation might already be verbose enough.
Here’s how you can reproduce the test yourself to see what I mean.
How to Reproduce the test
You just need a couple of terminals to simulate a JSM/Opsgenie API outage.
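These steps assume Alertmanager itself is already running locally. If it is not, here is a minimal sketch for starting it (the binary path and the config file name `alertmanager.yml` are assumptions on my part; `--log.level=debug` just makes the dispatcher messages shown in the findings below visible):

```sh
# Run a local Alertmanager against the config from the second step below,
# with debug logging so the dispatcher/notify lines show up.
./alertmanager --config.file=alertmanager.yml --log.level=debug
```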
- Mock the Opsgenie API locally (a looping variant is sketched after these steps):
echo -e "HTTP/1.1 503 Service Unavailable\nContent-Type: application/json\n\n{\"message\":\"JSM API is currently down for maintenance.\"}" | nc -l 9090
- Point the Alertmanager config at the mock Opsgenie API:
route:
  receiver: 'opsgenie'
receivers:
  - name: 'opsgenie'
    opsgenie_configs:
      - api_key: 'xxxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'
        api_url: 'http://localhost:9090'
        message: '{{ .CommonAnnotations.summary }}'
- Fire an alert using amtool
./amtool -v alert add alertname="TestWebServerDown3" \
service="frontend" \
severity="warning" \
--annotation='summary="The web server is not responding."' \
--annotation='description="The main Nginx server on host-01 is failing health checks."' \
--alertmanager.url="http://localhost:9093"
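Because `nc -l` exits after serving a single connection and Alertmanager keeps retrying the failed notification (see the "will retry later" line in the findings), it helps to keep the mock endpoint up across retries. A minimal sketch that just wraps the same one-liner from the first step in a loop:

```sh
# Serve a 503 for every connection Alertmanager opens, not only the first one.
while true; do
  echo -e "HTTP/1.1 503 Service Unavailable\nContent-Type: application/json\n\n{\"message\":\"JSM API is currently down for maintenance.\"}" | nc -l 9090
done
```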
My findings
After running the test described above, here is what I observed:
- Regarding the logging you mentioned in the issue: the current behaviour already looks verbose enough to me, since Alertmanager logs both the status code and the response body.
time=2025-07-28T11:46:25.620Z level=DEBUG source=dispatch.go:165 msg="Received alert" component=dispatcher alert=TestWebServerDown3[3808913][active]
time=2025-07-28T11:46:55.622Z level=DEBUG source=dispatch.go:530 msg=flushing component=dispatcher aggrGroup={}:{} alerts=[TestWebServerDown3[3808913][active]]
time=2025-07-28T11:46:55.624Z level=DEBUG source=opsgenie.go:137 msg="extracted group key" integration=opsgenie key={}:{}
time=2025-07-28T11:46:55.634Z level=WARN source=notify.go:867 msg="Notify attempt failed, will retry later" component=dispatcher receiver=opsgenie integration=opsgenie[0] aggrGroup={}:{} attempts=1 err="unexpected status code 503: {\"message\":\"JSM API is currently down for maintenance.\"}\n"
- Coming to the metrics, the `alertmanager_notifications_failed_total` metric incremented as expected. Since we also have `alertmanager_notifications_total`, you can easily get the success count in Prometheus with a simple query (total - failed), as sketched below. Adding a dedicated success metric feels redundant when the data is already there.
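For concreteness, the raw counters can be checked directly on the local Alertmanager from the test above (port 9093, as in the amtool command); the grep pattern below is only a filter, and the `integration="opsgenie"` label value matches the receiver configured above:

```sh
# Inspect the notification counters exposed by Alertmanager itself.
curl -s http://localhost:9093/metrics \
  | grep -E 'alertmanager_notifications(_failed)?_total.*integration="opsgenie"'

# In Prometheus (assuming it scrapes this Alertmanager), the success count is then:
#   sum(alertmanager_notifications_total{integration="opsgenie"})
#     - sum(alertmanager_notifications_failed_total{integration="opsgenie"})
```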
Given this, I'm not sure the proposed changes are needed. The current setup seems to provide enough detail to diagnose and monitor failures effectively.