
Feature Request: Opsgenie/JSM Response logging

Open · ndlanier opened this issue 6 months ago • 1 comment

Feature Request:

  • More verbose logging of responses from JSM; also log when there is no response from Opsgenie/JSM at all
  • Expose metrics for alerts successfully delivered to JSM

Background:

  • Recently JSM had an outage. During that downtime I could see warnings for truncated messages, but the logs from the Alertmanager pod did not show any HTTP error codes or what the responses from JSM looked like.

ndlanier · Jun 27 '25 19:06

Hey, thanks for raising an issue. I did some local testing to see how Alertmanager behaves right now, and I think the current implementation might already be verbose enough.

Here’s how you can reproduce the test yourself to see what I mean.

How to reproduce the test

You just need a couple of terminals to simulate a JSM/Opsgenie API outage.

  1. Mock the Opsgenie API locally (a loop variant that keeps the mock answering across retries is sketched after the steps):
echo -e "HTTP/1.1 503 Service Unavailable\nContent-Type: application/json\n\n{\"message\":\"JSM API is currently down for maintenance.\"}" | nc -l 9090
  2. Point the Alertmanager config at the mock Opsgenie API:
route:
  receiver: 'opsgenie'

receivers:
  - name: 'opsgenie'
    opsgenie_configs:
      - api_key: 'xxxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'
        api_url: 'http://localhost:9090'
        message: '{{ .CommonAnnotations.summary }}'
  3. Fire an alert using amtool:
./amtool -v alert add alertname="TestWebServerDown3" \
  service="frontend" \
  severity="warning" \
  --annotation='summary="The web server is not responding."' \
  --annotation='description="The main Nginx server on host-01 is failing health checks."' \
  --alertmanager.url="http://localhost:9093"
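One caveat with the mock in step 1: that nc one-liner exits after serving a single connection, while Alertmanager keeps retrying the failed notification. Below is a minimal sketch, assuming the same nc flavour as in step 1, that keeps the mock answering across retries, plus a variant for the "no response at all" case from the original request:

# keep answering 503 so every retry attempt gets a response
while true; do
  echo -e "HTTP/1.1 503 Service Unavailable\nContent-Type: application/json\n\n{\"message\":\"JSM API is currently down for maintenance.\"}" | nc -l 9090
done

# "no response" variant: accept the connection but never reply; the notifier
# should eventually give up with a timeout/connection error, surfaced through
# the same "Notify attempt failed" warning path
nc -l 9090 > /dev/null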

My findings

After running the test described above, here is what I observed:

  • Regarding the logging concern you mentioned in the issue, the current output looks verbose enough to me: Alertmanager logs both the status code and the response body.
time=2025-07-28T11:46:25.620Z level=DEBUG source=dispatch.go:165 msg="Received alert" component=dispatcher alert=TestWebServerDown3[3808913][active]
time=2025-07-28T11:46:55.622Z level=DEBUG source=dispatch.go:530 msg=flushing component=dispatcher aggrGroup={}:{} alerts=[TestWebServerDown3[3808913][active]]
time=2025-07-28T11:46:55.624Z level=DEBUG source=opsgenie.go:137 msg="extracted group key" integration=opsgenie key={}:{}
time=2025-07-28T11:46:55.634Z level=WARN source=notify.go:867 msg="Notify attempt failed, will retry later" component=dispatcher receiver=opsgenie integration=opsgenie[0] aggrGroup={}:{} attempts=1 err="unexpected status code 503: {\"message\":\"JSM API is currently down for maintenance.\"}\n"
  • Coming to the metrics, the alertmanager_notifications_failed_total metric incremented as expected. Since we also have alertmanager_notifications_total, you can easily get the success count in Prometheus with a simple query (total - failed); a sketch of that query follows the screenshot below. Adding a dedicated success metric feels redundant when the data is already there.
(screenshot: the alertmanager_notifications_failed_total counter incrementing after the failed notification attempt)
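For reference, here is a minimal sketch of the kind of query described above. The integration label selector is an assumption about how the counters are labelled in this setup; adjust it to match your own labels:

# successful notifications = total - failed
alertmanager_notifications_total{integration="opsgenie"}
  - alertmanager_notifications_failed_total{integration="opsgenie"}

# or, as a success ratio over the last 5 minutes
1 - (
  rate(alertmanager_notifications_failed_total{integration="opsgenie"}[5m])
  /
  rate(alertmanager_notifications_total{integration="opsgenie"}[5m])
)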

Given this, I'm not sure the proposed changes are needed. The current setup seems to provide enough detail to diagnose and monitor failures effectively.

pehlicd · Jul 28 '25 12:07