Error sending alert: bad response status 422 Unprocessable Entity
Describe the bug
We are running Cortex in microservices mode. With Cortex v1.16.0 we used the v1 Alertmanager API by setting the flag -ruler.alertmanager-use-v2=false. Now we have upgraded to Cortex v1.17.1, and from the logs I see the ruler uses the v2 Alertmanager API. When I create alert rules, the alerts fire, but we never get any email notification, and meanwhile we see error messages like:
caller=notifier.go:544 level=error user=Test alertmanager=https://cortex-alertmanager.org/alertmanager/api/v2/alerts count=1 msg="Error sending alert" err="bad response status 422 Unprocessable Entity"
To Reproduce
Steps to reproduce the behavior:
- Start Cortex (SHA or version): start Cortex v1.17.1 in microservices mode
- Perform Operations (Read/Write/Others): create an alert rule and observe the ruler's logs
Expected behavior
We should receive the email notifications, and no error should appear in the logs.
Environment:
- Infrastructure: bare-metal
- Deployment tool: Ansible (we deploy the Cortex microservices as systemd services)
Additional Context
Systemd unit for the Cortex ruler:
ExecStart=/usr/sbin/cortex-1.17.1 \
-auth.enabled=true \
-log.level=info \
-config.file=/etc/cortex-ruler/cortex-ruler.yaml \
-runtime-config.file=/etc/cortex-shared/cortex-runtime.yaml \
-server.http-listen-port=8061 \
-server.grpc-listen-port=9061 \
-server.grpc-max-recv-msg-size-bytes=104857600 \
-server.grpc-max-send-msg-size-bytes=104857600 \
-server.grpc-max-concurrent-streams=1000 \
\
-distributor.sharding-strategy=shuffle-sharding \
-distributor.ingestion-tenant-shard-size=12 \
-distributor.replication-factor=2 \
-distributor.shard-by-all-labels=true \
-distributor.zone-awareness-enabled=true \
\
-store.engine=blocks \
-blocks-storage.backend=s3 \
-blocks-storage.s3.endpoint=s3.org:10444 \
-blocks-storage.s3.bucket-name=staging-metrics \
-blocks-storage.s3.insecure=false \
\
-blocks-storage.bucket-store.sync-dir=/local/cortex-ruler/tsdb-sync \
-blocks-storage.bucket-store.metadata-cache.backend=memcached \
-blocks-storage.bucket-store.metadata-cache.memcached.addresses=100.76.51.1:11211,100.76.51.2:11211,100.76.51.3:11211 \
\
-querier.active-query-tracker-dir=/local/cortex-ruler/active-query-tracker \
-querier.ingester-streaming=true \
-querier.query-store-after=23h \
-querier.query-ingesters-within=24h \
-querier.shuffle-sharding-ingesters-lookback-period=25h \
\
-store-gateway.sharding-enabled=true \
-store-gateway.sharding-strategy=shuffle-sharding \
-store-gateway.tenant-shard-size=6 \
-store-gateway.sharding-ring.store=etcd \
-store-gateway.sharding-ring.etcd.endpoints=10.120.121.1:2379 \
-store-gateway.sharding-ring.etcd.endpoints=10.120.121.2:2379 \
-store-gateway.sharding-ring.etcd.endpoints=10.120.121.3:2379 \
-store-gateway.sharding-ring.etcd.endpoints=10.120.121.4:2379 \
-store-gateway.sharding-ring.etcd.endpoints=10.120.121.5:2379 \
-store-gateway.sharding-ring.prefix=cortex-store-gateways/ \
-store-gateway.sharding-ring.replication-factor=2 \
-store-gateway.sharding-ring.zone-awareness-enabled=true \
-store-gateway.sharding-ring.instance-availability-zone=t1 \
-store-gateway.sharding-ring.wait-stability-min-duration=1m \
-store-gateway.sharding-ring.wait-stability-max-duration=5m \
-store-gateway.sharding-ring.instance-addr=100.76.75.1 \
-store-gateway.sharding-ring.instance-id=s_8061 \
-store-gateway.sharding-ring.heartbeat-period=15s \
-store-gateway.sharding-ring.heartbeat-timeout=1m \
\
-ring.store=etcd \
-ring.prefix=cortex-ingesters/ \
-ring.heartbeat-timeout=1m \
-etcd.endpoints=10.120.119.1:2379 \
-etcd.endpoints=10.120.119.2:2379 \
-etcd.endpoints=10.120.119.3:2379 \
-etcd.endpoints=10.120.119.4:2379 \
-etcd.endpoints=10.120.119.5:2379 \
\
-ruler.enable-sharding=true \
-ruler.sharding-strategy=shuffle-sharding \
-ruler.tenant-shard-size=2 \
-ruler.ring.store=etcd \
-ruler.ring.prefix=cortex-rulers/ \
-ruler.ring.num-tokens=32 \
-ruler.ring.heartbeat-period=15s \
-ruler.ring.heartbeat-timeout=1m \
-ruler.ring.etcd.endpoints=10.120.119.1:2379 \
-ruler.ring.etcd.endpoints=10.120.119.2:2379 \
-ruler.ring.etcd.endpoints=10.120.119.3:2379 \
-ruler.ring.etcd.endpoints=10.120.119.4:2379 \
-ruler.ring.etcd.endpoints=10.120.119.5:2379 \
-ruler.ring.instance-id=s_8061 \
-ruler.ring.instance-interface-names=e1 \
\
-ruler.max-rules-per-rule-group=500 \
-ruler.max-rule-groups-per-tenant=5000 \
\
-ruler.external.url=staging-cortex-ruler.org \
-ruler.client.grpc-max-recv-msg-size=104857600 \
-ruler.client.grpc-max-send-msg-size=16777216 \
-ruler.client.grpc-compression= \
-ruler.client.grpc-client-rate-limit=0 \
-ruler.client.grpc-client-rate-limit-burst=0 \
-ruler.client.backoff-on-ratelimits=false \
-ruler.client.backoff-min-period=500ms \
-ruler.client.backoff-max-period=10s \
-ruler.client.backoff-retries=5 \
-ruler.evaluation-interval=15s \
-ruler.poll-interval=15s \
-ruler.rule-path=/local/cortex-ruler/rules \
-ruler.alertmanager-url=https://staging-cortex-alertmanager.org/alertmanager \
-ruler.alertmanager-discovery=false \
-ruler.alertmanager-refresh-interval=1m \
-ruler.notification-queue-capacity=10000 \
-ruler.notification-timeout=10s \
-ruler.flush-period=1m \
-experimental.ruler.enable-api=true \
\
-ruler-storage.backend=s3 \
-ruler-storage.s3.endpoint=s3.org:10444 \
-ruler-storage.s3.bucket-name=staging-rules \
-ruler-storage.s3.insecure=false \
\
-target=ruler
Systemd unit for the Cortex Alertmanager:
ExecStart=/usr/sbin/cortex-1.17.1 \
-auth.enabled=true \
-log.level=info \
-config.file=/etc/cortex-alertmanager-8071/cortex-alertmanager.yaml \
-runtime-config.file=/etc/cortex-shared/cortex-runtime.yaml \
-server.http-listen-port=8071 \
-server.grpc-listen-port=9071 \
-server.grpc-max-recv-msg-size-bytes=104857600 \
-server.grpc-max-send-msg-size-bytes=104857600 \
-server.grpc-max-concurrent-streams=1000 \
\
-alertmanager.storage.path=/local/cortex-alertmanager-8071/data \
-alertmanager.storage.retention=120h \
-alertmanager.web.external-url=https://staging-cortex-alertmanager.org/alertmanager \
-alertmanager.configs.poll-interval=1m \
-experimental.alertmanager.enable-api=true \
\
-alertmanager.sharding-enabled=true \
-alertmanager.sharding-ring.store=etcd \
-alertmanager.sharding-ring.prefix=cortex-alertmanagers/ \
-alertmanager.sharding-ring.heartbeat-period=15s \
-alertmanager.sharding-ring.heartbeat-timeout=1m \
-alertmanager.sharding-ring.etcd.endpoints=10.120.121.1:2379 \
-alertmanager.sharding-ring.etcd.endpoints=10.120.121.2:2379 \
-alertmanager.sharding-ring.etcd.endpoints=10.120.121.3:2379 \
-alertmanager.sharding-ring.etcd.endpoints=10.120.121.4:2379 \
-alertmanager.sharding-ring.etcd.endpoints=10.120.121.5:2379 \
-alertmanager.sharding-ring.instance-id=b_8071 \
-alertmanager.sharding-ring.instance-interface-names=e1 \
-alertmanager.sharding-ring.replication-factor=2 \
-alertmanager.sharding-ring.zone-awareness-enabled=true \
-alertmanager.sharding-ring.instance-availability-zone=t1 \
\
-alertmanager-storage.backend=s3 \
-alertmanager-storage.s3.endpoint=s3.org:10444 \
-alertmanager-storage.s3.bucket-name=staging-alerts \
-alertmanager-storage.s3.insecure=false \
\
-alertmanager.receivers-firewall-block-cidr-networks=10.163.131.164/28,10.163.131.180/28 \
-alertmanager.receivers-firewall-block-private-addresses=true \
-alertmanager.notification-rate-limit=0 \
-alertmanager.max-config-size-bytes=0 \
-alertmanager.max-templates-count=0 \
-alertmanager.max-template-size-bytes=0 \
\
-target=alertmanager
The configuration for Alertmanager:
template_files:
  default_template: |
    {{ define "__alertmanager" }}AlertManager{{ end }}
    {{ define "__alertmanagerURL" }}{{ .ExternalURL }}/#/alerts?receiver={{ .Receiver | urlquery }}{{ end }}
alertmanager_config: |
  global:
    smtp_smarthost: 'yourmailhost'
    smtp_from: 'youraddress'
    smtp_require_tls: false
  templates:
    - 'default_template'
  route:
    receiver: example-email
  receivers:
    - name: example-email
      email_configs:
        - to: 'youraddress'
Hi @friedrichg & @yeya24, I guess the error message "bad response status 422 Unprocessable Entity" came from Alertmanager, right? But I couldn't find any error log from Alertmanager even with the debug log level. Any suggestions would be appreciated!
Answering my own question so that others can refer to it.
I manually sent the HTTP request using curl and got the detailed response from alertmanager:
maxFailure (quorum) on a given error family, rpc error: code = Code(422) desc = addr=10.120.131.81:9071 state=ACTIVE zone=z1, rpc error: code = Code(422) desc = {"code":601,"message":"0.generatorURL in body must be of type uri: \"staging-cortex-ruler.org/graph?g0.expr=up%7Bapp%3D%22cert-manager%22%7D+%3E+0\u0026g0.tab=1\""}
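For reference, the request looked roughly like this (the label set is illustrative; the X-Scope-OrgID header carries the tenant name, since auth is enabled):

curl -s -i \
  -H 'Content-Type: application/json' \
  -H 'X-Scope-OrgID: Test' \
  -d '[{"labels":{"alertname":"Up"},"generatorURL":"staging-cortex-ruler.org/graph?g0.expr=up%7Bapp%3D%22cert-manager%22%7D+%3E+0&g0.tab=1"}]' \
  https://cortex-alertmanager.org/alertmanager/api/v2/alerts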
So I added the scheme "https://" at the beginning of the -ruler.external.url value, and then it worked.
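That is, in the ruler unit file above:

-ruler.external.url=https://staging-cortex-ruler.org \

With the scheme present, the generatorURL attached to each alert is a valid URI, so the Alertmanager v2 API accepts the payload.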
Mapping this to the code (sendOne in notifier.go, which sends alerts to Alertmanager):
func (n *Manager) sendOne(ctx context.Context, c *http.Client, url string, b []byte) error {
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(b))
	if err != nil {
		return err
	}
	req.Header.Set("User-Agent", userAgent)
	req.Header.Set("Content-Type", contentTypeJSON)
	resp, err := n.opts.Do(ctx, c, req)
	if err != nil {
		return err
	}
	defer func() {
		io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
	}()

	// Any HTTP status 2xx is OK.
	//nolint:usestdlibvars
	if resp.StatusCode/100 != 2 {
		return fmt.Errorf("bad response status %s", resp.Status)
	}
	return nil
}
Maybe we should include the response body in the error message as well? Currently we only include the status, which makes debugging difficult.
@friedrichg @yeya24 should we go ahead and start logging the body of the response? It makes sense IMHO.
@rapphil Agree. Would you like to work on it? I just want to make sure the AM doesn't send something crazy in the response body. Maybe we can truncate the message with a limit.
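A minimal sketch of that change against sendOne above, assuming we cap how much of the body we read (the 512-byte limit is illustrative, and it pulls in the strings package):

// Any HTTP status 2xx is OK.
if resp.StatusCode/100 != 2 {
	// Read a bounded prefix of the body so a misbehaving Alertmanager
	// can't blow up the error message (the 512-byte cap is illustrative).
	body, readErr := io.ReadAll(io.LimitReader(resp.Body, 512))
	if readErr != nil {
		return fmt.Errorf("bad response status %s", resp.Status)
	}
	return fmt.Errorf("bad response status %s: %s", resp.Status, strings.TrimSpace(string(body)))
}
return nil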
This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.