Error sending alert: bad response status 422 Unprocessable Entity

Open mousimin opened this issue 1 year ago • 5 comments

Describe the bug We are running Cortex in microservices mode. On Cortex v1.16.0 we used the v1 Alertmanager API by setting the flag -ruler.alertmanager-use-v2=false. After upgrading to Cortex v1.17.1, the logs show the ruler now uses the v2 Alertmanager API. When I create alert rules the alerts fire, but we never receive any email notification; instead we see error messages like: caller=notifier.go:544 level=error user=Test alertmanager=https://cortex-alertmanager.org/alertmanager/api/v2/alerts count=1 msg="Error sending alert" err="bad response status 422 Unprocessable Entity"

To Reproduce Steps to reproduce the behavior:

  1. Start Cortex (SHA or version): start Cortex v1.17.1 in microservices mode
  2. Perform Operations (Read/Write/Others): create an alert rule and observe the ruler logs

Expected behavior We should receive the email notifications, and no error log should appear.

Environment:

  • Infrastructure: bare-metal
  • Deployment tool: Ansible (systemd services for each Cortex microservice)

Additional Context The systemd ExecStart for the Cortex ruler:

ExecStart=/usr/sbin/cortex-1.17.1 \
  -auth.enabled=true \
  -log.level=info \
  -config.file=/etc/cortex-ruler/cortex-ruler.yaml \
  -runtime-config.file=/etc/cortex-shared/cortex-runtime.yaml \
  -server.http-listen-port=8061 \
  -server.grpc-listen-port=9061 \
  -server.grpc-max-recv-msg-size-bytes=104857600 \
  -server.grpc-max-send-msg-size-bytes=104857600 \
  -server.grpc-max-concurrent-streams=1000 \
  \
  -distributor.sharding-strategy=shuffle-sharding \
  -distributor.ingestion-tenant-shard-size=12 \
  -distributor.replication-factor=2 \
  -distributor.shard-by-all-labels=true \
  -distributor.zone-awareness-enabled=true \
  \
  -store.engine=blocks \
  -blocks-storage.backend=s3 \
  -blocks-storage.s3.endpoint=s3.org:10444 \
  -blocks-storage.s3.bucket-name=staging-metrics \
  -blocks-storage.s3.insecure=false \
  \
  -blocks-storage.bucket-store.sync-dir=/local/cortex-ruler/tsdb-sync \
  -blocks-storage.bucket-store.metadata-cache.backend=memcached \
  -blocks-storage.bucket-store.metadata-cache.memcached.addresses=100.76.51.1:11211,100.76.51.2:11211,100.76.51.3:11211 \
  \
  -querier.active-query-tracker-dir=/local/cortex-ruler/active-query-tracker \
  -querier.ingester-streaming=true \
  -querier.query-store-after=23h \
  -querier.query-ingesters-within=24h \
  -querier.shuffle-sharding-ingesters-lookback-period=25h \
  \
  -store-gateway.sharding-enabled=true \
  -store-gateway.sharding-strategy=shuffle-sharding \
  -store-gateway.tenant-shard-size=6 \
  -store-gateway.sharding-ring.store=etcd \
  -store-gateway.sharding-ring.etcd.endpoints=10.120.121.1:2379 \
  -store-gateway.sharding-ring.etcd.endpoints=10.120.121.2:2379 \
  -store-gateway.sharding-ring.etcd.endpoints=10.120.121.3:2379 \
  -store-gateway.sharding-ring.etcd.endpoints=10.120.121.4:2379 \
  -store-gateway.sharding-ring.etcd.endpoints=10.120.121.5:2379 \
  -store-gateway.sharding-ring.prefix=cortex-store-gateways/ \
  -store-gateway.sharding-ring.replication-factor=2 \
  -store-gateway.sharding-ring.zone-awareness-enabled=true \
  -store-gateway.sharding-ring.instance-availability-zone=t1 \
  -store-gateway.sharding-ring.wait-stability-min-duration=1m \
  -store-gateway.sharding-ring.wait-stability-max-duration=5m \
  -store-gateway.sharding-ring.instance-addr=100.76.75.1 \
  -store-gateway.sharding-ring.instance-id=s_8061 \
  -store-gateway.sharding-ring.heartbeat-period=15s \
  -store-gateway.sharding-ring.heartbeat-timeout=1m \
  \
  -ring.store=etcd \
  -ring.prefix=cortex-ingesters/ \
  -ring.heartbeat-timeout=1m \
  -etcd.endpoints=10.120.119.1:2379 \
  -etcd.endpoints=10.120.119.2:2379 \
  -etcd.endpoints=10.120.119.3:2379 \
  -etcd.endpoints=10.120.119.4:2379 \
  -etcd.endpoints=10.120.119.5:2379 \
  \
  -ruler.enable-sharding=true \
  -ruler.sharding-strategy=shuffle-sharding \
  -ruler.tenant-shard-size=2 \
  -ruler.ring.store=etcd \
  -ruler.ring.prefix=cortex-rulers/ \
  -ruler.ring.num-tokens=32 \
  -ruler.ring.heartbeat-period=15s \
  -ruler.ring.heartbeat-timeout=1m \
  -ruler.ring.etcd.endpoints=10.120.119.1:2379 \
  -ruler.ring.etcd.endpoints=10.120.119.2:2379 \
  -ruler.ring.etcd.endpoints=10.120.119.3:2379 \
  -ruler.ring.etcd.endpoints=10.120.119.4:2379 \
  -ruler.ring.etcd.endpoints=10.120.119.5:2379 \
  -ruler.ring.instance-id=s_8061 \
  -ruler.ring.instance-interface-names=e1 \
  \
  -ruler.max-rules-per-rule-group=500 \
  -ruler.max-rule-groups-per-tenant=5000 \
  \
  -ruler.external.url=staging-cortex-ruler.org \
  -ruler.client.grpc-max-recv-msg-size=104857600 \
  -ruler.client.grpc-max-send-msg-size=16777216 \
  -ruler.client.grpc-compression= \
  -ruler.client.grpc-client-rate-limit=0 \
  -ruler.client.grpc-client-rate-limit-burst=0 \
  -ruler.client.backoff-on-ratelimits=false \
  -ruler.client.backoff-min-period=500ms \
  -ruler.client.backoff-max-period=10s \
  -ruler.client.backoff-retries=5 \
  -ruler.evaluation-interval=15s \
  -ruler.poll-interval=15s \
  -ruler.rule-path=/local/cortex-ruler/rules \
  -ruler.alertmanager-url=https://staging-cortex-alertmanager.org/alertmanager \
  -ruler.alertmanager-discovery=false \
  -ruler.alertmanager-refresh-interval=1m \
  -ruler.notification-queue-capacity=10000 \
  -ruler.notification-timeout=10s \
  -ruler.flush-period=1m \
  -experimental.ruler.enable-api=true \
  \
  -ruler-storage.backend=s3 \
  -ruler-storage.s3.endpoint=s3.org:10444 \
  -ruler-storage.s3.bucket-name=staging-rules \
  -ruler-storage.s3.insecure=false \
  \
  -target=ruler

The systemd ExecStart for the Cortex alertmanager:

ExecStart=/usr/sbin/cortex-1.17.1 \
  -auth.enabled=true \
  -log.level=info \
  -config.file=/etc/cortex-alertmanager-8071/cortex-alertmanager.yaml \
  -runtime-config.file=/etc/cortex-shared/cortex-runtime.yaml \
  -server.http-listen-port=8071 \
  -server.grpc-listen-port=9071 \
  -server.grpc-max-recv-msg-size-bytes=104857600 \
  -server.grpc-max-send-msg-size-bytes=104857600 \
  -server.grpc-max-concurrent-streams=1000 \
  \
  -alertmanager.storage.path=/local/cortex-alertmanager-8071/data \
  -alertmanager.storage.retention=120h \
  -alertmanager.web.external-url=https://staging-cortex-alertmanager.org/alertmanager \
  -alertmanager.configs.poll-interval=1m \
  -experimental.alertmanager.enable-api=true \
  \
  -alertmanager.sharding-enabled=true \
  -alertmanager.sharding-ring.store=etcd \
  -alertmanager.sharding-ring.prefix=cortex-alertmanagers/ \
  -alertmanager.sharding-ring.heartbeat-period=15s \
  -alertmanager.sharding-ring.heartbeat-timeout=1m \
  -alertmanager.sharding-ring.etcd.endpoints=10.120.121.1:2379 \
  -alertmanager.sharding-ring.etcd.endpoints=10.120.121.2:2379 \
  -alertmanager.sharding-ring.etcd.endpoints=10.120.121.3:2379 \
  -alertmanager.sharding-ring.etcd.endpoints=10.120.121.4:2379 \
  -alertmanager.sharding-ring.etcd.endpoints=10.120.121.5:2379 \
  -alertmanager.sharding-ring.instance-id=b_8071 \
  -alertmanager.sharding-ring.instance-interface-names=e1 \
  -alertmanager.sharding-ring.replication-factor=2 \
  -alertmanager.sharding-ring.zone-awareness-enabled=true \
  -alertmanager.sharding-ring.instance-availability-zone=t1 \
  \
  -alertmanager-storage.backend=s3 \
  -alertmanager-storage.s3.endpoint=s3.org:10444 \
  -alertmanager-storage.s3.bucket-name=staging-alerts \
  -alertmanager-storage.s3.insecure=false \
  \
  -alertmanager.receivers-firewall-block-cidr-networks=10.163.131.164/28,10.163.131.180/28 \
  -alertmanager.receivers-firewall-block-private-addresses=true \
  -alertmanager.notification-rate-limit=0 \
  -alertmanager.max-config-size-bytes=0 \
  -alertmanager.max-templates-count=0 \
  -alertmanager.max-template-size-bytes=0 \
  \
  -target=alertmanager

The Alertmanager configuration:

template_files:
  default_template: |
    {{ define "__alertmanager" }}AlertManager{{ end }}
    {{ define "__alertmanagerURL" }}{{ .ExternalURL }}/#/alerts?receiver={{ .Receiver | urlquery }}{{ end }}
alertmanager_config: |
  global:
    smtp_smarthost: 'yourmailhost'
    smtp_from: 'youraddress'
    smtp_require_tls: false
  templates:
    - 'default_template'
  route:
    receiver: example-email
  receivers:
    - name: example-email
      email_configs:
      - to: 'youraddress'

mousimin avatar Jul 02 '24 09:07 mousimin

Hi @friedrichg & @yeya24, I guess the error message "bad response status 422 Unprocessable Entity" came from the alertmanager, right? But I couldn't find any error log from the alertmanager even with the debug log level; any suggestion would be appreciated!

mousimin avatar Jul 12 '24 01:07 mousimin

I want to answer my own question so that others can refer to it. I manually sent the HTTP request with curl and got the detailed response from the alertmanager:

maxFailure (quorum) on a given error family, rpc error: code = Code(422) desc = addr=10.120.131.81:9071 state=ACTIVE zone=z1, rpc error: code = Code(422) desc = {"code":601,"message":"0.generatorURL in body must be of type uri: \"staging-cortex-ruler.org/graph?g0.expr=up%7Bapp%3D%22cert-manager%22%7D+%3E+0\u0026g0.tab=1\""}

The generatorURL is rejected because it has no URI scheme, so I added the scheme "https://" at the beginning of the -ruler.external.url value and then it worked.
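For anyone hitting the same error, the change in the ruler unit above amounts to this one flag (only the scheme is new; the hostname is the one already used in the unit file):

  -ruler.external.url=https://staging-cortex-ruler.org \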

Map this to the code:

func (n *Manager) sendOne(ctx context.Context, c *http.Client, url string, b []byte) error {
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(b))
	if err != nil {
		return err
	}
	req.Header.Set("User-Agent", userAgent)
	req.Header.Set("Content-Type", contentTypeJSON)
	resp, err := n.opts.Do(ctx, c, req)
	if err != nil {
		return err
	}
	defer func() {
		io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
	}()
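	// Note: resp.Body is drained above only so the connection can be reused;
	// its contents never make it into the error returned below.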

	// Any HTTP status 2xx is OK.
	//nolint:usestdlibvars
	if resp.StatusCode/100 != 2 {
		return fmt.Errorf("bad response status %s", resp.Status)
	}

	return nil
}

Maybe we should include the response body in the error message as well? Currently we only include the status, which makes debugging hard.

mousimin avatar Jul 19 '24 12:07 mousimin

@friedrichg @yeya24 should we go ahead and start logging the body of the response? It makes sense IMHO.

rapphil avatar Jul 25 '24 18:07 rapphil

@rapphil Agree. Would you like to work on it? I just want to make sure the AM doesn't send something crazy in the response body; maybe we can truncate the message with a limit.
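A minimal sketch of that combination, assuming a fixed cap; the helper checkResponse, the 512-byte limit, and the example URL are hypothetical and not actual Cortex or Prometheus code:

package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

// maxBodyInError caps how much of the Alertmanager response body is copied
// into the returned error, so a misbehaving server cannot flood the logs.
const maxBodyInError = 512

// checkResponse mirrors the non-2xx branch of sendOne above, but also
// surfaces a truncated copy of the response body in the error.
func checkResponse(resp *http.Response) error {
	defer func() {
		io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
	}()

	// Any HTTP status 2xx is OK.
	if resp.StatusCode/100 == 2 {
		return nil
	}

	// Read at most maxBodyInError bytes so details like the generatorURL
	// validation message above show up alongside the bare status.
	body, _ := io.ReadAll(io.LimitReader(resp.Body, maxBodyInError))
	return fmt.Errorf("bad response status %s: %s", resp.Status, strings.TrimSpace(string(body)))
}

func main() {
	// Illustrative call only; the URL mirrors the Alertmanager port used above.
	resp, err := http.Post("http://localhost:8071/alertmanager/api/v2/alerts", "application/json", strings.NewReader("[]"))
	if err != nil {
		fmt.Println(err)
		return
	}
	if err := checkResponse(resp); err != nil {
		fmt.Println(err)
	}
}

With auth.enabled=true, a request without an org ID would come back as a non-2xx response, and the error printed here would then include whatever body the Alertmanager returned, truncated to maxBodyInError bytes.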

yeya24 avatar Jul 25 '24 19:07 yeya24

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Apr 26 '25 18:04 stale[bot]