[BUG] <title>v1.3.2 alarm convergence function is very good, but further adjustment is needed to avoid receiving repeated alarms

Open macaty opened this issue 1 year ago • 11 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Current Behavior

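1. Suppose the raw alarm records are:
1:01 Monitor availability, critical alarm: api monitoring availability alert, code is UN_CONNECTABLE
1:02 Monitor availability, warning alarm: availability alarm recovery notification, monitoring status has returned to normal
1:03 Monitor availability, critical alarm: api monitoring availability alert, code is UN_CONNECTABLE
1:04 Monitor availability, critical alarm: api monitoring availability alert, code is UN_CONNECTABLE
1:05 Monitor availability, critical alarm: api monitoring availability alert, code is UN_CONNECTABLE
1:06 Monitor availability, critical alarm: api monitoring availability alert, code is UN_CONNECTABLE
1:07 Monitor availability, critical alarm: api monitoring availability alert, code is UN_CONNECTABLE
1:08 Monitor availability, critical alarm: api monitoring availability alert, code is UN_CONNECTABLE
1:09 Monitor availability, critical alarm: api monitoring availability alert, code is UN_CONNECTABLE
1:10 Monitor availability, critical alarm: api monitoring availability alert, code is UN_CONNECTABLE
1:11 Monitor availability, critical alarm: api monitoring availability alert, code is UN_CONNECTABLE
1:12 Monitor availability, critical alarm: api monitoring availability alert, code is UN_CONNECTABLE
1:13 Monitor availability, critical alarm: api monitoring availability alert, code is UN_CONNECTABLE
1:14 Monitor availability, critical alarm: api monitoring availability alert, code is UN_CONNECTABLE
1:15 Monitor availability, critical alarm: api monitoring availability alert, code is UN_CONNECTABLE
1:16 Monitor availability, warning alarm: availability alarm recovery notification, monitoring status has returned to normal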

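2. With a 5-minute convergence period configured in the alarm convergence feature, the result is not good. These notifications are still sent:
1:01 Monitor availability, critical alarm: api monitoring availability alert, code is UN_CONNECTABLE
1:02 Monitor availability, warning alarm: availability alarm recovery notification, monitoring status has returned to normal
1:03 Monitor availability, critical alarm: api monitoring availability alert, code is UN_CONNECTABLE
1:08 Monitor availability, critical alarm: api monitoring availability alert, code is UN_CONNECTABLE
1:13 Monitor availability, critical alarm: api monitoring availability alert, code is UN_CONNECTABLE
1:16 Monitor availability, warning alarm: availability alarm recovery notification, monitoring status has returned to normal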
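
The timeline above can be reproduced with a toy model. This is only a sketch under an assumption, not HertzBeat's actual code: it assumes convergence suppresses an alarm only when it repeats the last notification that was sent and arrives within the configured window. The names `Alarm`, `timeline`, and `converge` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    minute: int  # minutes past 1:00, so 3 means 1:03
    state: str   # "DOWN" (UN_CONNECTABLE) or "UP" (recovery notification)


def timeline():
    """Raw records from the report: DOWN at 1:01, UP at 1:02, DOWN 1:03-1:15, UP at 1:16."""
    states = {1: "DOWN", 2: "UP", 16: "UP"}
    return [Alarm(m, states.get(m, "DOWN")) for m in range(1, 17)]


def converge(alarms, window_minutes):
    """Assumed model: drop an alarm only if it repeats the last sent state within the window."""
    sent = []
    last = None  # last alarm that was actually sent
    for alarm in alarms:
        is_repeat = last is not None and alarm.state == last.state
        if is_repeat and alarm.minute - last.minute < window_minutes:
            continue  # same state, still inside the window: suppressed
        sent.append(alarm)
        last = alarm
    return sent


print([a.minute for a in converge(timeline(), 5)])  # prints [1, 2, 3, 8, 13, 16]
```

Under this assumption, the identical alarm is re-sent each time the window expires (1:08, 1:13), which is the repetition the report complains about.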

Expected Behavior

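3. Alarm noise reduction: if it could be configured to check whether an alarm is the same as the previous one and, if so, skip the notification, the result would be:
1:01 Monitor availability, critical alarm: api monitoring availability alert, code is UN_CONNECTABLE
1:02 Monitor availability, warning alarm: availability alarm recovery notification, monitoring status has returned to normal
1:03 Monitor availability, critical alarm: api monitoring availability alert, code is UN_CONNECTABLE
1:16 Monitor availability, warning alarm: availability alarm recovery notification, monitoring status has returned to normal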
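
A minimal sketch of the requested behavior, assuming each record can be reduced to a timestamp and a state label. The function name `notify_on_change` and the `"DOWN"`/`"UP"` labels are hypothetical and not part of HertzBeat.

```python
def notify_on_change(records):
    """Keep a record only when its state differs from the previous record's state."""
    sent = []
    previous_state = None
    for timestamp, state in records:
        if state != previous_state:
            sent.append((timestamp, state))
        previous_state = state
    return sent


# Same raw timeline as in the report: DOWN, UP, DOWN from 1:03 to 1:15, UP.
records = (
    [("1:01", "DOWN"), ("1:02", "UP")]
    + [(f"1:{m:02d}", "DOWN") for m in range(3, 16)]
    + [("1:16", "UP")]
)

print(notify_on_change(records))
# prints [('1:01', 'DOWN'), ('1:02', 'UP'), ('1:03', 'DOWN'), ('1:16', 'UP')]
```

Because the decision depends only on a state transition and not on elapsed time, the recovery at 1:02 is kept and the repeats between 1:04 and 1:15 are dropped.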

Steps To Reproduce

4. Suggestion: add an alarm noise reduction feature that handles alarm notifications more precisely and avoids repeated alarms.

Environment

HertzBeat version(v1.3.2):

Debug logs

none

Anything else?

none

macaty avatar Jul 10 '23 10:07 macaty

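Alarm convergence here means that identical repeated alarms within the specified time window are converged and deduplicated. I see you set it to 5 minutes; you could set a longer period.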

tomsun28 avatar Jul 10 '23 11:07 tomsun28

Alarm convergence here means that identical repeated alarms within the specified time window are converged and deduplicated. I see you set it to 5 minutes; you could set a longer period.

Setting it longer does not help, because a recovery that happens in between would then be ignored.

hertzbeat avatar Jul 11 '23 00:07 hertzbeat

For example, set it to more than 10 minutes.

hertzbeat avatar Jul 11 '23 00:07 hertzbeat

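Alarm convergence here means that identical repeated alarms within the specified time window are converged and deduplicated. I see you set it to 5 minutes; you could set a longer period.

For example, if it is set to more than 15 minutes, the result becomes the following, and the recovery in between is no longer visible:
1:01 Monitor availability, critical alarm: api monitoring availability alert, code is UN_CONNECTABLE
1:16 Monitor availability, warning alarm: availability alarm recovery notification, monitoring status has returned to normal

It also causes another problem: once the aggregation window is exceeded, the same alarm is still sent again, and the cycle repeats indefinitely.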

macaty avatar Jul 11 '23 00:07 macaty

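image image

From this perspective, the current alarm convergence is not friendly enough. My suggestions are:

1. Monitor item optimization:
--> ✅ [Existing feature] Request timeout setting: already available.
--> 🔴 [Not available yet] Retry count for failed requests: allow several retries to be configured, to reduce alarm flapping.

2. Alarm noise reduction:
--> ✅ [Existing feature] Alarm aggregation: alarms are currently aggregated by time window, so alarms within the same window are merged into one, which can reduce alarm flapping. Problems: recovery alarms and related alarms in between are ignored, and once the aggregation window is exceeded the repeated alarm still appears, which is still not friendly to the people handling alarms.
--> 🔴 [Not available yet] Alarm noise reduction: send an alarm only when the state really changes.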
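
The retry suggestion can be sketched as follows. This is an illustration under assumptions, not an existing HertzBeat option: `is_down`, `retries`, and the `flaky` helper are hypothetical names, and `check` stands for any probe that returns `True` on success.

```python
def is_down(check, retries):
    """Report 'down' only if the first attempt and every retry all fail."""
    for _ in range(retries + 1):
        if check():
            return False  # one success is enough to treat the target as up
    return True


def flaky(results):
    """Hypothetical probe that replays a fixed sequence of check results."""
    iterator = iter(results)
    return lambda: next(iterator)


# A single transient failure raises an alarm without retries...
print(is_down(flaky([False, True]), retries=0))  # prints True
# ...but not when up to two retries are allowed.
print(is_down(flaky([False, True]), retries=2))  # prints False
```

Retries of this kind would filter out one-off failures such as the 1:01 alarm that recovers at 1:02, at the cost of detecting a real outage slightly later.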

macaty avatar Jul 11 '23 00:07 macaty

I can work on this

l646505418 avatar Jul 11 '23 14:07 l646505418

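For example, if it is set to more than 15 minutes, the result becomes the following, and the recovery in between is no longer visible:
1:01 Monitor availability, critical alarm: api monitoring availability alert, code is UN_CONNECTABLE
1:16 Monitor availability, warning alarm: availability alarm recovery notification, monitoring status has returned to normal

It also causes another problem: once the aggregation window is exceeded, the same alarm is still sent again, and the cycle repeats indefinitely.

Yes, thanks for the suggestion. I feel we need to rethink the design of this part, or look at how other platforms handle it. For repeated, persistent alarms, some users do not want a notification every time, since that causes frequent alerts. They want it converged to one notification every 4 hours: if the alarm is still present after 4 hours, it is sent again, rather than never being sent again. For reference, Huawei Cloud: image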
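
One way the two requirements could be combined is sketched below: notify on every state change, and while the target stays down, repeat the notification at a fixed interval. This is an assumption-laden illustration, not a design HertzBeat has adopted; `notify_with_repeat` and `repeat_minutes` are hypothetical names.

```python
def notify_with_repeat(records, repeat_minutes):
    """Send on each state change; while DOWN persists, resend every repeat_minutes."""
    sent = []
    last_state = None
    last_sent_minute = None
    for minute, state in records:
        changed = state != last_state
        due = last_sent_minute is not None and minute - last_sent_minute >= repeat_minutes
        if changed or (state == "DOWN" and due):
            sent.append((minute, state))
            last_sent_minute = minute
        last_state = state
    return sent


# Target is checked hourly, stays down for nine hours, then recovers at minute 600.
records = [(m, "DOWN") for m in range(0, 541, 60)] + [(600, "UP")]

print(notify_with_repeat(records, repeat_minutes=240))
# prints [(0, 'DOWN'), (240, 'DOWN'), (480, 'DOWN'), (600, 'UP')]
```

With a 4-hour interval the persistent alarm is repeated only at minutes 240 and 480, and the recovery is still delivered immediately because it is a state change.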

tomsun28 avatar Jul 11 '23 14:07 tomsun28
