hertzbeat
[BUG] v1.3.2 alarm convergence works well, but needs further adjustment to avoid receiving repeated alarms
Is there an existing issue for this?
- [X] I have searched the existing issues
Current Behavior
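1. Suppose the alert records are as follows:
- 1:01 Monitoring availability, critical alert: api monitoring availability alert, code is UN_CONNECTABLE
- 1:02 Monitoring availability, warning alert: availability alert recovery notification, monitoring status has returned to normal
- 1:03 Monitoring availability, critical alert: api monitoring availability alert, code is UN_CONNECTABLE
- 1:04 Monitoring availability, critical alert: api monitoring availability alert, code is UN_CONNECTABLE
- 1:05 Monitoring availability, critical alert: api monitoring availability alert, code is UN_CONNECTABLE
- 1:06 Monitoring availability, critical alert: api monitoring availability alert, code is UN_CONNECTABLE
- 1:07 Monitoring availability, critical alert: api monitoring availability alert, code is UN_CONNECTABLE
- 1:08 Monitoring availability, critical alert: api monitoring availability alert, code is UN_CONNECTABLE
- 1:09 Monitoring availability, critical alert: api monitoring availability alert, code is UN_CONNECTABLE
- 1:10 Monitoring availability, critical alert: api monitoring availability alert, code is UN_CONNECTABLE
- 1:11 Monitoring availability, critical alert: api monitoring availability alert, code is UN_CONNECTABLE
- 1:12 Monitoring availability, critical alert: api monitoring availability alert, code is UN_CONNECTABLE
- 1:13 Monitoring availability, critical alert: api monitoring availability alert, code is UN_CONNECTABLE
- 1:14 Monitoring availability, critical alert: api monitoring availability alert, code is UN_CONNECTABLE
- 1:15 Monitoring availability, critical alert: api monitoring availability alert, code is UN_CONNECTABLE
- 1:16 Monitoring availability, warning alert: availability alert recovery notification, monitoring status has returned to normal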
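2. With alarm convergence configured at 5 minutes, the result is not good enough. The notifications received are:
- 1:01 Monitoring availability, critical alert: api monitoring availability alert, code is UN_CONNECTABLE
- 1:02 Monitoring availability, warning alert: availability alert recovery notification, monitoring status has returned to normal
- 1:03 Monitoring availability, critical alert: api monitoring availability alert, code is UN_CONNECTABLE
- 1:08 Monitoring availability, critical alert: api monitoring availability alert, code is UN_CONNECTABLE
- 1:13 Monitoring availability, critical alert: api monitoring availability alert, code is UN_CONNECTABLE
- 1:16 Monitoring availability, warning alert: availability alert recovery notification, monitoring status has returned to normal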
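The timeline above can be reproduced with a small model of time-window convergence. This is only a sketch under assumptions, not HertzBeat's implementation: the function name `converge`, the `"DOWN"`/`"UP"` status labels, and the rule that a recovery resets the window are all hypothetical choices made so that the output matches the 5-minute result listed in this report.

```python
def converge(events, window_minutes):
    """Suppress repeated DOWN alerts inside a time window (hypothetical model).

    events: list of (minute, status) tuples, status is "DOWN" or "UP".
    Returns the subset of events that would be sent as notifications.
    """
    sent = []
    last_down_sent = None  # minute of the last DOWN notification actually sent
    for minute, status in events:
        if status == "UP":
            # Assumption: recovery is always sent and resets the window.
            sent.append((minute, status))
            last_down_sent = None
        elif last_down_sent is None or minute - last_down_sent >= window_minutes:
            sent.append((minute, status))
            last_down_sent = minute
    return sent


# Minutes past 1:00, following the alert records in this report.
timeline = [(1, "DOWN"), (2, "UP")] + [(m, "DOWN") for m in range(3, 16)] + [(16, "UP")]
print(converge(timeline, 5))
```

Under these assumptions the same DOWN alert is re-sent at 1:08 and 1:13 even though nothing has changed, which is the repeated notification this report complains about.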
Expected Behavior
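3. Alarm noise reduction: if it were possible to configure a check on whether an alert is the same as the previous one, and skip the notification when it is, the result would be:
- 1:01 Monitoring availability, critical alert: api monitoring availability alert, code is UN_CONNECTABLE
- 1:02 Monitoring availability, warning alert: availability alert recovery notification, monitoring status has returned to normal
- 1:03 Monitoring availability, critical alert: api monitoring availability alert, code is UN_CONNECTABLE
- 1:16 Monitoring availability, warning alert: availability alert recovery notification, monitoring status has returned to normal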
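The expected behavior amounts to notifying only on a state change. A minimal sketch, assuming alerts can be reduced to a `"DOWN"`/`"UP"` status per monitor; the function name `notify_on_change` and the status labels are hypothetical and not part of HertzBeat:

```python
def notify_on_change(events):
    """Send a notification only when the status differs from the previous one.

    events: list of (minute, status) tuples, status is "DOWN" or "UP".
    Returns the subset of events that would be sent as notifications.
    """
    sent = []
    last_status = None  # no previous status known, so the first event is sent
    for minute, status in events:
        if status != last_status:
            sent.append((minute, status))
            last_status = status
    return sent


# Minutes past 1:00, following the alert records in this report.
timeline = [(1, "DOWN"), (2, "UP")] + [(m, "DOWN") for m in range(3, 16)] + [(16, "UP")]
print(notify_on_change(timeline))
```

With this rule the thirteen consecutive DOWN alerts from 1:03 to 1:15 collapse into a single notification, and the recovery at 1:02 is still delivered.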
Steps To Reproduce
4. Suggestion: add an alarm noise reduction feature that handles alert notifications more precisely and avoids repeated alarms.
Environment
HertzBeat version(v1.3.2):
Debug logs
none
Anything else?
none
Alarm convergence here means converging and deduplicating identical repeated alarms within the specified time window. I see you set it to 5 minutes; you can set a longer period.
Setting it longer does not help: the recovery alarm in the middle will be ignored.
For example, set it to more than 10 minutes.
For example, if it is set to more than 15 minutes, the result becomes the following, and the recovery in the middle is no longer visible:
- 1:01 Monitoring availability, critical alert: api monitoring availability alert, code is UN_CONNECTABLE
- 1:16 Monitoring availability, warning alert: availability alert recovery notification, monitoring status has returned to normal
It also causes another problem: once the aggregation window is exceeded, the same alarm is sent again, and the cycle repeats indefinitely.
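From this perspective, the current alarm convergence is not friendly enough, so I suggest the following:
1. Monitor item optimization:
-->✅ [Existing feature] Request timeout setting, already available; failed req…
-->🔴 [Not yet available] Retry count, allowing multiple retries to be configured, to reduce alarm flapping
2. Alarm noise reduction:
-->✅ [Existing feature] Alarm aggregation. It currently aggregates by time window, that is, alarms within the same window are merged into one, which can reduce alarm flapping. Problems: recovery alarms and related alarms in the middle are ignored, and once the aggregation window is exceeded the same alarm is repeated, which is still unfriendly to the people handling alarms.
-->🔴 [Not yet available] Alarm noise reduction: alert only when the state actually changes.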
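The retry-count suggestion above could look roughly like the following. This is a sketch, not an existing HertzBeat option: `check_with_retries`, `probe`, and `retries` are hypothetical names, and a real collector would usually also wait between attempts.

```python
from typing import Callable


def check_with_retries(probe: Callable[[], bool], retries: int) -> bool:
    """Report the target as available if the first attempt or any retry succeeds.

    probe: zero-argument callable returning True when the target is reachable.
    retries: number of extra attempts after the first failure (hypothetical setting).
    """
    for _ in range(retries + 1):
        if probe():
            return True
    return False


# A transient failure followed by a success does not count as an outage.
results = iter([False, True])
print(check_with_retries(lambda: next(results), retries=2))
```

Only a run of `retries + 1` consecutive failures would raise an alert, so a single dropped request no longer produces an alert and recovery pair.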
I can work on this
Yes, thanks for the suggestion. I think we need to rethink the design of this part, or look at how other platforms design it.
For repeated, persistent alarms, some users do not want a notification every time, since that leads to frequent alerts. They want it converged to one notification every 4 hours: if the alarm is still present after 4 hours, send it again, rather than never sending it again.
For reference, see Huawei Cloud:
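One way to combine both requests in this thread, notifying on state change and repeating a persistent alarm at a fixed interval, is sketched below. It is an illustration under assumptions only: `notify_with_repeat`, `repeat_minutes`, and the `"DOWN"`/`"UP"` labels are hypothetical, and it does not describe how HertzBeat or Huawei Cloud actually implement this.

```python
def notify_with_repeat(events, repeat_minutes):
    """Notify on state change, and re-send a persistent DOWN alert periodically.

    events: list of (minute, status) tuples, status is "DOWN" or "UP".
    repeat_minutes: minimum gap before an unchanged DOWN alert is sent again.
    """
    sent = []
    last_status = None
    last_sent_at = None  # minute of the most recent notification
    for minute, status in events:
        changed = status != last_status
        repeat_due = (
            status == "DOWN"
            and last_sent_at is not None
            and minute - last_sent_at >= repeat_minutes
        )
        if changed or repeat_due:
            sent.append((minute, status))
            last_sent_at = minute
        last_status = status
    return sent


# One DOWN check per hour for ten hours, then a recovery; repeat every 4 hours.
hourly = [(m, "DOWN") for m in range(0, 600, 60)] + [(600, "UP")]
print(notify_with_repeat(hourly, 240))
```

In this example the persistent outage produces notifications at 0, 240, and 480 minutes, followed by the recovery at 600 minutes.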