IPMI Monitoring Data Interruption Issue
Question
Title: IPMI Monitoring Data Interruption Issue
Description: After successfully configuring and establishing a normal monitoring connection for the physical machine via IPMI, we encountered an issue where data collection is interrupted after a period of time.
- The network connection remains stable, with no apparent abnormalities.
- We can successfully retrieve IPMI information using commands directly on the Hertzbeat machine.
- However, when clicking on "Edit Test," the connection fails with a timeout error.
This suggests there may be an underlying issue with the IPMI integration or Hertzbeat's ability to maintain the connection over time.
Please help. Thanks.
2025-03-30 17:16:19.654 [metrics-task-timeout-monitor-0] ERROR org.apache.hertzbeat.collector.dispatch.CommonDispatcher Line:168 - [Collect Timeout]: id: 494961833735936 app: "ipmi" metrics: "Chassis" time: 1743326179654 code: TIMEOUT msg: "collect timeout"
2025-03-30 17:20:39.656 [metrics-task-timeout-monitor-0] ERROR org.apache.hertzbeat.collector.dispatch.CommonDispatcher Line:168 - [Collect Timeout]: id: 494961833735936 app: "ipmi" metrics: "Chassis" time: 1743326439656 code: TIMEOUT msg: "collect timeout"
2025-03-30 17:24:59.658 [metrics-task-timeout-monitor-0] ERROR org.apache.hertzbeat.collector.dispatch.CommonDispatcher Line:168 - [Collect Timeout]: id: 494961833735936 app: "ipmi" metrics: "Chassis" time: 1743326699658 code: TIMEOUT msg: "collect timeout"
2025-03-30 17:29:19.660 [metrics-task-timeout-monitor-0] ERROR org.apache.hertzbeat.collector.dispatch.CommonDispatcher Line:168 - [Collect Timeout]: id: 494961833735936 app: "ipmi" metrics: "Chassis" time: 1743326959660 code: TIMEOUT msg: "collect timeout"
2025-03-30 17:33:39.663 [metrics-task-timeout-monitor-0] ERROR org.apache.hertzbeat.collector.dispatch.CommonDispatcher Line:168 - [Collect Timeout]: id: 494961833735936 app: "ipmi" metrics: "Chassis" time: 1743327219663 code: TIMEOUT msg: "collect timeout"
Hi @sdlwdong, is there more log information? The current logs don't seem to show what the problem is.
Hi @gjjjj0101, please help take a look if you have time, thanks.
@gjjjj0101 Hello, which service's log do you need to see? Please guide me. Thank you!
I have located the problem. When the network path between the collector and the target machine fails, the NIO datagramChannel.receive() call used in the collector does not throw a network timeout exception, so it is the manager's collection that times out instead. As a result, the status still shows as up and the collection time shown is still that of an earlier successful collection.
So this is a bug. I am still designing a fix; if you have good suggestions, please share them with me.
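To make the failure mode concrete, here is a minimal sketch, not HertzBeat's actual collector code. It assumes a blocking DatagramChannel bound to a loopback port that nothing sends to; the class and method names (BlockingReceiveDemo, receivedWithin) are made up for illustration. The receive() call never gives up on its own, so only the outer wait on the Future notices that no reply arrived, much like the timeout monitor in the log above.

```java
import java.net.InetSocketAddress;
import java.net.SocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.DatagramChannel;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical demo class; it only illustrates the blocking behaviour.
final class BlockingReceiveDemo {

    /** Returns true if a datagram arrived within waitMillis, false if receive() was still blocked. */
    static boolean receivedWithin(long waitMillis) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try (DatagramChannel channel = DatagramChannel.open()) {
            // Ephemeral loopback port: nothing is expected to send to it.
            channel.bind(new InetSocketAddress("127.0.0.1", 0));
            ByteBuffer buffer = ByteBuffer.allocate(1024);

            // Blocking receive(): waits until a datagram arrives, with no timeout of its own.
            Callable<SocketAddress> task = () -> channel.receive(buffer);
            Future<SocketAddress> pending = pool.submit(task);
            try {
                pending.get(waitMillis, TimeUnit.MILLISECONDS);
                return true;
            } catch (TimeoutException e) {
                // The outer wait expired; receive() itself never reported a timeout.
                // Interrupting the worker thread closes the channel and unblocks it.
                pending.cancel(true);
                return false;
            }
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("datagram received: " + receivedWithin(1000));
    }
}
```

As long as nothing sends to the bound port, the method should return false only because the caller stopped waiting, which mirrors why the error surfaces as a generic collect timeout rather than as a network exception.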
The solutions can be:
1. Network Configuration Check:
    # Verify network connectivity to the BMC with a continuous ping test
    # (-t is the Windows flag; Linux ping runs continuously by default)
    ping <BMC_IP> -t
    # Check for packet loss
    mtr --report <BMC_IP>
2. IPMI Tool Validation:
    # Test raw IPMI connectivity during failure periods
    ipmitool -H <BMC_IP> -U <username> -P <password> -I lanplus chassis status
3. Hertzbeat Configuration Adjustments:
Increase timeout settings in hertzbeat.yml:
    collector:
      dispatch:
        timeout: 30000
Since no exception like a timeout is thrown (as it’s using UDP), how about manually setting a specific timeout? If there’s no response within a certain period, we could treat it as a failed request.
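One way to sketch this manual timeout, assuming the channel can be switched to non-blocking mode, is to wait on a Selector with a deadline and report a failure when nothing arrives. This is an illustration only, not the collector's real code; the class name UdpReceiveWithTimeout and the method receiveOrFail are hypothetical.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.SocketAddress;
import java.net.SocketTimeoutException;
import java.nio.ByteBuffer;
import java.nio.channels.DatagramChannel;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

// Hypothetical helper; names are illustrative and not part of HertzBeat.
final class UdpReceiveWithTimeout {

    /**
     * Waits up to timeoutMillis for one datagram and returns the sender's address.
     * Throws SocketTimeoutException if nothing arrives, so the caller can treat
     * the request as failed instead of waiting forever.
     */
    static SocketAddress receiveOrFail(DatagramChannel channel, ByteBuffer buffer, long timeoutMillis)
            throws IOException {
        if (timeoutMillis <= 0) {
            throw new IllegalArgumentException("timeoutMillis must be positive");
        }
        channel.configureBlocking(false); // required before registering with a Selector
        try (Selector selector = Selector.open()) {
            channel.register(selector, SelectionKey.OP_READ);
            long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
            while (true) {
                long remainingMillis = (deadline - System.nanoTime()) / 1_000_000L;
                if (remainingMillis <= 0) {
                    throw new SocketTimeoutException("no reply within " + timeoutMillis + " ms");
                }
                selector.select(remainingMillis); // returns early when a datagram is readable
                selector.selectedKeys().clear();
                SocketAddress sender = channel.receive(buffer); // null if nothing was queued
                if (sender != null) {
                    return sender;
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        try (DatagramChannel channel = DatagramChannel.open()) {
            channel.bind(new InetSocketAddress("127.0.0.1", 0));
            try {
                receiveOrFail(channel, ByteBuffer.allocate(1024), 500);
                System.out.println("reply received");
            } catch (SocketTimeoutException e) {
                System.out.println("request failed: " + e.getMessage());
            }
        }
    }
}
```

The timeout value would need to be chosen per deployment; a failed request reported this way could then mark the monitor as down instead of leaving the status unchanged.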
Encountered the same problem on V1.70.