IPMI Monitoring Data Interruption Issue

Open sdlwdong opened this issue 9 months ago • 9 comments

Question

Title: IPMI Monitoring Data Interruption Issue

Description: After successfully configuring and establishing a normal monitoring connection for the physical machine via IPMI, we encountered an issue where data collection is interrupted after a period of time.

  1. The network connection remains stable, with no apparent abnormalities.
  2. We can successfully retrieve IPMI information using commands directly on the Hertzbeat machine.
  3. However, when clicking on "Edit Test," the connection fails with a timeout error.

This suggests there may be an underlying issue with the IPMI integration or Hertzbeat's ability to maintain the connection over time.

please help me! thanks.

sdlwdong avatar Mar 30 '25 09:03 sdlwdong

2025-03-30 17:16:19.654 [metrics-task-timeout-monitor-0] ERROR org.apache.hertzbeat.collector.dispatch.CommonDispatcher Line:168 - [Collect Timeout]: id: 494961833735936 app: "ipmi" metrics: "Chassis" time: 1743326179654 code: TIMEOUT msg: "collect timeout"

2025-03-30 17:20:39.656 [metrics-task-timeout-monitor-0] ERROR org.apache.hertzbeat.collector.dispatch.CommonDispatcher Line:168 - [Collect Timeout]: id: 494961833735936 app: "ipmi" metrics: "Chassis" time: 1743326439656 code: TIMEOUT msg: "collect timeout"

2025-03-30 17:24:59.658 [metrics-task-timeout-monitor-0] ERROR org.apache.hertzbeat.collector.dispatch.CommonDispatcher Line:168 - [Collect Timeout]: id: 494961833735936 app: "ipmi" metrics: "Chassis" time: 1743326699658 code: TIMEOUT msg: "collect timeout"

2025-03-30 17:29:19.660 [metrics-task-timeout-monitor-0] ERROR org.apache.hertzbeat.collector.dispatch.CommonDispatcher Line:168 - [Collect Timeout]: id: 494961833735936 app: "ipmi" metrics: "Chassis" time: 1743326959660 code: TIMEOUT msg: "collect timeout"

2025-03-30 17:33:39.663 [metrics-task-timeout-monitor-0] ERROR org.apache.hertzbeat.collector.dispatch.CommonDispatcher Line:168 - [Collect Timeout]: id: 494961833735936 app: "ipmi" metrics: "Chassis" time: 1743327219663 code: TIMEOUT msg: "collect timeout"

sdlwdong avatar Mar 30 '25 09:03 sdlwdong

Hi @sdlwdong, is there more log information? The current logs don't seem to show what the problem is.

Hi @gjjjj0101, please take a look if you have time, thanks.

tomsun28 avatar Mar 31 '25 08:03 tomsun28

@gjjjj0101 Hello, which service's log do you need to see? Please guide me. Thank you!

sdlwdong avatar Apr 01 '25 06:04 sdlwdong

I have located the problem. When the network between the collector and the monitored machine has a problem, the NIO datagramChannel.receive() call used in the collector does not throw a network timeout exception, which causes the manager's collection to time out. As a result, the status still shows as up, and the collection time stays at an earlier successful collection time.
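
For illustration, here is a minimal Java sketch of why no timeout surfaces. It assumes a non-blocking channel; which mode HertzBeat's IPMI collector actually uses is not stated in this thread, and SilentReceiveDemo and receiveOnce are hypothetical names. In non-blocking mode receive() returns null when no datagram has arrived, and in blocking mode it simply keeps waiting, so neither path raises a timeout exception on its own.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.SocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.DatagramChannel;

// Hypothetical demo class, not HertzBeat code.
final class SilentReceiveDemo {

    /** Returns the sender's address, or null when no datagram has arrived. */
    static SocketAddress receiveOnce() throws IOException {
        try (DatagramChannel channel = DatagramChannel.open()) {
            // Bind to an ephemeral loopback port that nobody sends to.
            channel.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            channel.configureBlocking(false);
            ByteBuffer buffer = ByteBuffer.allocate(1024);
            // No exception here: a missing response is reported as null.
            return channel.receive(buffer);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("receive() returned: " + receiveOnce());
    }
}
```

Since a silent peer looks the same as "nothing yet", the caller has to decide how long to wait before giving up.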

gjjjj0101 avatar Apr 01 '25 06:04 gjjjj0101

So this is a bug. I am still working out how to fix it; if you have good suggestions, please share them with me.

gjjjj0101 avatar Apr 01 '25 06:04 gjjjj0101

Possible solutions:

1. Network Configuration Check:

# Verify network connectivity to the BMC (on Linux, ping runs until interrupted; on Windows, add -t)
ping <BMC_IP>
# Check for packet loss
mtr --report <BMC_IP>

2. IPMI Tool Validation:

# Test raw IPMI connectivity during failure periods
ipmitool -H <BMC_IP> -U <username> -P <password> -I lanplus chassis status

3. Hertzbeat Configuration Adjustments:

Increase timeout settings in hertzbeat.yml:

collector:
  dispatch:
    timeout: 30000

harshita2626 avatar Apr 01 '25 07:04 harshita2626

Since no exception like a timeout is thrown (as it’s using UDP), how about manually setting a specific timeout? If there’s no response within a certain period, we could treat it as a failed request.
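
A rough sketch of what such a manual timeout could look like in Java NIO, assuming a Selector is acceptable here. This is not HertzBeat's actual code; UdpReceiveWithTimeout, receiveWithTimeout, and the timeout values are hypothetical. The idea is to wait on the channel with Selector.select(timeout) and raise a SocketTimeoutException ourselves when nothing becomes readable in time.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.SocketTimeoutException;
import java.nio.ByteBuffer;
import java.nio.channels.DatagramChannel;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

// Hypothetical helper, not HertzBeat code.
final class UdpReceiveWithTimeout {

    /**
     * Waits up to timeoutMillis (must be > 0; 0 would block indefinitely)
     * for one datagram and returns the number of bytes received.
     * The channel is left in non-blocking mode.
     */
    static int receiveWithTimeout(DatagramChannel channel, ByteBuffer buffer,
                                  long timeoutMillis) throws IOException {
        try (Selector selector = Selector.open()) {
            channel.configureBlocking(false);
            channel.register(selector, SelectionKey.OP_READ);
            // select() returns 0 when no channel became readable in time.
            if (selector.select(timeoutMillis) == 0
                    || channel.receive(buffer) == null) {
                throw new SocketTimeoutException(
                        "no UDP response within " + timeoutMillis + " ms");
            }
            return buffer.position();
        }
    }

    public static void main(String[] args) throws IOException {
        try (DatagramChannel channel = DatagramChannel.open()) {
            // Nobody sends to this port, so the wait should time out.
            channel.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            try {
                receiveWithTimeout(channel, ByteBuffer.allocate(1024), 500);
            } catch (SocketTimeoutException e) {
                System.out.println("treated as failed request: " + e.getMessage());
            }
        }
    }
}
```

With an explicit exception, the caller can mark the collection as failed right away instead of relying on the dispatcher's overall timeout.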

JuJinPark avatar Apr 03 '25 05:04 JuJinPark

I ran into the same problem on V1.70.

lswadmin avatar Apr 09 '25 06:04 lswadmin