
[BUG] java.lang.OutOfMemoryError: GC overhead limit exceeded

Open leim opened this issue 1 year ago • 13 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Current Behavior

hertzbeat_oom.txt

Expected Behavior

The system should run normally.

Steps To Reproduce

The system runs normally at first, but after some time it crashes and the console can no longer be logged into. Checking the logs reveals the exception java.lang.OutOfMemoryError: GC overhead limit exceeded.

Environment

HertzBeat version(s): 1.4.4

Debug logs

hertzbeat_oom.txt

Anything else?

No response

leim avatar Apr 07 '24 01:04 leim

hi, thanks for the feedback. There may be a heap dump file in the logs directory; can you find and provide it?
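
If no dump was written automatically, the JVM can be told to produce one on OOM. Below is a minimal docker-compose sketch; the image name/tag, the JAVA_OPTS environment variable, and the /opt/hertzbeat/logs path are assumptions about your deployment, only the two HotSpot flags themselves are standard:

services:
  hertzbeat:
    image: tancloud/hertzbeat:v1.4.4   # illustrative image/tag, adjust to your setup
    environment:
      # standard HotSpot flags: dump the heap to a .hprof file on OutOfMemoryError
      JAVA_OPTS: "-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/opt/hertzbeat/logs"
    volumes:
      # mount the logs directory so the dump survives a container restart
      - ./logs:/opt/hertzbeat/logs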

tomsun28 avatar Apr 07 '24 01:04 tomsun28

hi, have you added nginx monitoring? Which monitors have you added? See https://github.com/dromara/hertzbeat/pull/1476. You can also upgrade HertzBeat to version 1.5.0 and try again.

tomsun28 avatar Apr 07 '24 02:04 tomsun28

  1. No, we do not have nginx monitoring.
  2. The heap dump file is 5.2GB and the tar file is 1.2GB; is there some way I can provide this file?
  3. I will try upgrading to 1.5.0 and see if the problem persists.

leim avatar Apr 07 '24 03:04 leim

The snapshot below shows the monitors we have added to hertzbeat.

(screenshot: list of monitors added to hertzbeat)

leim avatar Apr 07 '24 03:04 leim

The heap dump file is 5.2GB and the tar file is 1.2GB; is there some way I can provide this file?

hi, you can use https://cowtransfer.com/ to provide it, if possible.

tomsun28 avatar Apr 07 '24 06:04 tomsun28

make sure the hertzbeat-collector version and the hertzbeat version are the same.

2024-04-04 03:43:56 [netty-server-worker-3] INFO  org.dromara.hertzbeat.manager.scheduler.netty.process.HeartbeatProcessor - the collector xxxxx-collector is not online.
2024-04-04 03:43:08 [netty-server-worker-0] ERROR org.dromara.hertzbeat.common.util.ProtoJsonUtil - Failed parsing JSON source: JsonReader at line 15 column 11 path $.fields[2].name to Json
com.google.protobuf.InvalidProtocolBufferException: Failed parsing JSON source: JsonReader at line 15 column 11 path $.fields[2].name to Json
        at com.google.protobuf.util.JsonFormat$ParserImpl.merge(JsonFormat.java:1345)
        at com.google.protobuf.util.JsonFormat$Parser.merge(JsonFormat.java:477)
        at org.dromara.hertzbeat.common.util.ProtoJsonUtil.toProtobuf(ProtoJsonUtil.java:56)
        at org.dromara.hertzbeat.manager.scheduler.netty.process.CollectCyclicDataResponseProcessor.handle(CollectCyclicDataResponseProcessor.java:20)
        at org.dromara.hertzbeat.remoting.netty.NettyRemotingAbstract.processRequestMsg(NettyRemotingAbstract.java:73)
        at org.dromara.hertzbeat.remoting.netty.NettyRemotingAbstract.processReceiveMsg(NettyRemotingAbstract.java:59)
        at org.dromara.hertzbeat.remoting.netty.NettyRemotingServer$NettyServerHandler.channelRead0(NettyRemotingServer.java:192)
        at org.dromara.hertzbeat.remoting.netty.NettyRemotingServer$NettyServerHandler.channelRead0(NettyRemotingServer.java:182)
        at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
        at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:336)
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:323)
        at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:444)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:280)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:336)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:308)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
        at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:800)
        at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:499)
        at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:397)
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at java.base/java.lang.Thread.run(Unknown Source)
Caused by: com.google.gson.JsonParseException: Failed parsing JSON source: JsonReader at line 15 column 11 path $.fields[2].name to Json
        at com.google.gson.JsonParser.parseReader(JsonParser.java:89)
        at com.google.protobuf.util.JsonFormat$ParserImpl.merge(JsonFormat.java:1340)
        ... 41 common frames omitted
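
If you run with docker-compose, a minimal sketch of pinning both components to the same release looks like the following; the image names and tags are illustrative, use whatever your registry actually provides:

services:
  hertzbeat:
    image: tancloud/hertzbeat:v1.5.0             # server ...
  hertzbeat-collector:
    image: tancloud/hertzbeat-collector:v1.5.0   # ... and collector on the same tag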

tomsun28 avatar Apr 08 '24 01:04 tomsun28

  1. Yes, the versions were mismatched: we had collectors at v1.5.0 and hertzbeat at v1.4.4. Yesterday I upgraded hertzbeat to v1.5.0.
  2. I have uploaded the dump file here: https://cowtransfer.com/s/1d8595cd6b0b44. Click the link to view [ java_pid10.hprof.tar.gz ], or visit cowtransfer.com and enter the transfer code cx3cwz.

leim avatar Apr 08 '24 01:04 leim

After upgrading hertzbeat to v1.5.0, the error occurred again.

hertzbeat_oom_20240408.txt

The new dump file is here: https://cowtransfer.com/s/5de6df61a93648. Click the link to view [ java_pid10_0408.hprof.tar.gz ], or visit cowtransfer.com and enter the transfer code p88c52.

leim avatar Apr 08 '24 02:04 leim

Got it!

tomsun28 avatar Apr 08 '24 02:04 tomsun28

hi, how many monitors and cluster collectors have you added? We see lots of metrics data in memory.

Maybe you can use an external kafka queue instead of the default in-memory queue in application.yml:

common:
  queue:
    # memory or kafka
    type: memory
    # properties when queue type is kafka
    kafka:
      servers: 127.0.0.1:9092
      metrics-data-topic: async-metrics-data
      alerts-data-topic: async-alerts-data
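
For reference, the same block with the kafka queue switched on would look like this; it is a sketch reusing the default topic names shown above, so point servers at your own kafka cluster:

common:
  queue:
    # use an external kafka queue instead of the default in-memory one
    type: kafka
    # properties used when queue type is kafka
    kafka:
      servers: 127.0.0.1:9092
      metrics-data-topic: async-metrics-data
      alerts-data-topic: async-alerts-data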

tomsun28 avatar Apr 10 '24 07:04 tomsun28

We have one hertzbeat master node and nearly 20 edge collectors.

There is one edge collector named "az-10". I used this collector to take over all the tasks that used to run on the master node; after two days it is running well and no errors have occurred.

Next week I will try using the kafka queue instead of the in-memory queue and see if there are still any problems.

(screenshot: collector list, including the "az-10" edge collector)

leim avatar Apr 11 '24 08:04 leim

I started a separate edge node and migrated all the probe tasks that previously ran on the master node to it; the master node now only handles alerting. After about 12 days of stable operation, the GC overhead limit exceeded exception appeared again. I have now configured the kafka queue on the master node; the system has been running for 2 days and looks normal so far. We will keep watching the master node's status.

leim avatar Apr 26 '24 10:04 leim

After configuring the kafka queue on the master node, the system has run stably for half a month. Everything is currently stable and no further exceptions have occurred.

leim avatar May 10 '24 05:05 leim