datasophon [Bug] [Module Name] Bug title: YARN goes down

Search before asking

[X] I had searched in the issues and found no similar issues.

What happened

The YARN cluster goes down a short while after it starts (screenshot: yarnerr).

What you expected to happen

I am not sure whether I missed some required change in the 3.3.6 package.

How to reproduce

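On the 1.2.1 branch, using the Hadoop 3.3.6 package downloaded from the official website, I did the following:

1. Copied the scheduler configuration and added two empty files, blacklist and whitelist:

cp /hadoop-3.3.3/etc/hadoop/fair-scheduler.xml /datasophon//hadoop-3.3.6/etc/hadoop/
cd /datasophon/hadoop-3.3.3/etc/hadoop/

2. HDFS installed normally and runs normally.

3. The YARN cluster goes down a short while after it starts (screenshot: yarnerr).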

The log shows no errors (screenshot: log).

After every restart, the log shows that the previous run was stopped by kill -15 (SIGTERM), for example:

2024-07-12 15:58:57,029 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Added node ddp3:45454 cluster capacity: <memory:12144, vCores:6>
2024-07-12 16:02:34,726 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: RECEIVED SIGNAL 15: SIGTERM
2024-07-12 16:02:34,733 ERROR org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: ExpiredTokenRemover received java.lang.InterruptedException: sleep interrupted
2024-07-12 16:02:34,737 INFO org.eclipse.jetty.server.handler.ContextHandler: Stopped o.e.j.w.WebAppContext@516592b1{cluster,/,null,STOPPED}{jar:file:/datasophon/hadoop-3.3.6/share/hadoop/yarn/hadoop-yarn-common-3.3.6.jar!/webapps/cluster}
2024-07-12 16:02:34,742 INFO org.eclipse.jetty.server.AbstractConnector: Stopped ServerConnector@464a4442{HTTP/1.1, (http/1.1)}{ddp4:8088}
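SIGTERM is delivered by another process, and the log line records only that the signal arrived, not who sent it. To line up each shutdown with whatever else was running on the host at that time, one could list the signal lines with their line numbers. This is a minimal sketch: `list_signal_events` is a hypothetical helper name, and it relies only on the `RECEIVED SIGNAL` text visible in the excerpt above.

```shell
#!/bin/sh
# Sketch: print every "RECEIVED SIGNAL" line of a log file, prefixed with
# its line number (grep -n prints "lineno:line").
# list_signal_events is a hypothetical helper, not part of Hadoop or DataSophon.
list_signal_events() {
  # grep exits with status 1 when nothing matches; treat that as "no events"
  # rather than as a failure.
  grep -n 'RECEIVED SIGNAL' "$1" || true
}
```

It would be called with the path to the ResourceManager log, for example `list_signal_events /path/to/resourcemanager.log` (the path here is a placeholder).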

ps -ef shows that the nn and nm processes (NameNode and NodeManager) are still running, and the YARN service state can still be queried from the command line:

[hdfs@ddp4 datasophon]$ yarn node -list -all
2024-07-12 16:52:47,106 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
Total Nodes:3
Node-Id Node-State Node-Http-Address Number-of-Running-Containers
ddp4:45454 RUNNING ddp4:8042 0
ddp1:45454 RUNNING ddp1:8042 0
ddp3:45454 RUNNING ddp3:8042 0
[hdfs@ddp4 datasophon]$ yarn rmadmin -getAllServiceState
ddp1:8033 standby
ddp4:8033 active
[hdfs@ddp4 datasophon]$ ping ddp1
PING ddp1 (xxxx) 56(84) bytes of data.
64 bytes from ddp1 (xxxx): icmp_seq=1 ttl=64 time=16.6 ms
64 bytes from ddp1 (xxxx): icmp_seq=2 ttl=64 time=8.33 ms
^C
--- ddp1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 8.337/12.510/16.684/4.174 ms
[hdfs@ddp4 datasophon]$ ping ddp3
PING ddp3 (1xxxx) 56(84) bytes of data.
64 bytes from ddp3 (xxxx): icmp_seq=1 ttl=64 time=1.72 ms
64 bytes from ddp3 (xxxx): icmp_seq=2 ttl=64 time=0.540 ms

Every tab of the YARN admin page on port 8088 shows an error (screenshot: 8088-1).

Anything else

No response

Version

main

Are you willing to submit PR?

[ ] Yes I am willing to submit a PR!

Code of Conduct

[X] I agree to follow this project's Code of Conduct

Jul 12 '24 09:07 yuhuang123456

You are not using DDP within its supported scope.

Jul 15 '24 08:07 datasophon

Is it mandatory to use version 3.3.3? It looks like some people have used 3.3.6.

Jul 15 '24 09:07 yuhuang123456

JMX monitoring has not been added, so the status may not be picked up by monitoring and an error is displayed.

Aug 01 '24 12:08 luoyize

I copied jmx over from 3.3.3, but prometheus_config.yml is empty. HDFS is monitored normally; YARN is not.

Aug 02 '24 01:08 yuhuang123456

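I also copied over the ranger-hdfs-plugin directory, which sits at the same level as jmx. As far as I can tell, HDFS uploads files normally and the MapReduce example also runs.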

Aug 02 '24 01:08 yuhuang123456

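Check whether JMX is configured in your yarn-env.sh, then check whether the configs directory of your Prometheus contains a nodemanager configuration. If both are present, check whether your YARN process is one you started just now or a leftover process from a previous installation.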

Aug 02 '24 03:08 luoyize
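The first two checks suggested above could be scripted roughly as follows. This is only a sketch under stated assumptions: both helper names are hypothetical, "JMX is configured" is approximated by a case-insensitive search for the string `jmx`, and "has a nodemanager configuration" is approximated by a file name or file content containing `nodemanager`. Neither heuristic is taken from DataSophon itself.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the given file mentions "jmx" anywhere
# (case-insensitive). A match only suggests that JMX options are set.
mentions_jmx() {
  grep -qi 'jmx' "$1"
}

# Hypothetical helper: succeeds if the given directory holds a file whose
# name or content mentions "nodemanager" (case-insensitive).
has_nodemanager_config() {
  # First look at the file names directly under the directory.
  ls "$1" 2>/dev/null | grep -qi 'nodemanager' && return 0
  # Otherwise search file contents recursively; a missing directory
  # makes grep fail, so the helper reports "not found".
  grep -rqi 'nodemanager' "$1" 2>/dev/null
}
```

A call might look like `mentions_jmx /datasophon/hadoop-3.3.6/etc/hadoop/yarn-env.sh && echo "jmx mentioned"`, assuming yarn-env.sh sits in the etc/hadoop directory used earlier in the report; the Prometheus configs path depends on the installation and is not given in this thread. For the third check, `ps -o lstart= -p <pid>` prints a process's start time on Linux, which helps tell a freshly started YARN process from a leftover one.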

Thanks.

Aug 02 '24 10:08 yuhuang123456