
HBASE-27763 Recover WAL encounter KeeperErrorCode = NoNode cause RegionServer crash

Open gottagogottagoGxj opened this issue 2 years ago • 3 comments

gottagogottagoGxj, Apr 13 '23 11:04

:broken_heart: -1 overall

| Vote | Subsystem | Runtime | Comment |
|:----:|----------:|--------:|:--------|
| +0 :ok: | reexec | 0m 24s | Docker mode activated. |
| | | | _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | No case conflicting files found. |
| +1 :green_heart: | hbaseanti | 0m 0s | Patch does not have any anti-patterns. |
| +1 :green_heart: | @author | 0m 0s | The patch does not contain any @author tags. |
| | | | _ master Compile Tests _ |
| +1 :green_heart: | mvninstall | 3m 54s | master passed |
| +1 :green_heart: | compile | 2m 34s | master passed |
| +1 :green_heart: | checkstyle | 0m 36s | master passed |
| +1 :green_heart: | spotless | 0m 43s | branch has no errors when running spotless:check. |
| +1 :green_heart: | spotbugs | 1m 30s | master passed |
| | | | _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 3m 35s | the patch passed |
| +1 :green_heart: | compile | 2m 31s | the patch passed |
| +1 :green_heart: | javac | 2m 31s | the patch passed |
| -0 :warning: | checkstyle | 0m 34s | hbase-server: The patch generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) |
| +1 :green_heart: | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 :green_heart: | hadoopcheck | 13m 22s | Patch does not cause any errors with Hadoop 3.2.4 3.3.4. |
| -1 :x: | spotless | 0m 36s | patch has 53 errors when running spotless:check, run spotless:apply to fix. |
| -1 :x: | spotbugs | 1m 41s | hbase-server generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) |
| | | | _ Other Tests _ |
| +1 :green_heart: | asflicense | 0m 10s | The patch does not generate ASF License warnings. |
| | | 40m 13s | |

| Reason | Tests |
|-------:|:------|
| FindBugs | module:hbase-server |
| | Sequence of calls to java.util.concurrent.ConcurrentHashMap may not be atomic in org.apache.hadoop.hbase.replication.regionserver.RecoveredReplicationSource.startShipperWorks() At RecoveredReplicationSource.java:[line 180] |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5177/1/artifact/yetus-general-check/output/Dockerfile |
| GITHUB PR | https://github.com/apache/hbase/pull/5177 |
| Optional Tests | dupname asflicense javac spotbugs hadoopcheck hbaseanti spotless checkstyle compile |
| uname | Linux d9c601b0220d 5.4.0-1093-aws #102~18.04.2-Ubuntu SMP Wed Dec 7 00:31:59 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/hbase-personality.sh |
| git revision | master / a71105997f |
| Default Java | Eclipse Adoptium-11.0.17+8 |
| checkstyle | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5177/1/artifact/yetus-general-check/output/diff-checkstyle-hbase-server.txt |
| spotless | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5177/1/artifact/yetus-general-check/output/patch-spotless.txt |
| spotbugs | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5177/1/artifact/yetus-general-check/output/new-spotbugs-hbase-server.html |
| Max. process+thread count | 82 (vs. ulimit of 30000) |
| modules | C: hbase-server U: hbase-server |
| Console output | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5177/1/console |
| versions | git=2.34.1 maven=3.8.6 spotbugs=4.7.3 |
| Powered by | Apache Yetus 0.12.0 https://yetus.apache.org |

This message was automatically generated.
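
For readers unfamiliar with the SpotBugs finding in the report ("Sequence of calls to java.util.concurrent.ConcurrentHashMap may not be atomic"), here is a minimal sketch of the kind of pattern that warning usually targets. The patch itself is not shown in this thread, so this is not the code at `RecoveredReplicationSource.java` line 180. The `Worker` class, the method names, and the `wal-group-1` key are hypothetical stand-ins chosen for illustration.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

public class AtomicMapUpdate {
  // Hypothetical stand-in for a per-WAL-group worker; not an HBase class.
  static final class Worker {
    final String walGroupId;

    Worker(String walGroupId) {
      this.walGroupId = walGroupId;
    }
  }

  // Check-then-act: each call is thread-safe on its own, but two threads can
  // both observe "absent" and both create a worker. This is the shape of code
  // that the "sequence of calls may not be atomic" warning typically flags.
  static Worker racyStart(ConcurrentMap<String, Worker> workers, String walGroupId,
      AtomicInteger created) {
    if (!workers.containsKey(walGroupId)) {
      created.incrementAndGet();
      workers.put(walGroupId, new Worker(walGroupId));
    }
    return workers.get(walGroupId);
  }

  // One atomic call: ConcurrentHashMap applies the mapping function at most
  // once per key, so only one worker is ever created for a given group.
  static Worker atomicStart(ConcurrentMap<String, Worker> workers, String walGroupId,
      AtomicInteger created) {
    return workers.computeIfAbsent(walGroupId, id -> {
      created.incrementAndGet();
      return new Worker(id);
    });
  }

  // Starts `threads` threads that all try to register the same key at once
  // and returns how many workers were actually constructed.
  static int createdUnderContention(int threads) {
    ConcurrentMap<String, Worker> workers = new ConcurrentHashMap<>();
    AtomicInteger created = new AtomicInteger();
    CountDownLatch go = new CountDownLatch(1);
    Thread[] pool = new Thread[threads];
    for (int i = 0; i < threads; i++) {
      pool[i] = new Thread(() -> {
        try {
          go.await();
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return;
        }
        atomicStart(workers, "wal-group-1", created);
      });
      pool[i].start();
    }
    go.countDown();
    for (Thread t : pool) {
      try {
        t.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException(e);
      }
    }
    return created.get();
  }

  public static void main(String[] args) {
    System.out.println("workers created: " + createdUnderContention(8));
  }
}
```

Whether `computeIfAbsent` (or `putIfAbsent`) is the right fix for the flagged line depends on what the patch does between the calls, which the report does not show.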

Apache-HBase, Apr 13 '23 12:04

Mind explaining in more detail how we fix the NoNode exception?

Apache9, Apr 13 '23 15:04

Hi @gottagogottagoGxj, I'd appreciate it if you could explain this ticket in more detail and share your HBase version.

It seems I hit this issue too, on HBase 2.4.11.

Here is my log:

2024-03-21 16:19:43,379 WARN  [ReplicationExecutor-0.replicationSource,xxxxx,1705567104078.replicationSource.shipper000.000.000.000%2C16020%2C1705567104078.000.000.000.000%2C16020%2C1705567104078.regiongroup-1,xxxxx,1705567104078] regionserver.ReplicationSourceShipper: com.shopee.di.foundation.hbase.KafkaInterClusterReplicationEndpoint threw unknown exception:
java.util.ConcurrentModificationException
        at java.base/java.util.HashMap.computeIfAbsent(HashMap.java:1221)
        at org.apache.hadoop.hbase.replication.regionserver.MetricsSource.updateTableLevelMetrics(MetricsSource.java:112)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceShipper.shipEdits(ReplicationSourceShipper.java:215)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceShipper.run(ReplicationSourceShipper.java:117)
2024-03-21 16:19:43,405 ERROR [ReplicationExecutor-0.replicationSource,xxxxx,1705567104078.replicationSource.shipper000.000.000.000%2C16020%2C1705567104078.000.000.000.000%2C16020%2C1705567104078.regiongroup-1,xxxxx,1705567104078] regionserver.HRegionServer: ***** ABORTING region server ip-10-80-163-145.idata-server.shopee.io,16020,1704705566934: Failed to operate on replication queue *****
org.apache.hadoop.hbase.replication.ReplicationException: Failed to set log position (serverName=xxxxx,1704705566934, queueId=xxxxx,1705567104078, fileName=000.000.000.000%2C16020%2C1705567104078.000.000.000.000%2C16020%2C1705567104078.regiongroup-1.1711008927746, position=130724689)
        at org.apache.hadoop.hbase.replication.ZKReplicationQueueStorage.setWALPosition(ZKReplicationQueueStorage.java:255)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.lambda$logPositionAndCleanOldLogs$8(ReplicationSourceManager.java:552)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.interruptOrAbortWhenFail(ReplicationSourceManager.java:500)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:551)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceInterface.logPositionAndCleanOldLogs(ReplicationSourceInterface.java:206)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceShipper.updateLogPosition(ReplicationSourceShipper.java:264)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceShipper.shipEdits(ReplicationSourceShipper.java:203)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceShipper.run(ReplicationSourceShipper.java:117)
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
        at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1925)
        at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1830)
        at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.multi(RecoverableZooKeeper.java:658)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.multiOrSequential(ZKUtil.java:1534)
        at org.apache.hadoop.hbase.replication.ZKReplicationQueueStorage.setWALPosition(ZKReplicationQueueStorage.java:245)
        ... 7 more

*Sensitive information such as server names and IPs has been redacted.
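
As a side note on the first trace: the `ConcurrentModificationException` is thrown from `HashMap.computeIfAbsent`, which on Java 9 and later fails if the map changes while the mapping function runs. One possible reading, which is an assumption and not confirmed anywhere in this thread, is that a plain `HashMap` of per-table metrics is being touched by more than one shipper thread. The sketch below reproduces the same check deterministically on a single thread (the mapping function itself performs the "other" write), then shows a `ConcurrentHashMap` variant handling real concurrent updates. All class, method, and key names are hypothetical.

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

public class TableMetricsSketch {
  // Triggers the modification check inside HashMap.computeIfAbsent (Java 9+).
  // The put() inside the mapping function stands in for a write that, in a
  // multi-threaded setting, would come from another thread.
  static boolean hashMapDetectsModification() {
    Map<String, LongAdder> metrics = new HashMap<>();
    try {
      metrics.computeIfAbsent("table-a", t -> {
        metrics.put("table-b", new LongAdder());
        return new LongAdder();
      });
      return false;
    } catch (ConcurrentModificationException e) {
      return true;
    }
  }

  // Several threads update per-table counters through a ConcurrentHashMap.
  // Returns the sum of all counters, which should equal threads * updatesPerThread.
  static long countWithConcurrentMap(int threads, int updatesPerThread) {
    Map<String, LongAdder> metrics = new ConcurrentHashMap<>();
    Thread[] pool = new Thread[threads];
    for (int i = 0; i < threads; i++) {
      pool[i] = new Thread(() -> {
        for (int n = 0; n < updatesPerThread; n++) {
          metrics.computeIfAbsent("table-" + (n % 4), t -> new LongAdder()).increment();
        }
      });
      pool[i].start();
    }
    for (Thread t : pool) {
      try {
        t.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException(e);
      }
    }
    long total = 0;
    for (LongAdder counter : metrics.values()) {
      total += counter.sum();
    }
    return total;
  }

  public static void main(String[] args) {
    System.out.println("HashMap raised CME: " + hashMapDetectsModification());
    System.out.println("ConcurrentHashMap total: " + countWithConcurrentMap(8, 10_000));
  }
}
```

This only illustrates the exception type. It does not explain the later `NoNode` failure in `ZKReplicationQueueStorage.setWALPosition`, and the log alone does not show whether the two errors are related.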

Thank you.

thangTang, Mar 21 '24 09:03