hbase icon indicating copy to clipboard operation
hbase copied to clipboard

HBASE-28158 Decouple RIT list management from TRSP invocation

Open apurtell opened this issue 1 year ago • 8 comments

Only remove a region from the RIT map when assignment reaches a suitable terminal state.

Before this change we would map a RIT list entry to a running TRSP. This is problematic because the in-transition state does not always correspond to the lifetime of a TRSP. In cases of an operator bypass or bug or timeout the region will still be in transition from the user or operator perspective but not included in the RIT list.

Operators expect that regions that should be open but are not appear the master's RIT list, provided by /rits.jsp, the output of the shell's 'rit' command, and in ClusterStatus.

apurtell avatar Oct 17 '23 01:10 apurtell

:confetti_ball: +1 overall

Vote Subsystem Runtime Comment
+0 :ok: reexec 0m 35s Docker mode activated.
_ Prechecks _
+1 :green_heart: dupname 0m 0s No case conflicting files found.
+1 :green_heart: hbaseanti 0m 0s Patch does not have any anti-patterns.
+1 :green_heart: @author 0m 0s The patch does not contain any @author tags.
_ master Compile Tests _
+1 :green_heart: mvninstall 4m 17s master passed
+1 :green_heart: compile 3m 6s master passed
+1 :green_heart: checkstyle 0m 42s master passed
+1 :green_heart: spotless 0m 51s branch has no errors when running spotless:check.
+1 :green_heart: spotbugs 1m 46s master passed
_ Patch Compile Tests _
+1 :green_heart: mvninstall 3m 25s the patch passed
+1 :green_heart: compile 3m 13s the patch passed
+1 :green_heart: javac 3m 13s the patch passed
+1 :green_heart: checkstyle 0m 42s the patch passed
+1 :green_heart: whitespace 0m 0s The patch has no whitespace issues.
+1 :green_heart: hadoopcheck 12m 30s Patch does not cause any errors with Hadoop 3.2.4 3.3.6.
+1 :green_heart: spotless 1m 8s patch has no errors when running spotless:check.
+1 :green_heart: spotbugs 2m 29s the patch passed
_ Other Tests _
+1 :green_heart: asflicense 0m 13s The patch does not generate ASF License warnings.
43m 56s
Subsystem Report/Notes
Docker ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5464/1/artifact/yetus-general-check/output/Dockerfile
GITHUB PR https://github.com/apache/hbase/pull/5464
Optional Tests dupname asflicense javac spotbugs hadoopcheck hbaseanti spotless checkstyle compile
uname Linux 625f8eec7b40 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision master / 391dfda6ad
Default Java Eclipse Adoptium-11.0.17+8
Max. process+thread count 78 (vs. ulimit of 30000)
modules C: hbase-server U: hbase-server
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5464/1/console
versions git=2.34.1 maven=3.8.6 spotbugs=4.7.3
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

Apache-HBase avatar Oct 17 '23 02:10 Apache-HBase

So the intention here is that, when region is in abnormally closed state, we will use operator tools to bypass the TRSP, but then, we will consider the region as not in RIT and confusing the operators?

Seems reasonable. But we should be careful to not mess up the normal logic. Let me take a look at the related code carefully.

Thanks.

Apache9 avatar Oct 17 '23 03:10 Apache9

Based on my local tests there are three test failures to resolve.

This first one is a false positive:

  • TestAssignmentManagerUtil.testCreateUnassignProceduresForMergeFail

These two are timeouts, so waiters need changes:

  • TestAssignmentManagerMetrics.testRITAssignmentManagerMetrics
  • TestRegionReplicaSplit.testAssignFakeReplicaRegion

I think these are test issues but will look in detail tomorrow. Let's see if precommit here has more.

apurtell avatar Oct 17 '23 04:10 apurtell

:broken_heart: -1 overall

Vote Subsystem Runtime Comment
+0 :ok: reexec 0m 30s Docker mode activated.
-0 :warning: yetus 0m 3s Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --whitespace-eol-ignore-list --whitespace-tabs-ignore-list --quick-hadoopcheck
_ Prechecks _
_ master Compile Tests _
+1 :green_heart: mvninstall 2m 32s master passed
+1 :green_heart: compile 0m 36s master passed
+1 :green_heart: shadedjars 5m 6s branch has no errors when building our shaded downstream artifacts.
+1 :green_heart: javadoc 0m 24s master passed
_ Patch Compile Tests _
+1 :green_heart: mvninstall 2m 11s the patch passed
+1 :green_heart: compile 0m 36s the patch passed
+1 :green_heart: javac 0m 36s the patch passed
+1 :green_heart: shadedjars 5m 6s patch has no errors when building our shaded downstream artifacts.
+1 :green_heart: javadoc 0m 22s the patch passed
_ Other Tests _
-1 :x: unit 226m 57s hbase-server in the patch failed.
248m 21s
Subsystem Report/Notes
Docker ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5464/1/artifact/yetus-jdk8-hadoop3-check/output/Dockerfile
GITHUB PR https://github.com/apache/hbase/pull/5464
Optional Tests javac javadoc unit shadedjars compile
uname Linux 6bb6c17b8131 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision master / 391dfda6ad
Default Java Temurin-1.8.0_352-b08
unit https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5464/1/artifact/yetus-jdk8-hadoop3-check/output/patch-unit-hbase-server.txt
Test Results https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5464/1/testReport/
Max. process+thread count 4636 (vs. ulimit of 30000)
modules C: hbase-server U: hbase-server
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5464/1/console
versions git=2.34.1 maven=3.8.6
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

Apache-HBase avatar Oct 17 '23 06:10 Apache-HBase

:broken_heart: -1 overall

Vote Subsystem Runtime Comment
+0 :ok: reexec 0m 49s Docker mode activated.
-0 :warning: yetus 0m 3s Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --whitespace-eol-ignore-list --whitespace-tabs-ignore-list --quick-hadoopcheck
_ Prechecks _
_ master Compile Tests _
+1 :green_heart: mvninstall 4m 3s master passed
+1 :green_heart: compile 1m 4s master passed
+1 :green_heart: shadedjars 6m 0s branch has no errors when building our shaded downstream artifacts.
+1 :green_heart: javadoc 0m 32s master passed
_ Patch Compile Tests _
+1 :green_heart: mvninstall 3m 27s the patch passed
+1 :green_heart: compile 1m 9s the patch passed
+1 :green_heart: javac 1m 9s the patch passed
+1 :green_heart: shadedjars 6m 21s patch has no errors when building our shaded downstream artifacts.
+1 :green_heart: javadoc 0m 35s the patch passed
_ Other Tests _
-1 :x: unit 240m 1s hbase-server in the patch failed.
268m 27s
Subsystem Report/Notes
Docker ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5464/1/artifact/yetus-jdk11-hadoop3-check/output/Dockerfile
GITHUB PR https://github.com/apache/hbase/pull/5464
Optional Tests javac javadoc unit shadedjars compile
uname Linux 5dc037f38998 5.4.0-163-generic #180-Ubuntu SMP Tue Sep 5 13:21:23 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision master / 391dfda6ad
Default Java Eclipse Adoptium-11.0.17+8
unit https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5464/1/artifact/yetus-jdk11-hadoop3-check/output/patch-unit-hbase-server.txt
Test Results https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5464/1/testReport/
Max. process+thread count 4250 (vs. ulimit of 30000)
modules C: hbase-server U: hbase-server
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5464/1/console
versions git=2.34.1 maven=3.8.6
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

Apache-HBase avatar Oct 17 '23 06:10 Apache-HBase

Yes I have legit test failures to resolve, doing that soon.

apurtell avatar Oct 18 '23 23:10 apurtell

For me I think the fundamental issue is the time a region is in transition is not always tied to the lifetime of a TransitRegionStateProcedure, so managing the RIT list from addProcedure/removeProcedure is not correct.

A region is in transition if it is not assigned. In our production we have had regions in transition and no TransitRegionStateProcedure either alive or scheduled for the region. That definitely happens if the TRSP is bypassed. It can also happen due to code bug. If the region is not assigned, it is still in transition, and should be reported as such by the master.

I can agree to totally decouple RIT list management from addProcedure/removeProcedure. That makes sense. The assignment manager will make separate calls to RegionStates to manage the RIT map, independently from management of procedures. How does that sound?

apurtell avatar Oct 26 '23 00:10 apurtell

For me I think the fundamental issue is the time a region is in transition is not always tied to the lifetime of a TransitRegionStateProcedure, so managing the RIT list from addProcedure/removeProcedure is not correct.

But if there is no issue, I do not think a region can be in the RIT state when there is no TRSP? I.e, only if you use HBCK to bypass a TRSP, but I do not think this is a normal operation. That why I say we'd better separate the normal logic and abnormal logic. And if a TRSP finishes, I think the region should leave the RIT state.

Apache9 avatar Oct 26 '23 06:10 Apache9