HBASE-28584 RS SIGSEGV under heavy replication load

Open apurtell opened this issue 1 year ago • 4 comments

Clone the cells used to apply mutations on the local cluster. Some operations may still be in flight when other in-flight mutations fail to apply. That failure triggers failure handling, which includes releasing the buffer underlying the cellScanner that sources the cells.
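
To illustrate the hazard described above, here is a minimal, self-contained Java sketch. It is not HBase code: `CellCopySketch`, `BufferBackedCell`, and `releaseAndReuse` are hypothetical stand-ins, and an on-heap `ByteBuffer` is assumed in place of the pooled RPC buffer. In this simulation a stale view only reads overwritten bytes; with real off-heap memory the same access pattern could instead crash the process.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-ins for illustration only; none of these are HBase classes.
final class CellCopySketch {

  /** A "cell" that is only a view over a shared request buffer. */
  static final class BufferBackedCell {
    private final ByteBuffer backing;
    private final int offset;
    private final int length;

    BufferBackedCell(ByteBuffer backing, int offset, int length) {
      this.backing = backing;
      this.offset = offset;
      this.length = length;
    }

    /** Reads whatever bytes are currently in the backing buffer. */
    String value() {
      return new String(bytes(), StandardCharsets.UTF_8);
    }

    /** Deep copy into a private array, so the copy outlives the shared buffer. */
    BufferBackedCell copy() {
      return new BufferBackedCell(ByteBuffer.wrap(bytes()), 0, length);
    }

    private byte[] bytes() {
      byte[] out = new byte[length];
      for (int i = 0; i < length; i++) {
        out[i] = backing.get(offset + i); // absolute get, does not move position
      }
      return out;
    }
  }

  /** Simulates returning the buffer to a pool, where another request overwrites it. */
  static void releaseAndReuse(ByteBuffer buf) {
    for (int i = 0; i < buf.capacity(); i++) {
      buf.put(i, (byte) 'X');
    }
  }

  /** Returns {value seen through the stale view, value seen through the copy}. */
  static String[] demo() {
    ByteBuffer request = ByteBuffer.wrap("row1val1".getBytes(StandardCharsets.UTF_8));
    BufferBackedCell view = new BufferBackedCell(request, 4, 4);
    BufferBackedCell cloned = view.copy(); // taken before the buffer is released

    releaseAndReuse(request); // failure handling frees the buffer early

    return new String[] { view.value(), cloned.value() };
  }

  public static void main(String[] args) {
    String[] result = demo();
    System.out.println("stale view: " + result[0]); // prints "stale view: XXXX"
    System.out.println("copy:       " + result[1]); // prints "copy:       val1"
  }
}
```

The copy costs an extra allocation per cell, which is the trade-off the rest of this thread weighs against reference counting.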

apurtell avatar Jul 28 '24 01:07 apurtell

See also https://issues.apache.org/jira/browse/HBASE-28584?focusedCommentId=17869030&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17869030

apurtell avatar Jul 28 '24 01:07 apurtell

:confetti_ball: +1 overall

| Vote | Subsystem | Runtime | Logfile | Comment |
|---|---|---|---|---|
| +0 :ok: | reexec | 0m 41s | | Docker mode activated. |
| | | | | _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | hbaseanti | 0m 0s | | Patch does not have any anti-patterns. |
| | | | | _ master Compile Tests _ |
| +0 :ok: | mvndep | 0m 17s | | Maven dependency ordering for branch |
| +1 :green_heart: | mvninstall | 4m 22s | | master passed |
| +1 :green_heart: | compile | 4m 30s | | master passed |
| +1 :green_heart: | checkstyle | 1m 8s | | master passed |
| +1 :green_heart: | spotbugs | 2m 42s | | master passed |
| +1 :green_heart: | spotless | 0m 59s | | branch has no errors when running spotless:check. |
| | | | | _ Patch Compile Tests _ |
| +0 :ok: | mvndep | 0m 11s | | Maven dependency ordering for patch |
| +1 :green_heart: | mvninstall | 3m 36s | | the patch passed |
| +1 :green_heart: | compile | 4m 16s | | the patch passed |
| +1 :green_heart: | javac | 4m 16s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 1m 4s | | the patch passed |
| +1 :green_heart: | spotbugs | 2m 46s | | the patch passed |
| +1 :green_heart: | hadoopcheck | 12m 56s | | Patch does not cause any errors with Hadoop 3.3.6 3.4.0. |
| +1 :green_heart: | spotless | 0m 45s | | patch has no errors when running spotless:check. |
| | | | | _ Other Tests _ |
| +1 :green_heart: | asflicense | 0m 21s | | The patch does not generate ASF License warnings. |
| | | 48m 41s | | |

| Subsystem | Report/Notes |
|---|---|
| Docker | ClientAPI=1.46 ServerAPI=1.46 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6124/1/artifact/yetus-general-check/output/Dockerfile |
| GITHUB PR | https://github.com/apache/hbase/pull/6124 |
| Optional Tests | dupname asflicense javac spotbugs checkstyle codespell detsecrets compile hadoopcheck hbaseanti spotless |
| uname | Linux 30055f172d95 5.4.0-182-generic #202-Ubuntu SMP Fri Apr 26 12:29:36 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/hbase-personality.sh |
| git revision | master / e0a31621f67dec9c6da8a0607e9bdb28783afca5 |
| Default Java | Eclipse Adoptium-17.0.11+9 |
| Max. process+thread count | 84 (vs. ulimit of 30000) |
| modules | C: hbase-common hbase-server U: . |
| Console output | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6124/1/console |
| versions | git=2.34.1 maven=3.9.8 spotbugs=4.7.3 |
| Powered by | Apache Yetus 0.15.0 https://yetus.apache.org |

This message was automatically generated.

Apache-HBase avatar Jul 28 '24 02:07 Apache-HBase

OK, in the ServerCall class we have a retainByWAL method, which counts the extra references to the ServerCall, mainly those held through the CellScanners.

I think we can just change the method to retain, which means we want to retain it for other usage even after the rpc call is done, and also use it here.
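
The retain-style reference counting suggested above could look roughly like the following sketch. It is an assumption-laden illustration, not the ServerCall implementation: `RefCountSketch`, `CallBuffer`, `retain`, `release`, and `isFreed` are hypothetical names, and the real buffer cleanup is reduced to setting a flag.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of retain/release counting; not the actual ServerCall code.
final class RefCountSketch {

  static final class CallBuffer {
    // The RPC call itself holds the first reference.
    private final AtomicInteger refs = new AtomicInteger(1);
    private volatile boolean freed = false;

    /** Takes an extra reference for work that may outlive the RPC call. */
    void retain() {
      while (true) {
        int current = refs.get();
        if (current <= 0) {
          // Never resurrect a buffer that has already been freed.
          throw new IllegalStateException("buffer already released");
        }
        if (refs.compareAndSet(current, current + 1)) {
          return;
        }
      }
    }

    /** Drops one reference; the buffer is freed only when the last one goes. */
    void release() {
      int remaining = refs.decrementAndGet();
      if (remaining == 0) {
        freed = true; // stand-in for returning the memory to its pool
      } else if (remaining < 0) {
        throw new IllegalStateException("released more often than retained");
      }
    }

    boolean isFreed() {
      return freed;
    }
  }

  public static void main(String[] args) {
    CallBuffer buffer = new CallBuffer();
    buffer.retain();   // an in-flight mutation still needs the cells
    buffer.release();  // the RPC call finishes (or fails) and drops its reference
    System.out.println(buffer.isFreed()); // prints "false"
    buffer.release();  // the in-flight mutation completes
    System.out.println(buffer.isFreed()); // prints "true"
  }
}
```

Compared with copying the cells, this avoids the extra allocation but requires every user of the buffer to pair each `retain` with exactly one `release`.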

Anyway, since the current PR has been well tested in production, I think we can apply it first and open another issue for the optimization.

Thanks.

Apache9 avatar Jul 28 '24 06:07 Apache9

:confetti_ball: +1 overall

| Vote | Subsystem | Runtime | Logfile | Comment |
|---|---|---|---|---|
| +0 :ok: | reexec | 4m 11s | | Docker mode activated. |
| -0 :warning: | yetus | 0m 3s | | Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --author-ignore-list --blanks-eol-ignore-file --blanks-tabs-ignore-file --quick-hadoopcheck |
| | | | | _ Prechecks _ |
| | | | | _ master Compile Tests _ |
| +0 :ok: | mvndep | 0m 9s | | Maven dependency ordering for branch |
| +1 :green_heart: | mvninstall | 4m 33s | | master passed |
| +1 :green_heart: | compile | 1m 43s | | master passed |
| +1 :green_heart: | javadoc | 1m 15s | | master passed |
| +1 :green_heart: | shadedjars | 6m 12s | | branch has no errors when building our shaded downstream artifacts. |
| | | | | _ Patch Compile Tests _ |
| +0 :ok: | mvndep | 0m 13s | | Maven dependency ordering for patch |
| +1 :green_heart: | mvninstall | 3m 0s | | the patch passed |
| +1 :green_heart: | compile | 1m 21s | | the patch passed |
| +1 :green_heart: | javac | 1m 21s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 47s | | the patch passed |
| +1 :green_heart: | shadedjars | 5m 20s | | patch has no errors when building our shaded downstream artifacts. |
| | | | | _ Other Tests _ |
| +1 :green_heart: | unit | 2m 29s | | hbase-common in the patch passed. |
| +1 :green_heart: | unit | 229m 24s | | hbase-server in the patch passed. |
| | | 265m 42s | | |

| Subsystem | Report/Notes |
|---|---|
| Docker | ClientAPI=1.46 ServerAPI=1.46 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6124/1/artifact/yetus-jdk17-hadoop3-check/output/Dockerfile |
| GITHUB PR | https://github.com/apache/hbase/pull/6124 |
| Optional Tests | javac javadoc unit compile shadedjars |
| uname | Linux 265f2a29681d 5.4.0-177-generic #197-Ubuntu SMP Thu Mar 28 22:45:47 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/hbase-personality.sh |
| git revision | master / e0a31621f67dec9c6da8a0607e9bdb28783afca5 |
| Default Java | Eclipse Adoptium-17.0.11+9 |
| Test Results | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6124/1/testReport/ |
| Max. process+thread count | 5172 (vs. ulimit of 30000) |
| modules | C: hbase-common hbase-server U: . |
| Console output | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6124/1/console |
| versions | git=2.34.1 maven=3.9.8 |
| Powered by | Apache Yetus 0.15.0 https://yetus.apache.org |

This message was automatically generated.

Apache-HBase avatar Jul 28 '24 06:07 Apache-HBase

Any updates here?

Thanks.

Apache9 avatar Sep 04 '24 14:09 Apache9

I think we should fix this in newer releases.

If you are all OK with it, I could try to solve the problem with reference counting instead.

@apurtell @virajjasani Thoughts?

Thanks.

Apache9 avatar Sep 17 '24 14:09 Apache9

I thought refcounting would be complex, but I am not opposed to it as a different solution. If and when we have that, we could remove the copying.

apurtell avatar Sep 17 '24 16:09 apurtell

We would not need this change if https://github.com/apache/hbase/pull/6263 solves the problem instead.

apurtell avatar Sep 18 '24 19:09 apurtell

Fixed by https://github.com/apache/hbase/pull/6263

apurtell avatar Sep 20 '24 00:09 apurtell