ignite-3
IGNITE-21805 Refactor TableManager and move all RAFT related pieces to Replica
The goal
The goal of this PR is to remove `RaftManager` from `TableManager` and move it, together with its calls, into `ReplicaManager`.
The current issues
The main issues are currently in `TableManager`, lines 967-993:
- The order in which the internal table is updated and the replica is created and started matters.
- The internal table update should proceed in any case, while the replica should be started only if the commented-out condition on lines 971-973 isn't true.
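The two constraints above can be sketched as a single sequential future chain. This is a minimal illustration, not the actual `TableManager` code: the class, the method, the `skipReplicaStart` flag (standing in for the commented-out condition) and the step names are all hypothetical, and placing the replica start before the internal table update is an assumption made only to show a fixed order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class PartitionStartSketch {
    /**
     * Runs both steps as one sequential chain so that their relative order is fixed.
     * All names are hypothetical; skipReplicaStart stands in for the commented-out condition.
     */
    public static List<String> startPartition(boolean skipReplicaStart) {
        List<String> steps = new ArrayList<>();

        CompletableFuture<Void> chain = CompletableFuture.completedFuture((Void) null)
                .thenCompose(unused -> {
                    if (!skipReplicaStart) {
                        // Conditional step: the replica is started only when the condition is false.
                        steps.add("startReplica");
                    }
                    // Always hand a completed future to the next stage, even when nothing was started.
                    return CompletableFuture.completedFuture((Void) null);
                })
                // Unconditional step: the internal table is updated in any case.
                .thenRun(() -> steps.add("updateInternalTable"));

        chain.join();
        return steps;
    }

    public static void main(String[] args) {
        System.out.println(startPartition(false)); // [startReplica, updateInternalTable]
        System.out.println(startPartition(true));  // [updateInternalTable]
    }
}
```

Keeping both steps in one chain means the update never races with the replica start, and skipping the replica does not skip the update.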
Related test failures:
org.apache.ignite.internal.table.distributed.TableManagerRecoveryTest
The two tests below probably fail because of a `null` somewhere around `ReplicaManager:L679`.
testTableIgnoredOnRecovery
Caused by: java.lang.NullPointerException
at org.apache.ignite.internal.table.distributed.TableManager.lambda$startPartitionAndStartClient$32(TableManager.java:992) ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
at org.apache.ignite.internal.util.IgniteUtils.inBusyLock(IgniteUtils.java:868) ~[ignite-core-3.0.0-SNAPSHOT.jar:?]
at org.apache.ignite.internal.table.distributed.TableManager.lambda$startPartitionAndStartClient$33(TableManager.java:967) ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
at java.base/java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:714) ~[?:?]
... 4 more
testTableStartedOnRecovery
Caused by: java.lang.NullPointerException
at org.apache.ignite.internal.table.distributed.TableManager.lambda$startPartitionAndStartClient$32(TableManager.java:992) ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
at org.apache.ignite.internal.util.IgniteUtils.inBusyLock(IgniteUtils.java:868) ~[ignite-core-3.0.0-SNAPSHOT.jar:?]
at org.apache.ignite.internal.table.distributed.TableManager.lambda$startPartitionAndStartClient$33(TableManager.java:967) ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
at java.base/java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:714) ~[?:?]
... 4 more
org.apache.ignite.internal.rebalance.ItRebalanceDistributedTest
testRebalanceWithTheSameNodes
The cause of the failure is point 2 of the main issues: we should start the replica only once per node.
org.mockito.exceptions.verification.TooManyActualInvocations:
replicaManager.startReplica(
<any>,
<any>,
<any java.util.function.Function>,
<any>
);
Wanted 1 time:
-> at org.apache.ignite.internal.replicator.ReplicaManager.startReplica(ReplicaManager.java:583)
But was 3 times:
-> at org.apache.ignite.internal.table.distributed.TableManager.lambda$startPartitionAndStartClient$32(TableManager.java:976)
-> at org.apache.ignite.internal.table.distributed.TableManager.lambda$startPartitionAndStartClient$32(TableManager.java:976)
-> at org.apache.ignite.internal.table.distributed.TableManager.lambda$startPartitionAndStartClient$32(TableManager.java:976)
org.apache.ignite.internal.disaster.ItDisasterRecoveryReconfigurationTest
Both failed tests, `testManualRebalanceIfPartitionIsLost` and `testManualRebalanceIfMajorityIsLost`, are unfamiliar to me and still unclear. The common failure looks roughly like this:
java.lang.AssertionError: java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException
at org.apache.ignite.internal.testframework.matchers.CompletableFutureMatcher.matchesSafely(CompletableFutureMatcher.java:78)
at org.apache.ignite.internal.testframework.matchers.CompletableFutureMatcher.matchesSafely(CompletableFutureMatcher.java:35)
at org.hamcrest.TypeSafeMatcher.matches(TypeSafeMatcher.java:67)
at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:10)
at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:6)
at org.apache.ignite.internal.disaster.ItDisasterRecoveryReconfigurationTest.testManualRebalanceIfPartitionIsLost(ItDisasterRecoveryReconfigurationTest.java:229)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at java.base/java.util.ArrayList.forEach(ArrayList.java:1541)
at java.base/java.util.ArrayList.forEach(ArrayList.java:1541)
Caused by: java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException
at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:395)
at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2022)
at org.apache.ignite.internal.testframework.matchers.CompletableFutureMatcher.matchesSafely(CompletableFutureMatcher.java:74)
... 8 more
Caused by: java.util.concurrent.TimeoutException
at java.base/java.util.concurrent.CompletableFuture$Timeout.run(CompletableFuture.java:2792)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
But I'm sure that isn't the root cause.
Thank you for submitting the pull request.
To streamline the review process of the patch and ensure better code quality we ask both an author and a reviewer to verify the following:
The Review Checklist
- [ ] Formal criteria: TC status, codestyle, mandatory documentation. Also make sure to complete the following:
- There is a single JIRA ticket related to the pull request.
- The web-link to the pull request is attached to the JIRA ticket.
- The JIRA ticket has the Patch Available state.
- The description of the JIRA ticket explains WHAT was made, WHY and HOW.
- The pull request title is treated as the final commit message. The following pattern must be used: IGNITE-XXXX Change summary where XXXX - number of JIRA issue.
- [ ] Design: new code conforms with the design principles of the components it is added to.
- [ ] Patch quality: patch cannot be split into smaller pieces, its size must be reasonable.
- [ ] Code quality: code is clean and readable, necessary developer documentation is added if needed.
- [ ] Tests code quality: test set covers positive/negative scenarios, happy/edge cases. Tests are effective in terms of execution time and resources.
Notes
A comment about the test fix: before this ticket there was no `.join()` in `TableManager`, but now there is, and if `startReplica()` returns `null` it fails with an `NPE`. In the test, `ReplicaManager` is mocked, so, for example, `busyLock` is `null`, and so on. Without mocking the method, `startReplica()` returns `null`, and `.join()` then hits an `NPE`, which leads to the `TimeoutException` at the top of the stack trace. As a solution, I simply mock `startReplica()` so that it returns a future completed with a `null` value instead of just `null`.
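The effect described in this note can be reproduced without Mockito. The sketch below is a simplified model, not the test code itself: `StartReplicaStubSketch`, `awaitPartitionStart` and the `Supplier` standing in for the mocked `startReplica()` are hypothetical names, and the assumption is that the future awaited by the test is only completed after the `.join()` inside an asynchronous callback.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

public class StartReplicaStubSketch {
    /**
     * Joins the future returned by a stubbed startReplica() inside a callback and only
     * then completes the future that the caller awaits. Returns how the wait ended.
     */
    public static String awaitPartitionStart(Supplier<CompletableFuture<Void>> startReplicaStub) {
        CompletableFuture<Void> awaited = new CompletableFuture<>();

        CompletableFuture.completedFuture(null).thenAccept(unused -> {
            // If the stub returned a bare null, this line throws NullPointerException.
            // The exception is captured by the callback's own future, so nobody sees it here.
            startReplicaStub.get().join();
            awaited.complete(null);
        });

        try {
            awaited.get(200, TimeUnit.MILLISECONDS);
            return "ok";
        } catch (TimeoutException e) {
            // The awaited future was never completed because the callback died early.
            return "timeout";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        } catch (ExecutionException e) {
            return "failed";
        }
    }

    public static void main(String[] args) {
        // Unstubbed mock behaviour: startReplica() returns null.
        System.out.println(awaitPartitionStart(() -> null)); // timeout
        // Stubbed behaviour: a future completed with a null value.
        System.out.println(awaitPartitionStart(() -> CompletableFuture.completedFuture(null))); // ok
    }
}
```

In this model the `NPE` is swallowed by the callback's future, which is why only the `TimeoutException` is visible to the waiting side.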