IGNITE-21805 Refactor TableManager and move all RAFT related pieces to Replica

Open JAkutenshi opened this issue 10 months ago • 1 comments

Apache JIRA ticket's link

The goal

The goal of this PR is to remove RaftManager from TableManager and move it, together with its calls, into ReplicaManager.

The current issues

The main issues are currently related to the TableManager code at lines 967-993:

  1. The ordering of the internal table update and the replica creation and start is important.
  2. The internal table update should proceed in any case, while the replica should be started only if the commented-out condition on lines 971-973 isn't true.
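The two constraints above can be sketched roughly as follows. This is only an illustration, not the TableManager code: the class, the method names and the `skipReplicaStart` flag (standing in for the commented-out condition) are hypothetical, and placing the internal table update before the replica start is an assumption made for the sketch, since only the fact that the order matters is stated above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class PartitionStartSketch {
    /** Records the executed steps so the sequencing is visible. */
    final List<String> steps = new ArrayList<>();

    // Hypothetical stand-in for the internal table update.
    void updateInternalTable(int partId) {
        steps.add("update-internal-table:" + partId);
    }

    // Hypothetical stand-in for starting a replica; not the real ReplicaManager API.
    CompletableFuture<Void> startReplica(int partId) {
        steps.add("start-replica:" + partId);
        return CompletableFuture.completedFuture(null);
    }

    /**
     * Always updates the internal table, and starts the replica only when
     * the (hypothetical) skip condition is false.
     * ASSUMPTION: the update is done first; the real required order may differ.
     */
    CompletableFuture<Void> startPartition(int partId, boolean skipReplicaStart) {
        updateInternalTable(partId);

        if (skipReplicaStart) {
            return CompletableFuture.completedFuture(null);
        }

        return startReplica(partId);
    }

    public static void main(String[] args) {
        PartitionStartSketch sketch = new PartitionStartSketch();
        sketch.startPartition(0, true).join();
        sketch.startPartition(1, false).join();
        System.out.println(sketch.steps);
    }
}
```

The point of the sketch is that the condition guards only the replica start, so the internal table update is never skipped along with it.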

Related tests failures:

org.apache.ignite.internal.table.distributed.TableManagerRecoveryTest

The failure of the following two tests is probably caused by a null value somewhere around ReplicaManager:L679.

testTableIgnoredOnRecovery

Caused by: java.lang.NullPointerException
  at org.apache.ignite.internal.table.distributed.TableManager.lambda$startPartitionAndStartClient$32(TableManager.java:992) ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
  at org.apache.ignite.internal.util.IgniteUtils.inBusyLock(IgniteUtils.java:868) ~[ignite-core-3.0.0-SNAPSHOT.jar:?]
  at org.apache.ignite.internal.table.distributed.TableManager.lambda$startPartitionAndStartClient$33(TableManager.java:967) ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
  at java.base/java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:714) ~[?:?]
  ... 4 more

testTableStartedOnRecovery

Caused by: java.lang.NullPointerException
  at org.apache.ignite.internal.table.distributed.TableManager.lambda$startPartitionAndStartClient$32(TableManager.java:992) ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
  at org.apache.ignite.internal.util.IgniteUtils.inBusyLock(IgniteUtils.java:868) ~[ignite-core-3.0.0-SNAPSHOT.jar:?]
  at org.apache.ignite.internal.table.distributed.TableManager.lambda$startPartitionAndStartClient$33(TableManager.java:967) ~[ignite-table-3.0.0-SNAPSHOT.jar:?]
  at java.base/java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:714) ~[?:?]
  ... 4 more

org.apache.ignite.internal.rebalance.ItRebalanceDistributedTest

testRebalanceWithTheSameNodes

The failure is caused by point 2 of the main issues: we should start the replica only once per node.

org.mockito.exceptions.verification.TooManyActualInvocations: 
replicaManager.startReplica(
    <any>,
    <any>,
    <any java.util.function.Function>,
    <any>
);
Wanted 1 time:
-> at org.apache.ignite.internal.replicator.ReplicaManager.startReplica(ReplicaManager.java:583)
But was 3 times:
-> at org.apache.ignite.internal.table.distributed.TableManager.lambda$startPartitionAndStartClient$32(TableManager.java:976)
-> at org.apache.ignite.internal.table.distributed.TableManager.lambda$startPartitionAndStartClient$32(TableManager.java:976)
-> at org.apache.ignite.internal.table.distributed.TableManager.lambda$startPartitionAndStartClient$32(TableManager.java:976)

org.apache.ignite.internal.disaster.ItDisasterRecoveryReconfigurationTest

Both failed tests, testManualRebalanceIfPartitionIsLost and testManualRebalanceIfMajorityIsLost, are unfamiliar to me, and their failures are still unclear. The common symptom looks like this:

java.lang.AssertionError: java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException
  at org.apache.ignite.internal.testframework.matchers.CompletableFutureMatcher.matchesSafely(CompletableFutureMatcher.java:78)
  at org.apache.ignite.internal.testframework.matchers.CompletableFutureMatcher.matchesSafely(CompletableFutureMatcher.java:35)
  at org.hamcrest.TypeSafeMatcher.matches(TypeSafeMatcher.java:67)
  at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:10)
  at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:6)
  at org.apache.ignite.internal.disaster.ItDisasterRecoveryReconfigurationTest.testManualRebalanceIfPartitionIsLost(ItDisasterRecoveryReconfigurationTest.java:229)
  at java.base/java.lang.reflect.Method.invoke(Method.java:566)
  at java.base/java.util.ArrayList.forEach(ArrayList.java:1541)
  at java.base/java.util.ArrayList.forEach(ArrayList.java:1541)
Caused by: java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException
  at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:395)
  at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2022)
  at org.apache.ignite.internal.testframework.matchers.CompletableFutureMatcher.matchesSafely(CompletableFutureMatcher.java:74)
  ... 8 more
Caused by: java.util.concurrent.TimeoutException
  at java.base/java.util.concurrent.CompletableFuture$Timeout.run(CompletableFuture.java:2792)
  at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
  at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
  at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
  at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
  at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
  at java.base/java.lang.Thread.run(Thread.java:834)

But I'm sure that isn't the root cause.


Thank you for submitting the pull request.

To streamline the review process of the patch and ensure better code quality we ask both an author and a reviewer to verify the following:

The Review Checklist

  • [ ] Formal criteria: TC status, codestyle, mandatory documentation. Also make sure to complete the following:
    - There is a single JIRA ticket related to the pull request.
    - The web-link to the pull request is attached to the JIRA ticket.
    - The JIRA ticket has the Patch Available state.
    - The description of the JIRA ticket explains WHAT was made, WHY and HOW.
    - The pull request title is treated as the final commit message. The following pattern must be used: IGNITE-XXXX Change summary where XXXX - number of JIRA issue.
  • [ ] Design: new code conforms with the design principles of the components it is added to.
  • [ ] Patch quality: patch cannot be split into smaller pieces, its size must be reasonable.
  • [ ] Code quality: code is clean and readable, necessary developer documentation is added if needed.
  • [ ] Tests code quality: test set covers positive/negative scenarios, happy/edge cases. Tests are effective in terms of execution time and resources.

Notes

JAkutenshi, Apr 18 '24 21:04

A comment about the test fix: before this ticket there was no .join() in TableManager, but now there is, and if startReplica() returns null, it fails with an NPE. In the context of the test, ReplicaManager is mocked, so e.g. busyLock is null and so on. Without mocking the method, the result of startReplica() is null, and .join() then hits an NPE, which leads to the TimeoutException at the top of the stack trace. As a solution, I mock startReplica() so that it returns a future completed with a null value instead of a plain null.
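A minimal, Mockito-free sketch of the difference, assuming a simplified hypothetical `ReplicaStarter` interface in place of the mocked ReplicaManager (the real startReplica() takes arguments and has a different return type parameter):

```java
import java.util.concurrent.CompletableFuture;

public class NullFutureSketch {
    /** Hypothetical, simplified stand-in for the mocked replica-starting API. */
    interface ReplicaStarter {
        CompletableFuture<Void> startReplica();
    }

    /** Mimics a caller that joins on the returned future. */
    static String startAndJoin(ReplicaStarter starter) {
        try {
            // Throws NullPointerException if startReplica() returned null.
            starter.startReplica().join();
            return "ok";
        } catch (NullPointerException e) {
            return "NPE";
        }
    }

    public static void main(String[] args) {
        // Behaves like the unstubbed mock described above: the method returns null.
        ReplicaStarter unstubbed = () -> null;
        // Behaves like the stubbed mock: a future already completed with a null value.
        ReplicaStarter stubbed = () -> CompletableFuture.completedFuture(null);

        System.out.println(startAndJoin(unstubbed)); // prints "NPE"
        System.out.println(startAndJoin(stubbed));   // prints "ok"
    }
}
```

With Mockito, the stubbed variant corresponds to a `when(...).thenReturn(...)` stub on startReplica() that returns the completed future.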

JAkutenshi, Apr 29 '24 11:04