Testing zero-copy bug fixes (not for merging)
This fix will be split into multiple JIRAs: RATIS-2164, RATIS-2151, RATIS-2173
The following are the bugs found so far:
- LeakDetector: asserted allLeaks is non-empty but printed "allLeaks.size = 0".
  - Another bug: Tracks are added to the set before calling retain. Without calling retain at all, it is not a leak.
- SimpleTracing and AdvancedTracing: the methods should be synchronized.
  - Minor presentation problem: AdvancedTracing should have a single track list instead of retainsTraces and releaseTraces.
- GrpcClientProtocolService.UnorderedRequestStreamObserver.processClientRequest(..) should use try-finally.
- GrpcLogAppender.appendLog(..) calls release() incorrectly when there is an exception.
- LogAppenderDefault.sendAppendEntriesWithRetries(..) calls release() incorrectly when there is an exception.
- LogSegment cache can release an entry multiple times.
- LogSegment.loadCache(..) should call retain() for a cache hit.
- SegmentedRaftLog.retainLog(..): between getting the entry and calling retain(), the entry can be released. The "fail to retain" exception, if there is any, can be ignored since it is the same as a cache miss. See #1153
- SegmentedRaftLog.retainEntryWithData(..) should release when there is an exception.
- Test bug: the log entries stored in SimpleStateMachine4Testing can be released.
- LogSegment: New entries can be added after EntryCache is closed.
- MemoryRaftLog has similar problems to those in SegmentedRaftLog.
- SegmentedRaftLogWorker should clean up unfinished tasks in the queue after it has stopped running.
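Two of the patterns in the list above (releasing in try-finally, and treating a failed retain as a cache miss) can be sketched roughly as follows. This is only an illustration: CountedEntry and EntryReader are hypothetical names made up for this sketch, not the actual Ratis classes, and the real code paths are more involved.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical reference-counted holder (an assumption, not a Ratis class). */
final class CountedEntry<T> {
  private final T value;
  // Starts at 1: the cache owns the initial reference.
  private final AtomicInteger count = new AtomicInteger(1);

  CountedEntry(T value) {
    this.value = value;
  }

  /** Take a reference; returns false if the entry was already fully released. */
  boolean tryRetain() {
    for (;;) {
      final int c = count.get();
      if (c <= 0) {
        return false; // released between lookup and retain
      }
      if (count.compareAndSet(c, c + 1)) {
        return true;
      }
    }
  }

  /** Drop one reference; a release without a matching retain is an error. */
  void release() {
    for (;;) {
      final int c = count.get();
      if (c <= 0) {
        throw new IllegalStateException("released more times than retained");
      }
      if (count.compareAndSet(c, c - 1)) {
        return;
      }
    }
  }

  int refCount() {
    return count.get();
  }

  T get() {
    return value;
  }
}

/** Hypothetical reader showing the retain / try-finally / release shape. */
final class EntryReader {
  private EntryReader() {}

  /** Returns null on a cache miss, including the case where retain fails. */
  static <T, R> R readOrNull(CountedEntry<T> cached, Function<T, R> reader) {
    if (cached == null || !cached.tryRetain()) {
      return null; // "fail to retain" is handled the same as a cache miss
    }
    try {
      return reader.apply(cached.get());
    } finally {
      cached.release(); // exactly once, on normal return and on exception
    }
  }
}
```

With this shape, the reference count after a read is the same whether the reader returns normally or throws, and a caller that loses the race with an eviction simply falls back to the cache-miss path.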
Finally, it is able to pass all the tests (with a few retries). Note that there are probably some other zero-copy bugs; I will fix them separately.
This can pass all the tests (with a few retries). Since this change is quite big (56kB) and non-trivial, I will split it into a few JIRAs:
1. The current JIRA RATIS-2164 for fixing LeakDetector. (I will leave this PR as-is and submit another PR.)
2. RATIS-2151 for fixing gRPC.
3. RATIS-2159 for fixing other non-gRPC cases.
I will see if (2) and (3) need to be split further.
BTW, we should move LeakDetector enabling from MiniRaftClusterWithGrpc to MiniRaftCluster. It will be able to detect more failures.