Backport some bug fixes from the 3.x branch to the 2.x branch

Open ilixiaocui opened this issue 7 months ago • 11 comments

What changes were proposed in this pull request?

We are currently operating Ratis 2.4.0 in production at significant scale, where we've observed two recurring issues related to snapshot installation, consistent with existing community reports (reference: RATIS-2140, RATIS-2208).

Would it be possible, at your convenience, to consider backporting the associated fixes to the 2.x maintenance branch? Such an effort would greatly assist our team in planning a stable production upgrade path while continuing to leverage this foundational version.

We sincerely appreciate your guidance on this matter and remain grateful for the community's ongoing stewardship of Ratis.

What is the link to the Apache JIRA?

RATIS-2140 RATIS-2208

How was this patch tested?

ilixiaocui avatar Jul 16 '25 07:07 ilixiaocui

@szetszwo Looking forward to your assistance.

ilixiaocui avatar Jul 16 '25 07:07 ilixiaocui

@ilixiaocui, sure, we could backport bug fixes to branch-2.

Would you consider upgrading to the recent release 3.2.0?

szetszwo avatar Jul 16 '25 16:07 szetszwo

Much appreciated! Since we have dozens of production clusters that need to remain compatible during ongoing upgrades, moving to version 3.2.0 just isn't in the cards for the foreseeable future. Should we plan new clusters down the road, we'll consider upgrading them holistically when we do.

ilixiaocui avatar Jul 17 '25 03:07 ilixiaocui

@ilixiaocui, could you select a list of commits you would like to backport? I could merge them into branch-2.

szetszwo avatar Jul 18 '25 21:07 szetszwo

The bugs triggered in the production environment are related to the following two issues. The corresponding commit IDs are based on the ratis-3.2.0 release:

RATIS-2140 related 2e7cb45

RATIS-2208 related 2c4e354 cf893f6 337df17 17ca6f4 5d3476f

Thank you again for your assistance! @szetszwo


In addition, the ratis-3.0.0 release notes summarize many bug fixes from the 2.x series. Would you consider backporting these fixes to the 2.x branch as well? The corresponding commit IDs are based on the ratis-3.2.0 release:

RATIS-2116: 6390a28
RATIS-1909: b7ffa1b
RATIS-1895: d461a01
RATIS-1902: 4c8ef9d
RATIS-1912: c35f769
RATIS-1858: 5c47d3b
RATIS-1804: 9535259
RATIS-1883: b8ce6d1
RATIS-1920: 5a8519e
RATIS-1928: 1b05bfc
RATIS-1705: 95b51e5
RATIS-1887: 05f3922
RATIS-1890: be28b39
RATIS-1893: 0e136f3
RATIS-1884: a483bd4
RATIS-872: 22cbefa
RATIS-1916: 7015ba2

ilixiaocui avatar Jul 23 '25 05:07 ilixiaocui

@ilixiaocui, I tried merging the list, but some of the commits (the ones commented out below) have serious conflicts. Let me see how to resolve them.

git cherry-pick 5c47d3b4cafffa8e2bc21276f302d70efbbed5a9 #RATIS-1858. Follower keeps logging first election timeout. (#894)

git cherry-pick 95b51e512ffa3d0798607b82f8b474649413f2bd #RATIS-1705. Fix metrics leak (#744)

git cherry-pick a6719dc63eb90cc6bdc622a0824101945e746475 #RATIS-1873
git cherry-pick a483bd4bf015b5b368215e0d622ff43ed317b0c7 #RATIS-1884. Fix retry cache warning condition (#915)

git cherry-pick b8ce6d1f6ea37ed3ff9f6e888d2357fe48490567 #RATIS-1883. Next Index should be always larger than Match Index in GrpcLogAppender (#914)
git cherry-pick 05f39221102abc00b2934e279da872d06f6a1811 #RATIS-1887. Gap between segement log (#919)
git cherry-pick be28b3907f4fee8957fb2824770e4925364d0a8f #RATIS-1890. SegmentedRaftLogCache#shouldEvict should only iterate over closed segments once (#921)
git cherry-pick 0e136f39123dc65a07a41c7146ea0e91f0fe1fa7 #RATIS-1893. In SegmentedRaftLogCache, start a daemon thread to checkAndEvictCache. (#924)
git cherry-pick d461a01a53e7e130f0ec4143e75b316012137b62 #RATIS-1895. IllegalStateException: Failed to updateIncreasingly for nextIndex. (#926)

git cherry-pick 8a74dc256c875b46025e24d1d9c9de8e8379a53c #RATIS-1886

git cherry-pick 4c8ef9db16e32d13a1eb07fce12a7563b830a2da #RATIS-1902. The snapshot index is set incorrectly in InstallSnapshotReplyProto. (#933)
git cherry-pick b7ffa1ba1e3e7cecd9ea687f72425c2ffd5b1c34 #RATIS-1909. Fix Decreasing Next Index When GrpcLogAppender Reset Client. (#939)
git cherry-pick 5a8519ee6cc40abb999d07154c4c2d12320c2da1 #RATIS-1920. NPE in AppendLogResponseHandler. (#952)
git cherry-pick 7015ba2f274394697dffec417b43374656077d88 #RATIS-1916. OrderAsync does not call handReply. (#948)

# git cherry-pick 22cbefa2c11c3471d2f763ccb4251806ed3529f5 #RATIS-872. Invalidate replied calls in retry cache. (#942)

git cherry-pick c35f769f513609d808ab1cc91c5323d9ff30f636 #RATIS-1912. Fix infinity election when perform membership change. (#954)
git cherry-pick 95352591005a1bf867f9aac9f9c0b337741181e3 #RATIS-1804. Change the default number of outstanding append entires. (#838)
git cherry-pick 1b05bfcc76e4f3007d389dc52ee0305b9fff8e41 #RATIS-1928. Join the LogAppenders when closing the server. (#959)

# git cherry-pick 6390a28bdf1d2c454d49a11dca117e5bbc482f54 #RATIS-2116. Fix the issue where RaftServerImpl.appendEntries may be blocked indefinitely (#1116)

git cherry-pick 2e7cb458ca6a10b4c38cafca7e8eee8a8e7fcef1 #RATIS-2140. Thread wait when installing snapshot. (#1137)
# git cherry-pick 2c4e354f133a44b971837ea33b5f89d62302cb63 #RATIS-2232. Improve log for debugging on RaftLog / TransactionManager (#1203)
git cherry-pick 337df17c7ea27fbaac9f5f82f8557dc815830d7c #RATIS-2234. Remove lock race between heartbeat and append log channels (#1205)

git cherry-pick cf893f64906df82908fcc43aed2d575e52f7a174 #RATIS-2233. make NOPROGRESS timeout configurable (#1204)
# git cherry-pick 17ca6f41d0a577de2ecb452368c1a38b0c63d8b7 #RATIS-2235. Allow only one thread to perform appendLog  (#1206)
# git cherry-pick 5d3476f27650c13e94d6bbe5ccbfbc7ca4712eea #RATIS-2242. change consistency criteria of heartbeat during appendLog (#1215)
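A list like the one above can also be applied from a file, so that the commented-out entries stay documented but are skipped. The following is only a minimal sketch: the function names `active_picks` and `apply_picks` and the file name `picks.txt` are made up for illustration, and it assumes the list is saved one command per line exactly as shown, with conflicting entries prefixed by `#`.

```shell
#!/bin/sh
# Print the commit hash of every active (non-commented) "git cherry-pick" line.
# $1: path to a file holding the list (hypothetical name, e.g. picks.txt).
active_picks() {
  # Commented entries start with "#", so their first field is not "git".
  awk '$1 == "git" && $2 == "cherry-pick" { print $3 }' "$1"
}

# Apply the active picks in file order and stop at the first failure.
# Run this inside a checkout of the target branch.
apply_picks() {
  for hash in $(active_picks "$1"); do
    # -x records "(cherry picked from commit ...)" in the new commit message.
    if ! git cherry-pick -x "$hash"; then
      echo "cherry-pick failed at $hash; resolve it or run: git cherry-pick --abort" >&2
      return 1
    fi
  done
}
```

Calling `apply_picks picks.txt` on a branch-2 checkout would then replay the uncommented entries in order. Stopping at the first failure keeps the branch in a state where the conflict can be inspected or the pick aborted.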

szetszwo avatar Jul 24 '25 15:07 szetszwo

Appreciate it again.

ilixiaocui avatar Jul 31 '25 11:07 ilixiaocui

@ilixiaocui, sorry that I was not able to check the conflicts. I should be able to check them sometime next week. In the meantime, please see if you could find out which commits they depend on, so that the conflicts can be resolved.

If you have a tight deadline, please feel free to share it. I would try my best to accommodate it.

szetszwo avatar Aug 03 '25 16:08 szetszwo

Thanks again for your reply!

Could you please help cherry-pick these two sets of commits that are already causing issues?

RATIS-2140 related 2e7cb45

RATIS-2208 related 2c4e354 cf893f6 337df17 17ca6f4 5d3476f

The other issues haven’t been directly encountered in our production environment. There’s no urgency on timing—one or two weeks is completely fine.

ilixiaocui avatar Aug 04 '25 06:08 ilixiaocui

RATIS-2208 related https://github.com/apache/ratis/commit/2c4e354f133a44b971837ea33b5f89d62302cb63 https://github.com/apache/ratis/commit/cf893f64906df82908fcc43aed2d575e52f7a174 https://github.com/apache/ratis/commit/337df17c7ea27fbaac9f5f82f8557dc815830d7c https://github.com/apache/ratis/commit/17ca6f41d0a577de2ecb452368c1a38b0c63d8b7 https://github.com/apache/ratis/commit/5d3476f27650c13e94d6bbe5ccbfbc7ca4712eea

@ilixiaocui, the first and the last two commits have serious conflicts. We need to find out which commits they depend on.

# git cherry-pick 2c4e354f133a44b971837ea33b5f89d62302cb63 #RATIS-2232. Improve log for debugging on RaftLog / TransactionManager (#1203)
git cherry-pick 337df17c7ea27fbaac9f5f82f8557dc815830d7c #RATIS-2234. Remove lock race between heartbeat and append log channels (#1205)

git cherry-pick cf893f64906df82908fcc43aed2d575e52f7a174 #RATIS-2233. make NOPROGRESS timeout configurable (#1204)
# git cherry-pick 17ca6f41d0a577de2ecb452368c1a38b0c63d8b7 #RATIS-2235. Allow only one thread to perform appendLog  (#1206)
# git cherry-pick 5d3476f27650c13e94d6bbe5ccbfbc7ca4712eea #RATIS-2242. change consistency criteria of heartbeat during appendLog (#1215)
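One possible way to look for the commits a conflicting pick depends on is to list the earlier commits that touched the same files and are not yet on the target branch. This is a rough sketch rather than an established Ratis procedure: the function name `candidate_deps` is hypothetical, it does not handle paths containing spaces, and commits that were already cherry-picked still show up because their hashes differ on the target branch.

```shell
#!/bin/sh
# List candidate prerequisite commits for a cherry-pick that conflicts.
# $1: target branch (branch-2 in this thread)
# $2: hash of the conflicting commit on the source branch
candidate_deps() {
  target=$1
  commit=$2
  # Files changed by the conflicting commit, one path per line.
  files=$(git diff-tree --no-commit-id --name-only -r "$commit")
  # Commits reachable from the parent of $commit but not from $target
  # that touch any of those files, oldest first.
  # Word splitting on $files is intentional here.
  git log --oneline --reverse "$target..$commit^" -- $files
}
```

For example, `candidate_deps branch-2 17ca6f41d0a577de2ecb452368c1a38b0c63d8b7` would print the candidates for the RATIS-2235 commit. The output is only a starting point, since not every commit touching the same files is a real dependency.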

szetszwo avatar Aug 04 '25 15:08 szetszwo

Thanks for all the efforts, @szetszwo! I can take a look at the conflicts this weekend. @ilixiaocui, are you still having the backport issues?

SzyWilliam avatar Nov 12 '25 14:11 SzyWilliam