[CELEBORN-1400] Bump Ratis version from 2.5.1 to 3.0.1

Open SteNicholas opened this issue 1 year ago • 7 comments

What changes were proposed in this pull request?

Bump Ratis version from 2.5.1 to 3.0.1. Address incompatible changes:

  • RATIS-589. Eliminate buffer copying in SegmentedRaftLogOutputStream. (https://github.com/apache/ratis/pull/964)
  • RATIS-1677. Do not auto format RaftStorage in RECOVER. (https://github.com/apache/ratis/pull/718)
  • RATIS-1710. Refactor metrics api and implementation to separated modules. (https://github.com/apache/ratis/pull/749)

Why are the changes needed?

Bump Ratis version from 2.5.1 to 3.0.1. Ratis has released v3.0.0 and v3.0.1; see the release notes for 3.0.0 and 3.0.1. The 3.0.x line adds new features such as pluggable metrics and lease read, along with improvements and bug fixes, including:

  • 3.0.0: Change list of Ratis 3.0.0. In total, there are roughly 100 commits on top of 2.5.1, including:

    • Incompatible Changes
      • RaftStorage Auto-Format
        • RATIS-1677. Do not auto format RaftStorage in RECOVER. (https://github.com/apache/ratis/pull/718)
        • RATIS-1694. Fix the compatibility issue of RATIS-1677. (https://github.com/apache/ratis/pull/731)
        • RATIS-1871. Auto format RaftStorage when there is only one directory configured. (https://github.com/apache/ratis/pull/903)
      • Pluggable Ratis-Metrics (RATIS-1688)
        • RATIS-1689. Remove the use of the thirdparty Gauge. (https://github.com/apache/ratis/pull/728)
        • RATIS-1692. Remove the use of the thirdparty Counter. (https://github.com/apache/ratis/pull/732)
        • RATIS-1693. Remove the use of the thirdparty Timer. (https://github.com/apache/ratis/pull/734)
        • RATIS-1703. Move MetricsReporting and JvmMetrics to impl. (https://github.com/apache/ratis/pull/741)
        • RATIS-1704. Fix SuppressWarnings("VisibilityModifier") in RatisMetrics. (https://github.com/apache/ratis/pull/742)
        • RATIS-1710. Refactor metrics api and implementation to separated modules. (https://github.com/apache/ratis/pull/749)
        • RATIS-1712. Add a dropwizard 3 implementation of ratis-metrics-api. (https://github.com/apache/ratis/pull/751)
        • RATIS-1391. Update library dropwizard.metrics version to 4.x (https://github.com/apache/ratis/pull/632)
        • RATIS-1601. Use the shaded dropwizard metrics and remove the dependency (https://github.com/apache/ratis/pull/671)
      • Streaming Protocol Change
        • RATIS-1569. Move the asyncRpcApi.sendForward(..) call to the client side. (https://github.com/apache/ratis/pull/635)
    • New Features
      • Leader Lease (RATIS-1864)
        • RATIS-1865. Add leader lease bound ratio configuration (https://github.com/apache/ratis/pull/897)
        • RATIS-1866. Maintain leader lease after AppendEntries (https://github.com/apache/ratis/pull/898)
        • RATIS-1894. Implement ReadOnly based on leader lease (https://github.com/apache/ratis/pull/925)
        • RATIS-1882. Support read-after-write consistency (https://github.com/apache/ratis/pull/913)
      • StateMachine API
        • RATIS-1874. Add notifyLeaderReady function in IStateMachine (https://github.com/apache/ratis/pull/906)
        • RATIS-1897. Make TransactionContext available in DataApi.write(..). (https://github.com/apache/ratis/pull/930)
      • New Configuration Properties
        • RATIS-1862. Add the parameter whether to take Snapshot when stopping to adapt to different services (https://github.com/apache/ratis/pull/896)
        • RATIS-1930. Add a conf for enable/disable majority-add. (https://github.com/apache/ratis/pull/961)
        • RATIS-1918. Introduces parameters that separately control the shutdown of RaftServerProxy by JVMPauseMonitor. (https://github.com/apache/ratis/pull/950)
        • RATIS-1636. Support re-config ratis properties (https://github.com/apache/ratis/pull/800)
        • RATIS-1860. Add ratis-shell cmd to generate a new raft-meta.conf. (https://github.com/apache/ratis/pull/901)
    • Improvements & Bug Fixes
      • Netty
        • RATIS-1898. Netty should use EpollEventLoopGroup by default (https://github.com/apache/ratis/pull/931)
        • RATIS-1899. Use EpollEventLoopGroup for Netty Proxies (https://github.com/apache/ratis/pull/932)
        • RATIS-1921. Shared worker group in WorkerGroupGetter should be closed. (https://github.com/apache/ratis/pull/955)
        • RATIS-1923. Netty: atomic operations require side-effect-free functions. (https://github.com/apache/ratis/pull/956)
      • RaftServer
        • RATIS-1924. Increase the default of raft.server.log.segment.size.max. (https://github.com/apache/ratis/pull/957)
        • RATIS-1892. Unify the lifetime of the RaftServerProxy thread pool (https://github.com/apache/ratis/pull/923)
        • RATIS-1889. NoSuchMethodError: RaftServerMetricsImpl.addNumPendingRequestsGauge (https://github.com/apache/ratis/pull/922)
        • RATIS-761. Handle writeStateMachineData failure in leader. (https://github.com/apache/ratis/pull/927)
        • RATIS-1902. The snapshot index is set incorrectly in InstallSnapshotReplyProto. (https://github.com/apache/ratis/pull/933)
        • RATIS-1912. Fix infinity election when perform membership change. (https://github.com/apache/ratis/pull/954)
        • RATIS-1858. Follower keeps logging first election timeout. (https://github.com/apache/ratis/pull/894)
  • 3.0.1: This is a bugfix release. See the changes between the 3.0.0 and 3.0.1 releases.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Cluster manual test.

SteNicholas avatar Apr 24 '24 12:04 SteNicholas

Ping @FMX, @szetszwo, @pan3793, @cxzl25.

SteNicholas avatar Apr 24 '24 19:04 SteNicholas

@szetszwo, thanks for your review. cc @FMX.

SteNicholas avatar Apr 25 '24 05:04 SteNicholas

Does it allow rolling upgrades? Should we upgrade the followers or the leader first, or something else?

pan3793 avatar Apr 25 '24 05:04 pan3793

@pan3793, the rolling upgrade strategy should be:

  1. Try to upgrade a non-leader master node first.
  2. Then upgrade the worker nodes in stages: a single worker node first, then multiple worker nodes, then all remaining nodes.
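
A minimal sketch of the ordering above, in Python, for anyone who wants to script the plan. It is illustrative only: the `Node` type, the `plan_rolling_upgrade` function, and the `pilot_batch` size are hypothetical names, not part of Celeborn or Ratis. It also assumes masters are upgraded one at a time with the leader last, which goes slightly beyond what the two steps state.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Node:
    """Hypothetical description of a cluster node."""
    name: str
    is_leader: bool = False  # only meaningful for master nodes


def plan_rolling_upgrade(masters: List[Node], workers: List[Node],
                         pilot_batch: int = 3) -> List[List[str]]:
    """Return upgrade stages; each stage lists the nodes upgraded together."""
    stages: List[List[str]] = []

    # Step 1: masters one at a time, non-leader masters first.
    # Upgrading the leader last is an assumption of this sketch.
    followers = [m for m in masters if not m.is_leader]
    leaders = [m for m in masters if m.is_leader]
    for master in followers + leaders:
        stages.append([master.name])

    # Step 2: workers in three waves: one node, a small batch, the rest.
    names = [w.name for w in workers]
    if names:
        stages.append(names[:1])
    if len(names) > 1:
        stages.append(names[1:1 + pilot_batch])
    if len(names) > 1 + pilot_batch:
        stages.append(names[1 + pilot_batch:])
    return stages
```

Calling `plan_rolling_upgrade` with three masters (one of them the leader) and a handful of workers yields one stage per master followed by up to three worker stages. Checking cluster health between stages is left to the operator.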

cc @RexXiong.

SteNicholas avatar Apr 25 '24 05:04 SteNicholas

@SteNicholas Hi, I wonder if this PR affects the rolling upgrade process. Can Ratis 3.0.1 servers communicate with Ratis 2.5.1 servers? Can a Ratis 3.0.1 server recover from metadata generated by a Ratis 2.5.1 server?

FMX avatar Apr 25 '24 08:04 FMX

@FMX, I haven't tested the rolling upgrade process in a cluster. I would like to try a rolling upgrade to validate the questions above.

SteNicholas avatar Apr 25 '24 08:04 SteNicholas

We hit the same issue as https://issues.apache.org/jira/browse/RATIS-1860, so it would be better to upgrade Ratis.

AngersZhuuuu avatar May 10 '24 07:05 AngersZhuuuu

@FMX, @pan3793, @AngersZhuuuu, I have tested the rolling upgrade process in a test environment as follows:

[image]

The result of the rolling upgrade is that there is no compatibility problem in the communication between the 2.5.1 Ratis server and the 3.0.1 Ratis server.

[image]

Meanwhile, I have successfully run a test application against the masters in the above state:

[image]

PTAL.

SteNicholas avatar May 28 '24 11:05 SteNicholas

Ping @pan3793, @FMX. PTAL.

SteNicholas avatar May 30 '24 06:05 SteNicholas

Merged to main (v0.5.0).

RexXiong avatar May 30 '24 09:05 RexXiong