[Feature]: JDBC-based High-Availability Service for AMS

Open LiangDai-Mars opened this issue 3 weeks ago • 0 comments

Description

This proposal introduces a new high-availability (HA) service for Amoro Management Service (AMS) based on JDBC. This allows AMS to achieve primary–standby leader election using a shared relational database, providing an alternative to the existing ZooKeeper-based HA mechanism. The core characteristics of this feature are:

  • Consistency: Guarantees a single active leader at any time through optimistic concurrency control (row-versioning).
  • Recoverability: Ensures automatic failover if the active leader fails. Follower nodes can acquire the lease after a configurable time-to-live (TTL) expires.
  • Uniqueness: A unique node identifier ensures that each AMS instance is distinct within the cluster.

This feature is introduced in PR #3997.
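
As a rough illustration of how row-versioning and a TTL can combine to give the consistency and recoverability properties listed above, the following Java sketch models the lease row in memory. It is not the code from the PR: the class names `LeaseRow`, `LeaseTable`, and `LeaseDemo`, the `holder` field, and the method signatures are hypothetical, and a real implementation would presumably issue conditional SQL UPDATE statements over JDBC where this sketch uses `synchronized` methods.

```java
/** In-memory stand-in for the single lease row (all names here are hypothetical). */
final class LeaseRow {
    String holder;       // node id of the current leader, or null if nobody holds the lease
    long version;        // incremented on every successful write (row-versioning)
    long leaseExpireTs;  // epoch milliseconds after which the lease may be taken over
}

final class LeaseTable {
    private final LeaseRow row = new LeaseRow();

    /** Models a conditional update that succeeds only if the lease is free or expired. */
    synchronized boolean tryAcquire(String nodeId, long nowMs, long ttlMs) {
        boolean free = row.holder == null || row.leaseExpireTs <= nowMs;
        if (!free) {
            return false;
        }
        row.holder = nodeId;
        row.version++;
        row.leaseExpireTs = nowMs + ttlMs;
        return true;
    }

    /** Models an optimistic-locking update: holder and version must both still match. */
    synchronized boolean renew(String nodeId, long expectedVersion, long nowMs, long ttlMs) {
        if (!nodeId.equals(row.holder) || row.version != expectedVersion) {
            return false; // someone else wrote the row in the meantime: caller should step down
        }
        row.version++;
        row.leaseExpireTs = nowMs + ttlMs;
        return true;
    }

    synchronized long version() { return row.version; }

    synchronized String holder() { return row.holder; }
}

final class LeaseDemo {
    public static void main(String[] args) {
        LeaseTable table = new LeaseTable();
        long ttl = 10_000L;
        System.out.println(table.tryAcquire("node-a", 0L, ttl));       // true: lease was free
        System.out.println(table.tryAcquire("node-b", 5_000L, ttl));   // false: lease still valid
        System.out.println(table.renew("node-a", 1L, 5_000L, ttl));    // true: version matched
        System.out.println(table.tryAcquire("node-b", 15_000L, ttl));  // true: lease has expired
        System.out.println(table.renew("node-a", 2L, 16_000L, ttl));   // false: node-a lost the lease
    }
}
```

In this sketch a failed `renew` is the signal for the old leader to demote itself, because the version check fails as soon as any other node has written the row.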

Use case/motivation

The existing HA implementation for AMS relies on ZooKeeper. While effective, this introduces an external dependency that may not be desirable in all deployment environments. For users who already operate a relational database (like MySQL or PostgreSQL) for AMS metadata, leveraging the same database for HA simplifies the architecture, reduces operational overhead, and lowers the total cost of ownership. By providing a JDBC-based HA option, Amoro offers greater deployment flexibility, allowing users to build a fully self-contained HA cluster without needing a separate ZooKeeper ensemble.

Describe the solution

The solution introduces a new HighAvailabilityContainer interface, with JdbcHighAvailabilityContainer as the core implementation for this feature. The leader election and failover logic is managed through a dedicated database table named ha_lease.

Functional Flow

The process operates as follows:

  • Startup: On startup, each AMS node attempts to become the leader.
  • Lease Acquisition: A node tries to acquire leadership by updating a designated row in the ha_lease table. This operation is conditional, succeeding only if the current lease has expired. The first node to succeed becomes the leader.
  • Heartbeat and Lease Renewal: The active leader periodically sends a heartbeat to the database to renew its lease. This is an optimistic-locking update that increments a version number and extends the lease_expire_ts (lease expiration timestamp).
  • Demotion and Failover: If the leader fails to renew its lease (e.g., due to a crash or network partition), its lease expires after the configured TTL. Other follower nodes, which are continuously attempting to acquire the lease, will eventually succeed, and one will be promoted to leader. The old leader, if it recovers, will be demoted to a follower.
  • Server Info Updates: Upon gaining leadership, the active node writes its connection information (IP address and ports for different services) to the ha_lease table, ensuring clients can discover the active AMS instance.

Compatibility

  • Existing HA: The new implementation is fully compatible with the existing HA framework. The choice between jdbc and zk is determined by the ha.type configuration property. If HA is disabled (ha.enable=false), a NoopHighAvailabilityContainer is used, and the system runs as a standalone node.
  • AMS Components: The HA logic is encapsulated within the amoro-ams module and integrates seamlessly with the Amoro service startup container.
  • Database Support: The ha_lease table schema is compatible with Derby, MySQL, and PostgreSQL. Initialization scripts are provided in the resources to create the required table and indexes.
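
The configuration switch described above (ha.enable and ha.type) could be sketched roughly as follows. This is only an illustration of the selection logic, not the actual AMS startup code: the `HaMode` enum, the `HaModeSelector` class, and the fallback to `zk` when ha.type is unset are assumptions made for the example. Only the property names, the values jdbc and zk, and the no-op behaviour when HA is disabled come from the proposal.

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical labels for the three HA behaviours named in the proposal. */
enum HaMode { NOOP, JDBC, ZK }

final class HaModeSelector {
    private HaModeSelector() {}

    /** Picks an HA mode from plain string properties. */
    static HaMode select(Map<String, String> conf) {
        // ha.enable=false means the node runs standalone with a no-op container.
        boolean enabled = Boolean.parseBoolean(conf.getOrDefault("ha.enable", "false"));
        if (!enabled) {
            return HaMode.NOOP;
        }
        // Assumption: fall back to the existing ZooKeeper mechanism if ha.type is not set.
        String type = conf.getOrDefault("ha.type", "zk").trim().toLowerCase(Locale.ROOT);
        switch (type) {
            case "jdbc":
                return HaMode.JDBC;
            case "zk":
                return HaMode.ZK;
            default:
                throw new IllegalArgumentException("Unsupported ha.type: " + type);
        }
    }

    public static void main(String[] args) {
        System.out.println(select(Map.of("ha.enable", "true", "ha.type", "jdbc"))); // JDBC
        System.out.println(select(Map.of("ha.enable", "false", "ha.type", "jdbc"))); // NOOP
    }
}
```

Rejecting unknown ha.type values at startup, as this sketch does, is one possible design choice; it avoids silently running without HA because of a typo in the configuration.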

Subtasks

No response

Related issues

No response

Are you willing to submit a PR?

  • [x] Yes I am willing to submit a PR!

Code of Conduct

LiangDai-Mars, Dec 08 '25 09:12