sink(ticdc): Add Support for Multiple MySQL-Compatible Downstream Addresses in TiCDC for High Availability
What problem does this PR solve?
Issue Number: close #11475
What is changed and how it works?
Description:
This PR introduces a new feature in TiCDC to enhance the fault tolerance of Changefeeds that target MySQL-compatible downstream databases. Previously, TiCDC could only connect to one MySQL-compatible server in a downstream cluster, and if that server became unavailable, it required manual intervention (i.e., recreating the Changefeed) to restore functionality.
New Feature:
Automatic Failover for Multiple Downstream Addresses:
- Background: In many deployment scenarios, a load balancer is used to provide high availability for database clusters. However, this introduces a single point of failure and additional complexity. To simplify deployment and enhance robustness, TiCDC now natively supports specifying multiple MySQL-compatible downstream addresses in the `--sink-uri` option during Changefeed creation or update.
- Functionality: When a downstream database server becomes unavailable, TiCDC automatically attempts to switch to another available server from the list of provided addresses, ensuring continuity of the Changefeed without user intervention. This adds a layer of redundancy, making TiCDC more resilient in environments where a load balancer is not feasible.
Key Additions and Changes:
- DBConnector Implementation:
  - Purpose: Manages the connection to MySQL-compatible databases and handles automatic reconnection and failover in case of a server failure. It works by rotating through a list of DSNs (Data Source Names) in a round-robin fashion to find an available database server.
  - Code Location: The `DBConnector` struct and methods have been added in the `pkg/sink/mysql` package. The core methods include:
    - `SwitchToAnAvailableDB`: Automatically tries to switch to another available database in the event of a failure.
    - `ConfigureDBWhenSwitch`: Allows custom configuration logic to be applied when switching to a new connection.
- Integration into Existing Components:
  - Replaced all instances of direct MySQL connection logic with the `DBConnector`. This includes:
    - DDL Sink
    - DML Sink
    - Observer
    - Syncpoint Store
  - By using `DBConnector`, these components now benefit from automatic failover, making TiCDC more resilient in the face of database outages.
- Unit Testing:
  - Thorough unit tests have been added for the `DBConnector` to ensure it handles reconnection and failover logic correctly. These tests can be found in `pkg/sink/mysql/mysql_connector_test.go`.
- Integration Testing:
  - Updated the TiCDC integration tests to verify that Changefeeds can handle multiple downstream addresses. The integration tests cover scenarios where downstream database servers become unavailable, and TiCDC successfully switches to another available server.
  - New Script: Added a script `start_downstream_tidb_instances` to start multiple TiDB instances for testing the failover functionality. The `test_prepare` file has been updated to register three downstream TiDB instance ports for these tests. While the integration tests support up to three TiDB instances by default, more can be added if needed by modifying the registered ports.
- Minor Changes to Existing Scripts:
  - The ports for downstream TiDB instances in the integration test scripts (`run.sh`) have been modified to accommodate the new multi-instance setup. These changes only affect port assignments and do not alter the test logic.
Testing and Validation:
- Unit Tests: Added detailed unit tests for `DBConnector`.
- Integration Tests: Tested `cdc cli changefeed create` and `cdc cli changefeed update` with multiple downstream addresses, and verified that TiCDC correctly switches between downstream instances when one fails, confirming that the automatic failover works as expected.
This enhancement greatly improves TiCDC’s reliability and ease of use, especially in complex deployment environments, by reducing dependency on external load balancers and ensuring smooth failover between multiple downstream MySQL-compatible databases.
Check List
Tests
- Unit test
- Integration test
Questions
Will it cause performance regression or break compatibility?
No
Do you need to update user documentation, design documentation or monitoring documentation?
Yes, the user documentation should be updated to reflect the new support for specifying multiple downstream addresses in the --sink-uri option, along with instructions on how to configure and use this feature. Additionally, any design documentation that explains the architecture of TiCDC's database connection management should be updated to include details on the new DBConnector and its failover capabilities. Monitoring documentation should also be updated to account for the behavior and health of multiple downstream connections, including potential alerts when failover occurs.
Release note
Support automatic failover across multiple MySQL-compatible downstream addresses in the `--sink-uri`, ensuring high availability and improved fault tolerance.
Codecov Report
Attention: Patch coverage is 66.02317% with 88 lines in your changes missing coverage. Please review.
Project coverage is 57.5027%. Comparing base (`6f697c4`) to head (`d804633`). Report is 155 commits behind head on master.
The project check has failed because the head coverage (57.5027%) is below the target coverage (60.0000%). You can increase the head coverage or adjust the target coverage.
Additional details and impacted files
| Components | Coverage Δ |
|---|---|
| cdc | 61.2813% <66.0231%> (+0.0998% ↑) |
| dm | 51.0354% <ø> (+0.0141% ↑) |
| engine | 63.3879% <ø> (ø) |

| Flag | Coverage Δ |
|---|---|
| unit | 57.5027% <66.0231%> (+0.0597% ↑) |
Flags with carried forward coverage won't be shown.
```diff
@@           Coverage Diff            @@
##           master    #11527   +/-  ##
========================================
+ Coverage   57.4429%  57.5027%  +0.0597%
========================================
  Files      851       852       +1
  Lines      126421    126580    +159
========================================
+ Hits       72620     72787     +167
+ Misses     48394     48363     -31
- Partials   5407      5430      +23
```