sink(ticdc): Add Support for Multiple MySQL-Compatible Downstream Addresses in TiCDC for High Availability

Open wlwilliamx opened this issue 1 year ago • 14 comments

What problem does this PR solve?

Issue Number: close #11475

What is changed and how it works?

Description:

This PR introduces a new feature in TiCDC to enhance the fault tolerance of Changefeeds that target MySQL-compatible downstream databases. Previously, TiCDC could connect to only one MySQL-compatible server in a downstream cluster; if that server became unavailable, restoring replication required manual intervention (i.e., recreating the Changefeed).

New Feature:

Automatic Failover for Multiple Downstream Addresses:

  • Background: Many deployments put a load balancer in front of a database cluster to provide high availability. However, the load balancer can itself become a single point of failure and adds complexity. To simplify deployment and enhance robustness, TiCDC now natively supports specifying multiple MySQL-compatible downstream addresses in the --sink-uri option during Changefeed creation or update.
  • Functionality: When a downstream database server becomes unavailable, TiCDC will automatically attempt to switch to another available server from the list of provided addresses, ensuring the continuity of the Changefeed without needing user intervention. This adds a layer of redundancy, making TiCDC more resilient in environments where load balancers may not be feasible.
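
As a rough illustration of how a list of downstream addresses could be expanded into one connection string per server, here is a minimal Go sketch. It assumes the addresses arrive as a comma-separated `host:port` list, which this description does not spell out, and `buildDSNs` is a hypothetical helper rather than part of TiCDC. The output uses the `user:password@tcp(host:port)/` DSN layout of go-sql-driver/mysql.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// buildDSNs expands a comma-separated "host:port" list into one DSN per
// address. The comma-separated layout is an assumption made for this sketch;
// the function name is hypothetical and does not come from TiCDC.
func buildDSNs(user, password, hostList string) ([]string, error) {
	var dsns []string
	for _, addr := range strings.Split(hostList, ",") {
		addr = strings.TrimSpace(addr)
		if addr == "" {
			continue // tolerate empty entries such as a trailing comma
		}
		// Reject entries that are not of the form host:port.
		if _, _, err := net.SplitHostPort(addr); err != nil {
			return nil, fmt.Errorf("invalid downstream address %q: %w", addr, err)
		}
		dsns = append(dsns, fmt.Sprintf("%s:%s@tcp(%s)/", user, password, addr))
	}
	if len(dsns) == 0 {
		return nil, fmt.Errorf("no downstream address given")
	}
	return dsns, nil
}

func main() {
	dsns, err := buildDSNs("root", "", "127.0.0.1:3306,127.0.0.1:3307")
	if err != nil {
		panic(err)
	}
	for _, dsn := range dsns {
		fmt.Println(dsn)
	}
}
```

Validating each address up front means a malformed entry is reported when the list is parsed rather than at the moment a failover is attempted.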

Key Additions and Changes:

  1. DBConnector Implementation:

    • Purpose: Manages the connection to MySQL-compatible databases and handles automatic reconnection and failover in case of a server failure. It works by rotating through a list of DSNs (Data Source Names) in a round-robin fashion to find an available database server.
    • Code Location: The DBConnector struct and methods have been added in the pkg/sink/mysql package. The core methods include:
      • SwitchToAnAvailableDB: Automatically tries to switch to another available database in the event of a failure.
      • ConfigureDBWhenSwitch: Allows custom configuration logic to be applied when switching to a new connection.
  2. Integration into Existing Components:

    • Replaced all instances of direct MySQL connection logic with the DBConnector. This includes:
      • DDL Sink
      • DML Sink
      • Observer
      • Syncpoint Store
    • By using DBConnector, these components now benefit from automatic failover, making TiCDC more resilient in the face of database outages.
  3. Unit Testing:

    • Thorough unit tests have been added for the DBConnector to ensure it handles reconnection and failover logic correctly. These tests can be found in pkg/sink/mysql/mysql_connector_test.go.
  4. Integration Testing:

    • Updated the TiCDC integration tests to verify that Changefeeds can handle multiple downstream addresses. The integration tests cover scenarios where downstream database servers become unavailable, and TiCDC successfully switches to another available server.
    • New Script: Added a script start_downstream_tidb_instances to start multiple TiDB instances for testing the failover functionality. The test_prepare file has been updated to register three downstream TiDB instance ports for these tests. While the integration tests now support up to three TiDB instances by default, more instances can be added if needed by modifying the registered ports.
  5. Minor Changes to Existing Scripts:

    • The ports for downstream TiDB instances in the integration test scripts (run.sh) have been modified to accommodate the new multi-instance setup. These changes are purely related to port assignments and do not alter the test logic.
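
The round-robin failover described for DBConnector can be sketched as follows. This is a simplified illustration under stated assumptions, not the code in pkg/sink/mysql: the `connector` type, `switchToAvailable`, `onSwitch`, and the `ping` callback are hypothetical names, and `ping` stands in for opening a real connection and pinging the server.

```go
package main

import (
	"errors"
	"fmt"
)

// connector is a hypothetical, simplified stand-in for DBConnector.
type connector struct {
	dsns    []string // candidate downstream DSNs
	current int      // index of the DSN currently in use
	// ping reports whether the server behind a DSN is reachable. In a real
	// sink this would open a connection and ping it (assumption).
	ping func(dsn string) error
	// onSwitch is an optional hook run when the active DSN changes,
	// loosely mirroring the idea behind ConfigureDBWhenSwitch.
	onSwitch func(dsn string)
}

// switchToAvailable probes the DSNs in round-robin order, starting with the
// current one, and makes the first reachable DSN the active one.
func (c *connector) switchToAvailable() (string, error) {
	for i := 0; i < len(c.dsns); i++ {
		idx := (c.current + i) % len(c.dsns)
		if err := c.ping(c.dsns[idx]); err != nil {
			continue // this server is down, try the next one
		}
		if idx != c.current && c.onSwitch != nil {
			c.onSwitch(c.dsns[idx])
		}
		c.current = idx
		return c.dsns[idx], nil
	}
	return "", errors.New("no downstream server is available")
}

func main() {
	// Simulate an outage of the first server.
	down := map[string]bool{"dsn-a": true}
	c := &connector{
		dsns: []string{"dsn-a", "dsn-b", "dsn-c"},
		ping: func(dsn string) error {
			if down[dsn] {
				return errors.New("unreachable: " + dsn)
			}
			return nil
		},
		onSwitch: func(dsn string) { fmt.Println("switched to", dsn) },
	}
	dsn, err := c.switchToAvailable()
	fmt.Println(dsn, err)
}
```

Starting the probe at the current index keeps a healthy connection in place, and the modulo wrap-around means servers earlier in the list are retried once they recover.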

Testing and Validation:

  • Unit Tests: Added detailed unit tests for DBConnector.
  • Integration Tests: Tested cdc cli changefeed create and cdc cli changefeed update with multiple downstream addresses. With multiple downstream TiDB instances running, verified that TiCDC switches to another instance when one fails, confirming that automatic failover works as expected.

This enhancement improves TiCDC's reliability and ease of use, especially in complex deployment environments, by reducing dependency on external load balancers and failing over automatically between multiple downstream MySQL-compatible databases.

Check List

Tests

  • Unit test
  • Integration test

Questions

Will it cause performance regression or break compatibility?

No

Do you need to update user documentation, design documentation or monitoring documentation?

Yes, the user documentation should be updated to reflect the new support for specifying multiple downstream addresses in the --sink-uri option, along with instructions on how to configure and use this feature. Additionally, any design documentation that explains the architecture of TiCDC's database connection management should be updated to include details on the new DBConnector and its failover capabilities. Monitoring documentation should also be updated to account for the behavior and health of multiple downstream connections, including potential alerts when failover occurs.

Release note

Support automatic failover across multiple MySQL-compatible downstream addresses in the `--sink-uri`, ensuring high availability and improved fault tolerance.

wlwilliamx avatar Aug 27 '24 09:08 wlwilliamx

Skipping CI for Draft Pull Request. If you want CI signal for your change, please convert it to an actual PR. You can still manually trigger a test run with /test all

ti-chi-bot[bot] avatar Aug 27 '24 09:08 ti-chi-bot[bot]

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

Once this PR has been reviewed and has the lgtm label, please assign liuzix for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

ti-chi-bot[bot] avatar Aug 27 '24 09:08 ti-chi-bot[bot]

/test verify
/test cdc-integration-kafka-test

wlwilliamx avatar Aug 28 '24 05:08 wlwilliamx

/test verify

wlwilliamx avatar Aug 28 '24 06:08 wlwilliamx

/test cdc-integration-kafka-test

wlwilliamx avatar Aug 28 '24 06:08 wlwilliamx

Codecov Report

Attention: Patch coverage is 66.02317% with 88 lines in your changes missing coverage. Please review.

Project coverage is 57.5027%. Comparing base (6f697c4) to head (d804633). Report is 155 commits behind head on master.

Your project check has failed because the head coverage (57.5027%) is below the target coverage (60.0000%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
Components   Coverage Δ
cdc          61.2813% <66.0231%> (+0.0998%) ↑
dm           51.0354% <ø> (+0.0141%) ↑
engine       63.3879% <ø> (ø)

Flag         Coverage Δ
unit         57.5027% <66.0231%> (+0.0597%) ↑

@@               Coverage Diff                @@
##             master     #11527        +/-   ##
================================================
+ Coverage   57.4429%   57.5027%   +0.0597%     
================================================
  Files           851        852         +1     
  Lines        126421     126580       +159     
================================================
+ Hits          72620      72787       +167     
+ Misses        48394      48363        -31     
- Partials       5407       5430        +23     

codecov[bot] avatar Aug 28 '24 06:08 codecov[bot]

/test verify

wlwilliamx avatar Aug 30 '24 08:08 wlwilliamx

/test cdc-integration-mysql-test
/test dm-integration-test

wlwilliamx avatar Aug 30 '24 08:08 wlwilliamx

/test cdc-integration-mysql-test
/test dm-integration-test

wlwilliamx avatar Aug 30 '24 08:08 wlwilliamx

/test cdc-integration-mysql-test

wlwilliamx avatar Aug 30 '24 08:08 wlwilliamx

/test cdc-integration-kafka-test

wlwilliamx avatar Aug 30 '24 10:08 wlwilliamx

/test cdc-integration-pulsar-test

wlwilliamx avatar Aug 30 '24 10:08 wlwilliamx

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

ti-chi-bot[bot] avatar Sep 06 '24 08:09 ti-chi-bot[bot]

@wlwilliamx: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: pull-syncdiff-integration-test
Commit: d80463347ffa21a724f8dd39e89893cbc0dd9d0b
Details: link
Required: true
Rerun command: /test pull-syncdiff-integration-test

ti-chi-bot[bot] avatar Apr 09 '25 10:04 ti-chi-bot[bot]