EBS BR support for TiDB deployed in multiple k8s clusters
Feature Request
Is your feature request related to a problem? Please describe:
TiDB Operator already supports deploying a TiDB cluster across multiple Kubernetes clusters, but AWS EBS snapshot-based Backup/Restore does not support such deployments yet.
Describe the feature you'd like:
AWS EBS snapshot-based BR needs to support TiDB clusters deployed across multiple Kubernetes clusters. We would like to use a Kubernetes federation solution, in which federation operators in the control plane coordinate with the tidb-operators in the data planes through the Kubernetes API server. See https://github.com/pingcap/tidb-operator/pull/5004 for details.
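A minimal sketch of this control-plane/data-plane split, assuming a single kubeconfig with one context per data-plane cluster; the function names, context names, and the health probe below are illustrative, not the actual br-federation API:

```go
package main

import (
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// buildDataPlaneClients loads one Kubernetes client per data-plane context
// from a single kubeconfig file (hypothetical layout; single-kubeconfig
// handling is tracked in pull/5186 below).
func buildDataPlaneClients(kubeconfig string, contexts []string) (map[string]kubernetes.Interface, error) {
	clients := make(map[string]kubernetes.Interface, len(contexts))
	for _, c := range contexts {
		cfg, err := clientcmd.NewNonInteractiveDeferredLoadingClientConfig(
			&clientcmd.ClientConfigLoadingRules{ExplicitPath: kubeconfig},
			&clientcmd.ConfigOverrides{CurrentContext: c},
		).ClientConfig()
		if err != nil {
			return nil, fmt.Errorf("load context %q: %w", c, err)
		}
		cs, err := kubernetes.NewForConfig(cfg)
		if err != nil {
			return nil, err
		}
		clients[c] = cs
	}
	return clients, nil
}

// fanOut probes every data-plane API server concurrently; a real federation
// controller would create and watch per-cluster Backup CRs here instead.
func fanOut(clients map[string]kubernetes.Interface) error {
	errCh := make(chan error, len(clients))
	for name, cs := range clients {
		go func(name string, cs kubernetes.Interface) {
			_, err := cs.Discovery().ServerVersion() // placeholder for real work
			if err != nil {
				err = fmt.Errorf("cluster %s: %w", name, err)
			}
			errCh <- err
		}(name, cs)
	}
	for range clients {
		if err := <-errCh; err != nil {
			return err
		}
	}
	return nil
}

func main() {
	clients, err := buildDataPlaneClients("/etc/br-federation/kubeconfig", []string{"dp-1", "dp-2", "dp-3"})
	if err == nil {
		err = fanOut(clients)
	}
	fmt.Println("fan-out result:", err)
}
```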
Describe alternatives you've considered:
Teachability, Documentation, Adoption, Migration Strategy:
- [X] Federation operators
- [x] set up the br-federation-manager (https://github.com/pingcap/tidb-operator/pull/4996)
- [x] federation backup operator (https://github.com/pingcap/tidb-operator/pull/5013)
- [X] federation restore operator (https://github.com/pingcap/tidb-operator/pull/5039)
- [X] federation backup schedule operator (https://github.com/pingcap/tidb-operator/pull/5036)
- [x] warm-up restored EBS volumes (https://github.com/pingcap/tidb-operator/pull/5229; see the warm-up sketch after this checklist)
- [x] change to use one kubeconfig file (https://github.com/pingcap/tidb-operator/pull/5186)
- [x] shrink window of GC and scheduling suspension (https://github.com/pingcap/tidb-operator/pull/5288)
- [x] add additional volumes to backup/restore/warmup pods (https://github.com/pingcap/tidb-operator/pull/5414)
- [x] serial execution of volume backup gc (https://github.com/pingcap/tidb-operator/pull/5452)
- [x] expose the VolumeBackup init pod TTL as a CRD attribute (https://github.com/pingcap/tidb-operator/pull/5490)
- [x] Add annotations for warmup pod (https://github.com/pingcap/tidb-operator/pull/5445)
- [x] backoff support for snapshots deletion (https://github.com/pingcap/tidb-operator/pull/5492)
- [x] Add sleep before shutdown backup container (https://github.com/pingcap/tidb-operator/pull/5454)
- [x] Add metrics for volume backup (https://github.com/pingcap/tidb-operator/pull/5545)
- [x] Add GC immune tag (https://github.com/pingcap/tidb-operator/pull/5571)
- [ ] Support multiple backup schedules (https://github.com/pingcap/tidb-operator/pull/5633)
- [x] Data plane operator refactoring
- [x] data plane backup process refactoring (https://github.com/pingcap/tidb-operator/pull/4999)
- [x] data plane restore process refactoring (https://github.com/pingcap/tidb-operator/pull/5010/)
- [x] Add operator tags for restored volume (https://github.com/pingcap/tidb-operator/pull/5104)
- [x] Take cluster manifest backup along with data (https://github.com/pingcap/tidb-operator/pull/5207)
- [x] Add incremental volume snapshot backup size (https://github.com/pingcap/tidb-operator/pull/5188)
- [x] Make snapshot size calculation optional (https://github.com/pingcap/tidb-operator/pull/5202)
- [x] Warmup checkpointing support (https://github.com/pingcap/tidb-operator/pull/5238)
- [X] Remove k8s cluster name from restore/warmup pod names (https://github.com/pingcap/tidb-operator/pull/5318)
- [X] add create pv permissions to operator (https://github.com/pingcap/tidb-operator/pull/5314)
- [X] Warmup strategy frame support (https://github.com/pingcap/tidb-operator/pull/5315)
- [x] FSR warmup operator support (https://github.com/pingcap/tidb-operator/pull/5338; see the FSR sketch after this checklist)
- [x] FSR warmup api quota control (https://github.com/pingcap/tidb/pull/48506)
- [x] FSR credit balance check (https://github.com/pingcap/tidb/pull/48627)
- [x] Remove k8s cluster name from backup/init pod names (https://github.com/pingcap/tidb-operator/pull/5418)
- [x] Support br-federation-manager managing resources in specific-namespace tidb clusters (https://github.com/pingcap/tidb-operator/pull/5410)
- [x] Support terminating sidecars in the BR job (https://github.com/pingcap/tidb-operator/pull/5431)
- [x] add resilience to single TiKV crash (https://github.com/pingcap/tidb-operator/pull/5585)
- [x] test WAL files while warming up (https://github.com/pingcap/tidb-operator/pull/5593)
- [x] Skip corruption check on last WAL (https://github.com/pingcap/tidb-operator/pull/5605)
- [x] Add sleep before shutdown backup container (https://github.com/pingcap/tidb-operator/pull/5606)
- [x] add pv name in restore EBS vol (https://github.com/pingcap/tidb-operator/pull/5615)
- [x] Make warmup failure/skip on corruption configurable (https://github.com/pingcap/tidb-operator/pull/5635)
- [x] Fail restore if warmup fails only during check-wal-only strategy (https://github.com/pingcap/tidb-operator/pull/5636)
- [X] BR command line refactoring
- [X] tidb br command support to keep safepoint and pause PD scheduling (https://github.com/pingcap/tidb/pull/43562)
- [x] tidb br command support to do volume snapshot only (https://github.com/pingcap/tidb/pull/43687)
- [x] Use AWS API CreateSnapshots instead of CreateSnapshot to create snapshots for multiple volumes (https://github.com/pingcap/tidb/pull/43591; see the CreateSnapshots sketch after this checklist)
- [x] Cross-AZ restore support (https://github.com/pingcap/tidb/pull/43962)
- [x] Tagging to snapshots and restored volumes (https://github.com/pingcap/tidb/pull/43933 and https://github.com/pingcap/tidb/pull/44381)
- [x] Add retry for aws service throttling (https://github.com/pingcap/tidb/pull/44328)
- [x] Add retry in data recovery phase (https://github.com/pingcap/tidb/pull/46094)
- [x] FSR warmup BR support (https://github.com/pingcap/tidb/pull/47272)
- [x] support encrypted EBS volumes in restore (https://github.com/pingcap/tidb/pull/48900)
- [x] Allow porting labels to restore pod (https://github.com/pingcap/tidb-operator/pull/5349)
- [x] skip wait apply phase of restore (https://github.com/pingcap/tidb/pull/50316)
- [x] tolerate temporarily unreachable TiKV when starting snapshot backup (https://github.com/pingcap/tidb/pull/49154)
- [x] establish connection to all stores before pausing admin (https://github.com/pingcap/tidb/pull/51449)
- [x] pause scheduler after all connections established (https://github.com/pingcap/tidb/pull/51823)
- [x] remove checking store number (https://github.com/pingcap/tidb/pull/51886)
- [x] retry indefinitely when connecting to a store (https://github.com/pingcap/tidb/pull/52177)
- [x] Serviceability
- [x] Report full backup size for all backup jobs including failed job (https://github.com/pingcap/tidb-operator/pull/5007)
- [X] Report flashback kv read/write metrics (https://github.com/tikv/tikv/pull/14792)
- [x] Record elapsed time for snapshot size calculation (https://github.com/pingcap/tidb-operator/pull/5263)
- [x] Design doc in GitHub (https://github.com/pingcap/tidb-operator/pull/5004)
- [X] release document update (https://github.com/pingcap/docs-tidb-operator/pull/2406/)
- [x] 1.5 release document (https://github.com/pingcap/docs-tidb-operator/pull/2446)
- [x] print goroutines and call stack when the init pod exits before preparation is done (https://github.com/pingcap/tidb/pull/51371)
- [x] expose track-and-verify-wals-in-manifest config (https://github.com/tikv/tikv/pull/16546)
- [x] Bug fixes (from 7/17 to now; pending issues: https://github.com/pingcap/tidb-operator/labels/area%2Febs-br)
- [x] pass region name for aws EBS volume operation (https://github.com/pingcap/tidb-operator/pull/5195)
- [x] when messages are dropped, snap_restore may fail (https://github.com/tikv/tikv/pull/15124, https://github.com/tikv/tikv/pull/15196)
- [x] snap_restore: cannot elect leader when start up (https://github.com/tikv/tikv/pull/15297)
- [x] resend recover_region while there are TiKV restarts (https://github.com/pingcap/tidb/pull/45361)
- [x] EBS backup cleans up successfully even when the backupmeta is lost (https://github.com/pingcap/tidb-operator/pull/5199)
- [x] When deleting a running VolumeBackup, the backup CRs are deleted, but metadata and EBS snapshots still remain (https://github.com/pingcap/tidb-operator/pull/5199)
- [x] EBS BR: backup time taken is wrong (https://github.com/pingcap/tidb-operator/issues/5212)
- [x] Add the restore name to the warmup job name; otherwise it affects the next restore task (https://github.com/pingcap/tidb-operator/pull/5229)
- [x] Avoid allocating multiple warmup pods to the same node (https://github.com/pingcap/tidb-operator/pull/5229)
- [x] Add retry logic for snapshot size calculation (https://github.com/pingcap/tidb-operator/pull/5232)
- [x] Add label support for IAM ROLE (https://github.com/pingcap/tidb-operator/pull/5241)
- [x] Don't retry pausing the PD scheduler indefinitely in the backup initial phase (https://github.com/pingcap/tidb/pull/46078)
- [x] Killed warmup pod is not auto restarted (https://github.com/pingcap/tidb-operator/pull/5229)
- [x] No complete time field in restore cr (https://github.com/pingcap/tidb-operator/pull/5248)
- [x] avoid holding the global mutex while syncing the Titan manifest (https://github.com/tikv/tikv/pull/15399)
- [X] add executable mode to file scanner warmup (https://github.com/pingcap/tidb-operator/pull/5258)
- [x] TiKV start fails with "delete blob file twice" error (https://github.com/tikv/tikv/pull/15470)
- [x] Performance jitter on restored 100T cluster (https://github.com/pingcap/tidb-operator/issues/5269)
- [x] file level warmup properly exited once SIGINT received (https://github.com/pingcap/tidb-operator/pull/5272)
- [x] data plane backup job could hang because the stderr pipe fills up (https://github.com/pingcap/tidb-operator/pull/5288)
- [x] Check for TiKV volume count mismatch (https://github.com/pingcap/tidb-operator/pull/5292)
- [x] Restore panics when running concurrently with Lightning local backend mode (https://github.com/pingcap/tidb/pull/47001, https://github.com/tikv/tikv/pull/15612)
- [x] deleting the restore data pod during restore causes restore progress to get stuck (https://github.com/tikv/tikv/pull/15685)
- [X] allow porting labels when initiating backup/restore member (https://github.com/pingcap/tidb-operator/pull/5241)
- [x] TiKV panics during restore (https://github.com/tikv/tikv/pull/15946, https://github.com/pingcap/tidb/pull/48439)
- [x] VolumeBackup failed due to resuming GC and scheduling (https://github.com/pingcap/tidb-operator/pull/5332)
- [x] Extend snapshot size calculation backoff (https://github.com/pingcap/tidb-operator/pull/5370)
- [x] Backup failed at GC keeper due to huge resolved_ts gap (https://github.com/tikv/tikv/pull/15937, https://github.com/tikv/tikv/pull/16044, https://github.com/tikv/tikv/pull/16081)
- [x] volume backup schedule fails if the backup schedule is activated after being paused for a long time (https://github.com/pingcap/tidb-operator/issues/5392)
- [x] fix backup getting stuck due to init pod creation being stuck (https://github.com/pingcap/tidb-operator/pull/5457)
- [x] PV creation failed since PVC with the same name exists (https://github.com/pingcap/tidb-operator/pull/5417)
- [x] make sure backup can be resumed even after being paused for a long time (https://github.com/pingcap/tidb-operator/pull/5464)
- [x] fix file creation error when the directory does not exist (https://github.com/pingcap/tidb-operator/pull/5469)
- [x] fix redundant pvc and pv in metadata (https://github.com/pingcap/tidb-operator/pull/5461)
- [x] fix other backup member init job check (https://github.com/pingcap/tidb-operator/pull/5479)
- [x] resume gc and pd schedule asap if data plane backup fails (https://github.com/pingcap/tidb-operator/pull/5491)
- [x] set volume backup failed if one data plane failed (https://github.com/pingcap/tidb-operator/pull/5500)
- [x] always save checkpoint to the warmup folder (https://github.com/pingcap/tidb-operator/pull/5507)
- [x] Fixing volume tagging (https://github.com/pingcap/tidb-operator/pull/5519)
- [x] Use CreationTimestamp of volumebackup for gc check (https://github.com/pingcap/tidb-operator/pull/5518)
- [x] Tolerate rolling restarts (https://github.com/pingcap/kvproto/pull/1204, https://github.com/tikv/tikv/pull/15946 and https://github.com/pingcap/tidb/pull/48439)
- [X] Fixing tagging to snapshots and restore volumes (https://github.com/pingcap/tidb-operator/pull/5525 and https://github.com/pingcap/tidb/pull/50548)
- [X] Possible tag number exceeds aws quota (https://github.com/pingcap/tidb/pull/50941)
- [X] Mitigate bad RTO due to raft log gc at upstream cluster (https://github.com/tikv/tikv/pull/16519)
- [X] Fail restore if any warmup job failed for volume-snapshot restores (https://github.com/pingcap/tidb-operator/pull/5578)
- [x] clean volumes when restore volume failed (https://github.com/pingcap/tidb-operator/pull/5639)
- [x] initialize status metric to zero (https://github.com/pingcap/tidb-operator/pull/5648)
- [x] make gRPC connections synced (https://github.com/pingcap/tidb/pull/52051)
- [x] fix getting stuck when terminating (https://github.com/pingcap/tidb/pull/52264)
- [x] fix adapt-env for snapshot backup getting stuck when an error is encountered (https://github.com/pingcap/tidb/pull/52607)
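A few sketches for the technical points referenced in the checklist above. First, the warm-up items: an EBS volume restored from a snapshot loads its blocks lazily from S3, so the first read of every block is slow until the volume is initialized, and warming up amounts to reading each block once ahead of time. A minimal Go sketch of that idea, with an assumed device path and chunk size; the operator's real warmup job adds checkpointing, warmup strategies, and WAL checks, as tracked above:

```go
package main

import (
	"fmt"
	"io"
	"os"
)

// warmUpDevice reads the whole block device sequentially so every block is
// fetched from S3 once, initializing the restored volume.
func warmUpDevice(devPath string) error {
	f, err := os.Open(devPath) // e.g. /dev/nvme1n1; read-only is enough
	if err != nil {
		return err
	}
	defer f.Close()

	buf := make([]byte, 4<<20) // 4 MiB sequential reads (illustrative size)
	var total int64
	for {
		n, err := f.Read(buf)
		total += int64(n)
		if err == io.EOF {
			break
		}
		if err != nil {
			return fmt.Errorf("read %s after %d bytes: %w", devPath, total, err)
		}
	}
	fmt.Printf("warmed up %s: %d bytes read\n", devPath, total)
	return nil
}

func main() {
	if err := warmUpDevice("/dev/nvme1n1"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Large sequential reads keep EBS throughput high; AWS documents the same initialization technique with dd or fio.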
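Second, the FSR items: with Fast Snapshot Restore enabled on a snapshot, volumes created from it are delivered fully initialized, which removes the need for the block-level warm-up above at the cost of per-AZ FSR quota and credits. A hedged aws-sdk-go sketch; the snapshot ID and AZ are placeholders, and quota control and credit checks are handled in the tracked PRs:

```go
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	sess := session.Must(session.NewSession(aws.NewConfig().WithRegion("us-west-2")))
	svc := ec2.New(sess)

	// Enable FSR for one snapshot in one AZ (placeholders).
	out, err := svc.EnableFastSnapshotRestores(&ec2.EnableFastSnapshotRestoresInput{
		AvailabilityZones: aws.StringSlice([]string{"us-west-2a"}),
		SourceSnapshotIds: aws.StringSlice([]string{"snap-0123456789abcdef0"}),
	})
	if err != nil {
		panic(err)
	}
	for _, s := range out.Successful {
		fmt.Println("FSR enabling:", aws.StringValue(s.SnapshotId),
			aws.StringValue(s.AvailabilityZone), aws.StringValue(s.State))
	}
	for _, f := range out.Unsuccessful {
		fmt.Println("FSR failed:", aws.StringValue(f.SnapshotId))
	}
}
```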
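Third, the CreateSnapshots item: the batch EC2 API captures all EBS volumes attached to an instance at a single point in time, which a TiKV store with multiple data volumes needs for crash consistency; per-volume CreateSnapshot calls cannot guarantee that. A sketch with aws-sdk-go, using a placeholder instance ID and tags:

```go
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	sess := session.Must(session.NewSession(aws.NewConfig().WithRegion("us-west-2")))
	svc := ec2.New(sess)

	// One call snapshots every volume attached to the instance at the same
	// point in time (instance ID and tags are placeholders).
	out, err := svc.CreateSnapshots(&ec2.CreateSnapshotsInput{
		InstanceSpecification: &ec2.InstanceSpecification{
			InstanceId: aws.String("i-0123456789abcdef0"),
		},
		Description: aws.String("tikv multi-volume snapshot (illustrative)"),
		TagSpecifications: []*ec2.TagSpecification{{
			ResourceType: aws.String(ec2.ResourceTypeSnapshot),
			Tags: []*ec2.Tag{{Key: aws.String("app"), Value: aws.String("tidb-ebs-br")}},
		}},
	})
	if err != nil {
		panic(err)
	}
	for _, s := range out.Snapshots {
		fmt.Println(aws.StringValue(s.SnapshotId), "<-", aws.StringValue(s.VolumeId))
	}
}
```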
/assign @WangLe1321 @BornChanger @YuJuncen @csuzhangxc
Please don't include internal Google Doc links; please upload the design doc to GitHub.
Sure.