
[Discuss] Relook at and segregate responsibilities between etcd-backup-restore and etcd-main containers

[Open] unmarshall opened this issue 2 years ago • 6 comments

How to categorize this issue?

/kind discussion

What would you like to be added

We have started some initial work to replace the etcd-custom-image (which today leverages a bootstrap.sh startup script) with a Golang application. Since the etcd main container will now be a Golang app (due to Issue#16), new possibilities arise. A brainstorming session was held on how responsibilities should be segregated between the main etcd container and backup-restore. This issue is a placeholder to discuss these ideas, invite new ones and challenge each of them until we are convinced to either keep the responsibility set of etcd-main and etcd-backup-restore as it exists today or change it.

Motivation (Why is this needed?)

Prior to this issue: A discussion was started to evaluate whether it would be beneficial to merge the etcd-backup-restore and etcd-main containers into a single container: etcd-backup-restore#Issue-557. The conclusion in that ticket was not to go that way and to retain two different containers. Please go through all the arguments presented by the participants.

In this issue the scope is limited to a subset of the functionality currently handled by the backup-sidecar container: etcd initialisation. At present, either via the bootstrap script or via etcd-wrapper, we do the following:

  1. Check the initialisation status. If it is New, trigger the initialisation by invoking an HTTP endpoint hosted by the backup-restore container.
  2. Periodically check the initialisation status (polling interval of 1s).
  3. If the status has been set to Successful by the backup-sidecar, the etcd-wrapper/bootstrap script attempts to start the etcd process (an embedded etcd) by again fetching the configuration from the backup-sidecar. A sketch of this polling loop is shown after this list.
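For illustration, a minimal sketch of this polling loop in Go, assuming the backup-restore sidecar serves its initialisation endpoints at `/initialization/status` and `/initialization/start` on `localhost:8080` (the scheme, port and exact status strings are assumptions here, not a reference to the actual API):

```go
package wrapper

import (
	"fmt"
	"io"
	"net/http"
	"strings"
	"time"
)

// baseURL is an assumption for illustration; the real scheme, host, port and
// paths come from the backup-restore server configuration.
const baseURL = "http://localhost:8080"

// waitForInitialization triggers initialisation when the status is "New" and
// then polls the status every second until it becomes "Successful" or "Failed".
func waitForInitialization() error {
	for {
		resp, err := http.Get(baseURL + "/initialization/status")
		if err != nil {
			return err
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()

		switch strings.TrimSpace(string(body)) {
		case "New":
			// Ask the sidecar to validate the DB (and restore it if needed).
			if _, err := http.Get(baseURL + "/initialization/start"); err != nil {
				return err
			}
		case "Successful":
			return nil // safe to fetch the etcd config and start etcd
		case "Failed":
			return fmt.Errorf("initialisation reported as failed by the sidecar")
		}
		time.Sleep(1 * time.Second) // the 1s polling interval mentioned above
	}
}
```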

The goal of this issue is to reach a consensus w.r.t. simplifying the initialisation process, which currently comprises 2 main steps:

  1. etcd DB validation
  2. In case the DB is found to be corrupt:
    1. In the single-node etcd cluster case, trigger a restoration by downloading the full + delta snapshots available in the backup bucket.
    2. In the multi-node etcd cluster case (already having more than 1 member), delete the DB and let the leader restore the DB for this etcd instance (it will be added as a learner and promoted to a voting member once it is in sync with the leader).

The proposal is to move (1) above completely into etcd-wrapper. Only when a restoration is required will etcd-wrapper communicate with the backup-restore sidecar. A rough sketch of such an in-wrapper validation check follows.
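A minimal sketch (in Go) of what such a check inside etcd-wrapper could look like, only verifying that the bbolt backend file opens and passes bbolt's consistency check; the actual validation done by backup-restore today also covers the WAL and other sanity checks, so treat this as purely illustrative:

```go
package wrapper

import (
	"fmt"
	"time"

	bolt "go.etcd.io/bbolt"
)

// validateDB opens the etcd backend (a bbolt file, typically
// <data-dir>/member/snap/db) read-only and runs bbolt's consistency check.
// Any error would be treated as "DB corrupt -> ask the sidecar to restore".
func validateDB(dbPath string) error {
	db, err := bolt.Open(dbPath, 0600, &bolt.Options{
		ReadOnly: true,
		Timeout:  5 * time.Second, // do not block forever on a held file lock
	})
	if err != nil {
		return fmt.Errorf("cannot open db: %w", err)
	}
	defer db.Close()

	return db.View(func(tx *bolt.Tx) error {
		var firstErr error
		for cerr := range tx.Check() { // drain all reported consistency errors
			if firstErr == nil {
				firstErr = fmt.Errorf("db corruption detected: %w", cerr)
			}
		}
		return firstErr
	})
}
```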

unmarshall avatar Dec 05 '22 10:12 unmarshall

Current set of responsibilities of the backup-sidecar container:

  1. Data validation operations - DB/WAL validation
  2. Follows the etcd leader and marks the backup-restore sidecar attached to the current etcd leader as the leading-sidecar. This is done to ensure that certain activities like taking snapshots, defragmentation etc. are only done by the leading-sidecar (see the sketch after this list).
  3. Disaster recovery operations:
    1. Takes delta and full snapshots and uploads them to the configured backup bucket, which has higher reliability guarantees.
    2. Single member restoration. This is now only applicable to single-node etcd clusters. For multi-node etcd clusters the leader will bring the new learner up to speed and subsequently promote it to a voting member.
    3. Snapshot compaction to reduce the recovery time. This is useful for single-member etcd cluster restoration and for full quorum loss in multi-node clusters.
  4. DB Maintenance operations: Defragmentation - optimises the size of the DB by releasing the fragmented space left after etcd key-space compaction. This should be done for every member and is currently controlled by the leading backup-restore sidecar container.
  5. Etcd cluster operations:
    1. Adding a member.
    2. Promoting a learner to a voting member.
    3. Removing a member.
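To make points 2 and 4 above more concrete, here is a minimal sketch (not the actual backup-restore implementation) of how a sidecar can determine whether its local etcd member is the current leader and how a defragmentation is issued, using `go.etcd.io/etcd/client/v3`; the endpoint values are illustrative:

```go
package sidecar

import (
	"context"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// isLeadingSidecar reports whether the local etcd member (reachable at
// localEndpoint, e.g. "https://etcd-main-local:2379", illustrative) is the
// current cluster leader, by comparing the responding member's ID with the
// leader ID from the maintenance Status call.
func isLeadingSidecar(ctx context.Context, cli *clientv3.Client, localEndpoint string) (bool, error) {
	status, err := cli.Status(ctx, localEndpoint)
	if err != nil {
		return false, err
	}
	return status.Leader == status.Header.MemberId, nil
}

// defragmentMember triggers an online defragmentation of a single member.
// In the current design only the leading sidecar would orchestrate this
// across all members, one member at a time.
func defragmentMember(ctx context.Context, cli *clientv3.Client, endpoint string) error {
	ctx, cancel := context.WithTimeout(ctx, 8*time.Minute)
	defer cancel()
	_, err := cli.Defragment(ctx, endpoint)
	return err
}
```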

The initialisation flow of an etcd pod is as follows:

  1. The backup-sidecar container starts up via the server command. This starts an HTTPS server which serves endpoints (some of which are only used internally by the etcd container).
  2. The etcd container uses a custom image which starts up a bootstrap script that does the following:
    1. Calls the initialisation endpoint exposed by the backup-restore container. This performs the DB/volume validation and, if the DB is found to be corrupt, can trigger a restoration (typically done for a single-member etcd cluster).
    2. When the initialisation succeeds, the bootstrap script invokes another endpoint exposed by backup-restore to get the configuration required to start the etcd process.
    3. The script then starts the etcd process and records the return code.

The following can be re-looked at:

  • Doing the DB validation and initialisation completely within the etcd container, which will be a Golang app that starts an embedded etcd. This removes the need for to-and-fro communication between the backup-restore and etcd containers during startup.
  • The backup-restore sidecar will continue to do snapshotting (which includes uploading snapshots to and downloading them from the backup bucket), compaction and defragmentation.

unmarshall avatar Dec 13 '22 06:12 unmarshall

It was also suggested by @shreyas-s-rao that we relook at making the backup sidecar more generic by providing support for pluggable DBs. There are alternatives to etcd - SQLite (used by k3s), RocksDB, Badger etc. - which can be looked at.

The same point was also discussed separately by others (including @vasu1124). We need to study the use case for this in detail.

unmarshall avatar Dec 13 '22 06:12 unmarshall

MOM (23rd Jan) Participants: Shreyas, Ishan, Abhishek, Aaron, Sesha, Madhav

A proposal was floated to have the following division of responsibilities between the etcd main container and the backup-restore container:

etcd container:

  • Takes over the initial validation that is done today by the backup-restore sidecar container. The argument is that the DB is managed by etcd and therefore its validation should also be the responsibility of etcd, not of the sidecar.
  • In case of a single-node etcd, restoration involves downloading the full and delta snapshots from the backup bucket. For a multi-node cluster, restoration only means deleting the existing DB and starting the member as a learner; the etcd leader is then responsible for bringing the learner up to date, after which it is promoted to a voting member. The etcd container will determine that a restoration is required and will request the backup-restore sidecar to fetch all the snapshots.
  • Today, for single-node restoration, backup-restore has to start an embedded etcd, apply the full + delta snapshots and then bring it down. This requires additional resources to be made available to backup-restore. @shreyas-s-rao made the point that this could instead be done by the etcd container, as it already has the resources required to do so. It would start an embedded etcd on a different port, apply the full + delta snapshots, and once done bring down that embedded etcd and start the main embedded etcd on the exposed port (see the sketch after this list).
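A minimal sketch of the "embedded etcd on a different port" idea using `go.etcd.io/etcd/server/v3/embed` (field names as in the v3.5 embed package; ports and directories are illustrative and the actual snapshot application is elided):

```go
package wrapper

import (
	"fmt"
	"net/url"
	"time"

	"go.etcd.io/etcd/server/v3/embed"
)

// startRestorationEtcd starts a throwaway embedded etcd on local-only ports so
// that the full + delta snapshots can be applied to it, before the main
// embedded etcd is started on the regular, exposed client port.
func startRestorationEtcd(dataDir string) (*embed.Etcd, error) {
	cfg := embed.NewConfig()
	cfg.Dir = dataDir

	// Illustrative local-only ports, distinct from the exposed 2379/2380.
	clientURL, _ := url.Parse("http://127.0.0.1:23790")
	peerURL, _ := url.Parse("http://127.0.0.1:23800")
	cfg.LCUrls = []url.URL{*clientURL} // listen client URLs
	cfg.ACUrls = []url.URL{*clientURL} // advertise client URLs
	cfg.LPUrls = []url.URL{*peerURL}   // listen peer URLs
	cfg.APUrls = []url.URL{*peerURL}   // advertise peer URLs
	cfg.InitialCluster = cfg.InitialClusterFromName(cfg.Name)

	e, err := embed.StartEtcd(cfg)
	if err != nil {
		return nil, err
	}
	select {
	case <-e.Server.ReadyNotify():
		// Apply the full + delta snapshots against clientURL here, then call
		// e.Close() and start the main embedded etcd on the exposed port.
		return e, nil
	case <-time.After(time.Minute):
		e.Close()
		return nil, fmt.Errorf("embedded etcd did not become ready in time")
	}
}
```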

backup-restore container: The following maintenance and housekeeping activities will be the responsibility of this container.

  • Triggering periodic and on-demand snapshots
  • Triggering periodic and threshold-based defragmentation (in another discussion it was also proposed that defragmentation could move to druid)
  • When we introduce EtcdMemberState, this container will be responsible for updating this resource regularly.
  • Download and upload snapshots (delta + full) to the configured backup bucket.
    • Compress and decompress snapshots
    • In future it could also do encryption/decryption of snapshots (see issue#83)
  • Gather metrics for etcd and for the other maintenance activities it is responsible for, and provide an endpoint for scraping them (today via Prometheus). A sketch follows this list.
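A minimal sketch of such a scrape endpoint using the Prometheus Go client (`github.com/prometheus/client_golang`); the metric name and port are made up for illustration:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// snapshotsTaken is an illustrative counter for one of the maintenance
// activities the sidecar is responsible for; real metric names differ.
var snapshotsTaken = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "example_snapshots_taken_total",
		Help: "Number of snapshots taken, partitioned by kind.",
	},
	[]string{"kind"}, // "full" or "delta"
)

func main() {
	snapshotsTaken.WithLabelValues("full").Inc()

	// Expose every registered metric for Prometheus to scrape.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```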

Open points

  • Currently backup-restore gets the etcd configuration and mutates it before it is provided to the main etcd container. This needs to be re-looked at, as ideally the config should be directly mounted and made available to the etcd container.
  • Currently backup-restore determines whether the cluster is a single-member or a multi-member cluster. This could instead be determined by etcd, as this information is only used in the etcd container to decide whether it should start a new cluster or join an existing one (a sketch follows this list).
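For the second open point, a minimal sketch of how the etcd container itself could derive this, assuming it parses the standard etcd `initial-cluster` string (comma-separated name=peerURL pairs); where that string comes from in practice (flag, env var or config file) is left open:

```go
package main

import (
	"fmt"
	"strings"
)

// memberCountFromInitialCluster counts the members listed in an etcd
// "initial-cluster" string such as
// "etcd-0=https://etcd-0:2380,etcd-1=https://etcd-1:2380,etcd-2=https://etcd-2:2380".
func memberCountFromInitialCluster(initialCluster string) int {
	count := 0
	for _, entry := range strings.Split(initialCluster, ",") {
		if strings.Contains(entry, "=") {
			count++
		}
	}
	return count
}

func main() {
	ic := "etcd-0=https://etcd-0:2380" // illustrative single-member cluster
	if memberCountFromInitialCluster(ic) > 1 {
		fmt.Println("multi-member: join the existing cluster (as a learner if restoring)")
	} else {
		fmt.Println("single-member: start a new cluster (restore from snapshots if needed)")
	}
}
```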

NOTE: All participants should use this issue to raise questions/objections/improvements and be open to any discussion.

unmarshall avatar Jan 27 '23 04:01 unmarshall

Since restoration is being moved to the etcd container, the member addition logic (member remove, member add as learner, and member promote) will be moved into the etcd container, and so etcd should also handle the scale-up case. With the member addition logic now in the etcd container, the coupling between etcd and the sidecar becomes even tighter than it is right now.

aaronfern avatar Jan 27 '23 08:01 aaronfern

Member restoration in the multi-node case requires these member management operations, where we need to remove the member from the cluster member list, add it back as a learner, and then promote it to a full voting member. If these operations are part of the sidecar, then it will require much more coordination between the two containers.

Also, IMO these member management tasks directly affect the membership of the etcd cluster and should be part of etcd, not the sidecar. A minimal sketch of these operations follows.
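For reference, a minimal sketch of the three operations mentioned above using `go.etcd.io/etcd/client/v3` (how the old member ID and peer URL are obtained, and the retry handling around promotion, are left out):

```go
package wrapper

import (
	"context"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// replaceMemberAsLearner removes a (presumably broken) member from the cluster
// member list, adds it back as a non-voting learner and returns the new member
// ID, so it can be promoted once it has caught up with the leader.
func replaceMemberAsLearner(ctx context.Context, cli *clientv3.Client, oldID uint64, peerURL string) (uint64, error) {
	// 1. Remove the member from the cluster member list.
	if _, err := cli.MemberRemove(ctx, oldID); err != nil {
		return 0, err
	}
	// 2. Add it back as a learner.
	resp, err := cli.MemberAddAsLearner(ctx, []string{peerURL})
	if err != nil {
		return 0, err
	}
	return resp.Member.ID, nil
}

// promoteLearner promotes the learner to a full voting member. etcd rejects the
// promotion until the learner's log is in sync with the leader, so callers
// typically retry this until it succeeds.
func promoteLearner(ctx context.Context, cli *clientv3.Client, learnerID uint64) error {
	_, err := cli.MemberPromote(ctx, learnerID)
	return err
}
```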

aaronfern avatar Jan 27 '23 09:01 aaronfern

MOM (16-March) We had a discussion on the proposal that was originally floated on 23rd Jan 2023.

Takeaways:

  • Compaction also requires restoration. Therefore moving restoration functionality to etcd-wrapper would not be optimal.
  • Only moving the DB validation to etcd-wrapper was accepted in principle. @ishan16696 wanted to relook at its impact on multi-member clusters (tolerance = node | zone).

The following disadvantages of the current initialisation approach were listed:

  1. Initialisation (e.g. DB validation) of etcd semantically belongs in the etcd-wrapper.
  2. A sidecar should ideally never participate in the initialisation of the main container within the pod, since this makes the availability/health of the sidecar mandatory for the main container to start.
  3. The current complexity of the code in the backup-restore initialisation flow is high, leading to patches (mostly quick fixes) being made. A re-write of the initialisation flow is therefore required anyway (this point solely covers the effort perspective of making any significant changes to the backup-restore code base).
  4. Currently it is simply not possible to use etcd-custom-image, or the newly written etcd-wrapper, without the backup-restore sidecar, because the main etcd container depends on the backup-sidecar to do the initialisation (DB validation and potential restoration).

unmarshall avatar Mar 16 '23 08:03 unmarshall