cluster-api-provider-aws

Review and validate failure domain support for placement groups, local and wavelength availability zones

Open · randomvariable opened this issue 5 years ago · 13 comments

/kind feature

Describe the solution you'd like

CAPA's failure domain support currently takes only Availability Zones into consideration.

In the CAPI model, failure domains are inherently a controller-provided property, and users have little flexibility to define their own.
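For reference, a paraphrase of the CAPI contract (from sigs.k8s.io/cluster-api's v1alpha3 types as of this writing; field details approximate): providers publish a map of failure domains on the infrastructure cluster's status, and core CAPI spreads Machines across the map keys.

```go
// Paraphrased from sigs.k8s.io/cluster-api api/v1alpha3 (approximate).

// FailureDomains is a map keyed by the failure domain name, e.g. "us-east-1a".
type FailureDomains map[string]FailureDomainSpec

// FailureDomainSpec carries what CAPI knows about a single domain.
type FailureDomainSpec struct {
	// ControlPlane marks the domain as suitable for control plane machines.
	ControlPlane bool `json:"controlPlane"`

	// Attributes is a free-form map a provider may use; it is the only
	// extension point, which is why the model feels controller-provided
	// rather than user-defined.
	Attributes map[string]string `json:"attributes,omitempty"`
}
```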

In AWS, as of August 2020, failure domains include the following dimensions:

  • Regions
  • Availability Zones
  • Placement Groups
    • Cluster: Machine co-location for HPC workloads
    • Partition: Anti-affinity across logical partitions (i.e. racks inside an AZ)
    • Spread: Strict placement of small groups of instances across distinct hardware
  • Local Zones: An AZ within a particular city, for metro-area network access
  • Wavelength Zones: Like Local Zones, but tied to a particular cellular network carrier

The current implementation of failure domains only takes AZs within a region into account.
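To make the placement group dimension concrete, here is a minimal sketch of creating the three strategies through the EC2 API with aws-sdk-go (v1); the group names and partition count are illustrative only.

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	svc := ec2.New(session.Must(session.NewSession()))

	// Cluster: co-locate instances inside one AZ for low-latency HPC traffic.
	if _, err := svc.CreatePlacementGroup(&ec2.CreatePlacementGroupInput{
		GroupName: aws.String("example-hpc"), // hypothetical name
		Strategy:  aws.String(ec2.PlacementStrategyCluster),
	}); err != nil {
		log.Fatal(err)
	}

	// Partition: spread instances across logical partitions (racks) in an AZ.
	if _, err := svc.CreatePlacementGroup(&ec2.CreatePlacementGroupInput{
		GroupName:      aws.String("example-partitioned"), // hypothetical name
		Strategy:       aws.String(ec2.PlacementStrategyPartition),
		PartitionCount: aws.Int64(3), // illustrative
	}); err != nil {
		log.Fatal(err)
	}

	// Spread: strict placement of a small group on distinct hardware.
	if _, err := svc.CreatePlacementGroup(&ec2.CreatePlacementGroupInput{
		GroupName: aws.String("example-spread"), // hypothetical name
		Strategy:  aws.String(ec2.PlacementStrategySpread),
	}); err != nil {
		log.Fatal(err)
	}
}
```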

Anything else you would like to add:

Environment:

  • Cluster-api-provider-aws version:
  • Kubernetes version: (use kubectl version):
  • OS (e.g. from /etc/os-release):

/milestone next
/priority important-longterm

randomvariable · Aug 26 '20 09:08

@randomvariable: The provided milestone is not valid for this repository. Milestones in this repository: [Next, v0.6.0, v0.6.1, v0.6.x]

Use /milestone clear to clear the milestone.

In response to this:

/milestone next
/priority important-longterm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot · Aug 26 '20 09:08

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.

/lifecycle stale

fejta-bot · Nov 24 '20 10:11

/lifecycle frozen
/milestone v0.7.0

randomvariable · Dec 02 '20 14:12

/triage accepted

sedefsavas · Nov 01 '21 17:11

Are there any detailed plans formed around this feature request? Since this seems to suggest extending the existing failure domain primitives, I'm wondering how one would use placement groups in conjunction with availability zones.

As far as I understand, the two are not mutually exclusive, so I would expect the failure domain to stay as the availability zone, with the ability to also specify the name of a placement group.
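To illustrate that shape (purely a sketch; the field below is hypothetical and not part of CAPA's API today):

```go
// Hypothetical addition to CAPA's AWSMachineSpec. The Machine's failure
// domain continues to pick the availability zone; the placement group is
// named independently, since the two are orthogonal.
type AWSMachineSpec struct {
	// ... existing fields elided ...

	// PlacementGroupName optionally places the instance into an
	// existing EC2 placement group within the selected zone.
	// +optional
	PlacementGroupName *string `json:"placementGroupName,omitempty"`
}
```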

IIUC this is similar to the availability set support in CAPZ, where you can specify an availability set as well as an availability zone if you desire to. Notably as well, I believe CAPZ will create an availability set if it doesn't exist but is specified, and will also delete it once it's no longer required.

I have been working on fleshing out a POC for placement groups within the OpenShift AWS MAPI provider, so would be happy to contribute towards adding placement group support to CAPA as well.

JoelSpeed · Jan 07 '22 13:01

@JoelSpeed There is no active work going on for this. Availability zones, placement groups, and possibly the other points in this issue may all be suitable to group together, given that they all relate to instance distribution. I haven't checked what CAPZ is doing yet. It would be great to have a proposal or ADR for this one.

sedefsavas · Jan 11 '22 09:01

Just wanted to add a little more colour to the placement group discussion: we've been discussing this quite a bit within OpenShift, in particular how placement groups should be configured.

Originally we had proposed that the configuration would be part of the MachineTemplate and that the group would be created based on the configuration in the template. However, it was identified that if different configurations were present in different templates, placement group creation could be non-deterministic: whichever Machine is processed first would win the configuration, and later Machines might carry different template values.

It seems to me like we need a separate resource (a new CRD?) to be created to represent the placement group. Alternatively, should this be considered part of the AWSCluster? If we define a list of placement groups as part of the AWSCluster and have the cluster controller reconcile them, they should be available for the Machines to use as soon as the cluster is set up. Do we have any rules/guidelines for what should/shouldn't become part of the AWSCluster?
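As a rough sketch of the AWSCluster alternative (every type and field name below is hypothetical):

```go
// Hypothetical types only; none of these exist in CAPA today.

// AWSPlacementGroup declares a placement group the cluster controller
// should create (and later delete) on the cluster's behalf.
type AWSPlacementGroup struct {
	// Name is the EC2 placement group name Machines will reference.
	Name string `json:"name"`

	// Strategy is one of "cluster", "partition" or "spread".
	Strategy string `json:"strategy"`

	// PartitionCount applies only to the partition strategy.
	// +optional
	PartitionCount *int64 `json:"partitionCount,omitempty"`
}

// AWSClusterSpec would gain a list the cluster controller reconciles
// up front, before any Machine consumes the groups.
type AWSClusterSpec struct {
	// ... existing fields elided ...

	// +optional
	PlacementGroups []AWSPlacementGroup `json:"placementGroups,omitempty"`
}
```

Having the cluster controller reconcile the list up front would make group creation deterministic regardless of Machine ordering, and gives a natural owner for deletion, much like the CAPZ availability set behaviour described earlier.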

JoelSpeed · Feb 08 '22 10:02

/remove-lifecycle frozen

richardcase · Jul 12 '22 15:07

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · Oct 10 '22 15:10

/remove-lifecycle stale

richardcase · Oct 10 '22 16:10

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · Feb 08 '23 17:02

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

k8s-triage-robot · Feb 08 '24 18:02

/lifecycle frozen

vincepri · Feb 28 '24 16:02