kuadrant-operator

publish strategy for DNS Failover or workload Migration

Open philbrookes opened this issue 1 year ago • 3 comments

Is blocked by: https://github.com/Kuadrant/dns-operator/issues/390

Why

Currently there is no way to ask the DNS Operator to publish or unpublish a DNS Record only when a certain level of redundancy is met. This means that gracefully removing a DNS Record requires manual intervention and an understanding of the internal workings of the DNS Operator.

Some Example Use Cases

  • DNS Failover (rapidly switch to alternative DNS Configuration) to a secondary site
  • Workload migration (removing workload from one cluster in favour of a new cluster)
  • Extra clusters during periods of high load

What

Add an optional publishStrategy to the DNS Policy CRD that allows an administrator to define rules which, when met, instruct the DNS Operator to publish or unpublish the records from the zone and set a condition in the status.

How

Diagram

https://miro.com/app/board/uXjVL32kOMY=/

Kuadrant operator changes

The DNS Policy and DNS Record CRDs will have a new field added to their spec:

publishStrategy:
  rule: <syntax to be confirmed>
  republish: true|false (default false)

This is read by the kuadrant-operator and propagated into any relevant DNS Records.

When the DNS Operator acts on these instructions it will set a condition in the DNS Record.

This condition will be propagated back into the relevant DNS Policy.
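
To make the two directions of propagation concrete, here is a minimal sketch in Python. It is only an illustration of the data flow described above: the class names, the `Published` condition type and the "all records must be published" aggregation are assumptions, not anything agreed in this issue, and the rule is kept as an opaque string because its syntax is still to be confirmed.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class PublishStrategy:
    rule: str                 # opaque here: the rule syntax is still to be confirmed
    republish: bool = False   # defaults to false, as in the proposed spec


@dataclass
class DNSRecord:
    name: str
    publish_strategy: Optional[PublishStrategy] = None
    conditions: Dict[str, bool] = field(default_factory=dict)


@dataclass
class DNSPolicy:
    name: str
    publish_strategy: Optional[PublishStrategy] = None
    conditions: Dict[str, bool] = field(default_factory=dict)


def propagate_strategy(policy: DNSPolicy, records: List[DNSRecord]) -> None:
    """kuadrant-operator side: copy the policy's strategy into each relevant record."""
    for record in records:
        record.publish_strategy = policy.publish_strategy


def propagate_condition(policy: DNSPolicy, records: List[DNSRecord],
                        cond_type: str = "Published") -> None:
    """Roll the per-record condition back up into the policy.

    Assumption: the policy counts as published only if every record is.
    """
    policy.conditions[cond_type] = bool(records) and all(
        record.conditions.get(cond_type) is True for record in records
    )
```

With two records where only one reports `Published` as true, `propagate_condition` would leave the policy's `Published` condition false under this aggregation.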

DNS Operator Changes

The DNS Operator will read the publishStrategy from the DNS Record on each reconcile and then interrogate the zone to see whether the publish rule is met. If it is, the operator will publish the records; if not, it will ensure the records are unpublished. It will then update the condition in the DNS Record status to reflect the decision.
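
The reconcile decision could look roughly like the sketch below. Since the rule syntax is not confirmed, the rule is modelled as a plain Python predicate over the records in the zone that belong to other owners. `ZoneRecord`, `Decision`, the condition type and the reason strings are all hypothetical names used for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ZoneRecord:
    owner: str
    healthy: bool = True


# Assumption: a rule is a predicate over the records owned by *other* clusters.
Rule = Callable[[List[ZoneRecord]], bool]


@dataclass
class Decision:
    publish: bool
    condition: Dict[str, str]


def reconcile(owner: str, zone: List[ZoneRecord], rule: Rule) -> Decision:
    """Evaluate the publish rule against the zone and build the status condition."""
    others = [record for record in zone if record.owner != owner]
    met = rule(others)
    condition = {
        "type": "Published",  # hypothetical condition type
        "status": "True" if met else "False",
        "reason": "PublishRuleMet" if met else "PublishRuleNotMet",
    }
    return Decision(publish=met, condition=condition)


def all_others_unhealthy(others: List[ZoneRecord]) -> bool:
    """Example rule: publish only when every other record is unhealthy."""
    return all(not record.healthy for record in others)
```

For example, `reconcile("cluster-2", [ZoneRecord("cluster-1", healthy=False)], all_others_unhealthy)` decides to publish, while the same call with a healthy cluster-1 record decides to unpublish.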

If the strategy sets republish to true then, for as long as the DNS Record exists, the DNS Operator will republish these records whenever the count of unowned leaf records drops below the resiliency requirement.
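
As a small sketch of the republish check: the issue does not say how the resiliency requirement is expressed, so the integer threshold below is an assumption, as is the function name.

```python
def should_republish(republish: bool, unowned_leaf_count: int,
                     resiliency_requirement: int) -> bool:
    """Return True when previously unpublished records should go back into the zone.

    Assumption: the resiliency requirement is a minimum number of leaf records
    owned by other clusters.
    """
    return republish and unowned_leaf_count < resiliency_requirement
```

With republish left at its default of false, the function never asks for a republish, which matches the migration case where the records should stay out of the zone.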

Use cases expanded

DNS Failover

To enact DNS Failover with this config, the rule for publishing could be set to "when all other records are marked as unhealthy".

Example

Cluster 1 publishing strategy is: always publish.
Cluster 2 publishing strategy is: "when number of active records unhealthy >= n".

  • Cluster 1 is currently published and healthy and cluster 2 has no published records.
  • An event occurs that causes the workload to begin malfunctioning on cluster 1.
  • All the records for cluster 1 are marked as unhealthy in the registry (but not removed as they are the only records available)
  • cluster 2 reconciles and sees that all the records currently in the zone are unhealthy; as this satisfies its publishing rule, it publishes its records
  • cluster 1 reconciles, sees that there are records other than its own, and so unpublishes its own records for being unhealthy
  • eventually cluster 1 is healthy again and republishes its records
  • cluster 2 sees records in the zone that are healthy and unpublishes its own records.
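
The sequence above can be walked through with a toy model. This is a sketch under simplifying assumptions: the zone is a dict from owner to health, cluster 2's rule is reduced to "all other records are unhealthy" rather than a threshold of n, and both function names are made up for the example.

```python
from typing import Dict

Zone = Dict[str, bool]  # owner -> whether that owner's records are healthy


def reconcile_primary(zone: Zone, healthy: bool) -> None:
    """Cluster 1: always publish, but withdraw unhealthy records once others exist."""
    others_exist = any(owner != "cluster-1" for owner in zone)
    if healthy:
        zone["cluster-1"] = True
    elif others_exist:
        zone.pop("cluster-1", None)
    else:
        # Only records available: keep them in the zone, marked unhealthy.
        zone["cluster-1"] = False


def reconcile_secondary(zone: Zone) -> None:
    """Cluster 2: publish only while every other record in the zone is unhealthy."""
    others = [ok for owner, ok in zone.items() if owner != "cluster-2"]
    if all(not ok for ok in others):
        zone["cluster-2"] = True
    else:
        zone.pop("cluster-2", None)
```

Starting from an empty zone and alternating the two reconciles, a healthy cluster 1 ends up as the only owner in the zone. While cluster 1 is unhealthy, cluster 2 takes over and cluster 1 withdraws. Once cluster 1 recovers and republishes, cluster 2 withdraws again.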

Workload migration

Cluster 1 has a workload that needs to be migrated to cluster 2.

  • workload is created on cluster 2
  • publishing strategy on cluster 1 is set to: "no other records exist" and republish false
  • records created by cluster 2
  • cluster 1 sees other records exist and unpublishes its records from the zone
  • admin sees from the status on the DNS Policy in cluster 1 that all records were removed from the zone more than one TTL ago
  • admin can safely remove the workload from cluster 1.
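
The admin's final check in the migration flow amounts to comparing a timestamp against the record TTL. A minimal sketch, assuming the condition exposes the time at which the records were unpublished (for instance through a last-transition timestamp, which this issue does not specify):

```python
from datetime import datetime, timedelta, timezone
from typing import Sequence


def no_other_records_exist(other_records: Sequence[str]) -> bool:
    """Cluster 1's migration rule: stay published only while nobody else is."""
    return len(other_records) == 0


def safe_to_remove_workload(unpublished_at: datetime, ttl_seconds: int,
                            now: datetime) -> bool:
    """True once more than one TTL has passed since the records left the zone."""
    return now - unpublished_at > timedelta(seconds=ttl_seconds)


# Usage: records were unpublished at 12:00 UTC with a 300 second TTL.
unpublished = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
print(safe_to_remove_workload(unpublished, 300, unpublished + timedelta(minutes=10)))
```

Waiting for a full TTL gives resolvers that cached the old answer time to expire it before the workload on cluster 1 disappears.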

Extra clusters during high load

This case would require that the rule is able to query metrics, which is not confirmed yet.

Cluster 1 has the workload and always publishes.
Cluster 2 has the workload and has a publishing rule: when requests per minute > x.

philbrookes, Dec 13 '24 16:12

Is there an RFC for this work?

Boomatang, Jan 09 '25 12:01

related to https://github.com/Kuadrant/dns-operator/issues/356

maleck13, Jan 27 '25 11:01

@philbrookes I feel like this is an overarching epic that has migration and failover as features within it. WDYT?

Perhaps, naming-wise, "DNSRecord publish criteria", with two features under it: 1) DNS Failover and 2) Workload Migration?

maleck13, Feb 12 '25 08:02