
Roadmap: ACR High Availability - Regional Endpoints for ACR Replicas for Regional High Availability

Open johnsonshi opened this issue 7 months ago • 3 comments

Overview

Description

This roadmap item tracks the work to support Replica Endpoints in Azure Container Registry (ACR). The feature will enable new region-specific, read-and-write-enabled endpoints for geo-replicated registries.

Context

Today, geo-replicated registries expose a single global FQDN (e.g., myregistry.azurecr.io). With the replica endpoints feature, registry owners can opt in through a registry configuration so that each replica exposes its own DNS endpoint. This allows proximity-based push/pull, improved availability, replica load planning, disaster recovery, and failover.

Problem Statement

Currently:

  • Clients cannot control which replica handles a request.
  • All operations route through a single global endpoint, which the ACR service then routes to the replica that Azure determines has the best network performance profile (based on the client's IP).
  • The ACR global endpoint sometimes routes clients to a region that is experiencing an outage, which leaves the registry unusable for every client routed to that region's replica.
    • The only workaround today is to disable the replica in the affected region, which forces the global endpoint to stop routing there. Asking users to disable a replica during an outage is a poor experience.
  • Regional failover, spreading load across replicas, and network planning are limited with a single global endpoint where clients cannot control which replica ends up processing the push/pull request.

Proposal

Replica endpoints are turned off by default when a registry is created. To use them, the registry owner must flip a registry configuration, which enables all replica endpoints. Replica endpoints can be enabled even if the registry is not geo-replicated (has only 1 replica), so that clients can onboard to replica endpoints gradually.

Proposed example format:

  • Global endpoint: myregistry.azurecr.io ← push and pull allowed ← this will always be enabled
  • Replica endpoint if enabled (East US): myregistry.eastus.azurecr.io ← push and pull allowed
  • Replica endpoint if enabled (West Europe): myregistry.westeurope.azurecr.io ← push and pull allowed
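
As a rough illustration of the proposed format above, the hostnames could be derived as follows. This is only a sketch: the `<registry>.<region>.azurecr.io` pattern is taken from the proposed examples and may change before preview, and the helper names and the region normalization ("East US" to "eastus") are assumptions made for illustration.

```python
def global_endpoint(registry: str, suffix: str = "azurecr.io") -> str:
    """Global endpoint, which the proposal says is always enabled."""
    return f"{registry}.{suffix}"


def replica_endpoint(registry: str, region: str, suffix: str = "azurecr.io") -> str:
    """Replica endpoint in the proposed <registry>.<region>.<suffix> form.

    Assumption: the region label is the Azure region name, lowercase with
    no spaces (e.g. "East US" -> "eastus").
    """
    region_label = region.replace(" ", "").lower()
    return f"{registry}.{region_label}.{suffix}"


if __name__ == "__main__":
    print(global_endpoint("myregistry"))
    for region in ("East US", "West Europe"):
        print(replica_endpoint("myregistry", region))
```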

Use Case

This enables clients to explicitly target a regional replica for both push and pull operations while maintaining centralized management and replication consistency.

Examples:

  • CI/CD platforms and AKS clusters can prefer nearby replicas for lower latency and egress
  • Clients can manually fail over between endpoints if network issues or outages occur, or if the global endpoint repeatedly routes traffic to a replica that is experiencing issues. For example, clients can fail over from the global endpoint to a replica endpoint, from a replica endpoint to the global endpoint, or from one replica endpoint to another.
  • Enterprises that need to spread load across replicas can tie specific clusters to specific replica endpoints. These clusters then fail over to another replica endpoint or to the global endpoint only when they encounter issues.
  • Clients can control routing to specific replicas, even if the global endpoint routes the request (in rare cases) to a replica in a region that is experiencing issues. This unblocks clients in push/pull workflows even if some replicas or the global endpoint are temporarily in a degraded state.
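
The manual failover described in the examples above could look roughly like this on the client side. It is a sketch under assumptions, not part of the proposal: the client keeps its own ordered preference list, and "healthy" is approximated by whether the registry's `/v2/` API base answers at all (a 401 still means the endpoint is reachable). `probe_v2` and `pick_endpoint` are hypothetical names.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, Optional


def probe_v2(host: str, timeout: float = 3.0) -> bool:
    """Return True if https://<host>/v2/ responds without a server error.

    An unauthenticated request normally gets 401, which still shows the
    endpoint is up, so any status below 500 counts as reachable.
    """
    try:
        urllib.request.urlopen(f"https://{host}/v2/", timeout=timeout)
        return True
    except urllib.error.HTTPError as err:
        return err.code < 500
    except OSError:
        # URLError, timeouts, and connection failures are all OSError subclasses.
        return False


def pick_endpoint(
    candidates: Iterable[str],
    is_healthy: Callable[[str], bool] = probe_v2,
) -> Optional[str]:
    """Return the first candidate that passes the health check, else None."""
    for host in candidates:
        if is_healthy(host):
            return host
    return None
```

A cluster pinned to East US might pass `["myregistry.eastus.azurecr.io", "myregistry.westeurope.azurecr.io", "myregistry.azurecr.io"]`, so it prefers its own replica, then another replica, then the global endpoint.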

Integrations

Azure Networking plans to integrate Private Traffic Manager profiles with Private DNS Zones, and customers can combine that integration with ACR Regional Endpoints. Deployment manifests in AKS clusters can continue pointing to the global endpoint (registryname.azurecr.io). The Private Traffic Manager profile (attached to a Private DNS Zone) then provides DNS-level redirection from the global endpoint (registryname.azurecr.io) to the regional endpoint (registryname.westus.geo.azurecr.io). This pins all traffic from a VNet with a Private Endpoint to a specific ACR geo-replica, which supports strong in-region pinning (cost savings from avoiding cross-region traffic) as well as disaster resiliency/HA (DNS-level redirection to another regional endpoint when needed).

Azure Kubernetes Service plans to support node containerd mirroring, and customers can also combine that with ACR Regional Endpoints. Deployment manifests in AKS clusters can continue pointing to the global endpoint (registryname.azurecr.io). Node containerd mirroring can then be configured to provide node-level redirection from the global endpoint (registryname.azurecr.io) to the regional endpoint (registryname.westus.geo.azurecr.io). This pins all traffic from AKS nodes to a specific ACR geo-replica, which supports the same in-region pinning and disaster resiliency/HA scenarios described above. See https://github.com/Azure/AKS/issues/1940 for the AKS issue.
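
For a sense of what node-level redirection might look like, the sketch below renders a containerd `hosts.toml` that sends pulls for the global endpoint to a regional endpoint. It relies on containerd's existing registry host configuration format. Whether the planned AKS feature exposes this file directly is an assumption, and `render_hosts_toml` is a hypothetical helper.

```python
def render_hosts_toml(global_host: str, regional_host: str) -> str:
    """Render a containerd hosts.toml that mirrors pulls to a regional endpoint.

    Assumption: the file would live in containerd's per-registry config
    directory for the global endpoint. The global endpoint stays as
    `server`, so containerd falls back to it if the mirror is unavailable.
    """
    return (
        f'server = "https://{global_host}"\n'
        "\n"
        f'[host."https://{regional_host}"]\n'
        '  capabilities = ["pull", "resolve"]\n'
    )


if __name__ == "__main__":
    print(render_hosts_toml(
        "registryname.azurecr.io",
        "registryname.westus.geo.azurecr.io",
    ))
```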


Milestones

✅ Design and Specs

  • [x] Architecture finalized
  • [x] Internal architecture document reviewed
  • [Pending] Internal PRD reviewed

🚧 Private Preview

  • [Pending] Video demo of replica endpoints proof of concept
  • [Pending] Video demo of replica endpoints proof of concept with private endpoint integration
  • [ ] Private Preview of Replica Endpoints where customers can request access

⏳ Public Preview

  • [ ] Preview rollout in public regions allowing all customers to use the feature without manually requesting access
  • [ ] Public docs in MS Learn in ACR

📦 GA Scope

  • [ ] GA of replica endpoints
  • [ ] AKS and containerd integration guidance + AKS/ACR architectural reference docs that reference replica endpoints (with PE) in various VNet configurations

Status

Active development — follow this issue for milestone updates and preview availability.

johnsonshi, Jun 03 '25 19:06

No further updates at this time. The updated project timelines are being finalized.

getk12, Jul 25 '25 18:07

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 30 days.

github-actions[bot], Sep 24 '25 02:09

Update: We're targeting February 2026 for the private preview of Replica Endpoints. We are currently preparing the preview environment and materials.

If you're interested in participating in the private preview, please let us know! This will give you early access to test region-specific endpoints for your geo-replicated registries.

johnsonshi, Oct 24 '25 20:10