docker-engine breaks acs-engine clusters prior to v0.25.0
ACS Engine deprecated the open source docker-engine project as a default CRI dependency with the v0.25.0 release:
https://github.com/Azure/acs-engine/releases/tag/v0.25.0
Relevant change:
https://github.com/Azure/acs-engine/pull/3896
Clusters running GPU workloads (e.g. a pool backed by a Standard_NC* VM SKU) are affected regardless of which version of acs-engine created them!
Docker recently deprecated the public apt repository that hosted the docker-engine packages, as explained on the landing page that now replaces the old apt repository directory:
http://apt.dockerproject.org/repo
In practice, this means that if you are running a cluster created by any version of acs-engine prior to v0.25.0, all cluster operations (e.g. scale and upgrade) will fail. If you are running GPU workloads, this applies to every version of acs-engine, not just versions prior to v0.25.0. We don't expect this situation to change: it originates with Docker, and the docker-engine apt artifacts were made available at their discretion.
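To make the two conditions above concrete, here is a minimal sketch of the check in Python. It is illustrative only and not part of acs-engine or aks-engine: the function names, the idea of passing in a list of VM sizes, and the default "Standard_NC" prefix (taken from the example SKU above) are all assumptions. Adjust the prefixes to match whichever GPU SKUs your pools actually use.

```python
import re
from typing import Iterable, Tuple

# First acs-engine release that no longer defaults to docker-engine,
# per the v0.25.0 release notes linked above.
FIRST_UNAFFECTED = (0, 25, 0)


def parse_version(tag: str) -> Tuple[int, ...]:
    """Parse a tag such as 'v0.24.1' or '0.24.1' into a tuple of ints."""
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)", tag.strip())
    if match is None:
        raise ValueError(f"unrecognized version tag: {tag!r}")
    return tuple(int(part) for part in match.groups())


def is_affected(
    acs_engine_version: str,
    vm_sizes: Iterable[str],
    gpu_prefixes: Tuple[str, ...] = ("Standard_NC",),  # assumption: example SKU family only
) -> bool:
    """Return True if a cluster matches the description in this issue.

    A cluster is treated as affected when it has at least one GPU pool
    (any acs-engine version), or when it was created by a version of
    acs-engine older than v0.25.0.
    """
    has_gpu_pool = any(size.startswith(gpu_prefixes) for size in vm_sizes)
    return has_gpu_pool or parse_version(acs_engine_version) < FIRST_UNAFFECTED
```

For example, is_affected("v0.24.2", ["Standard_D2_v2"]) and is_affected("v0.26.1", ["Standard_NC6"]) both return True, while is_affected("v0.25.0", ["Standard_D2_v2"]) returns False.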
What should I do if I am affected?
Upgrade your cluster to aks-engine: https://github.com/Azure/aks-engine/releases
We recommend running aks-engine upgrade with the latest version of aks-engine to keep clusters that match the description above running stably.