[Azure] After some days, the etcd-main, etcd-events & kops-controller pods of Azure kOps clusters fill with 401 errors while trying to access the kOps storage account
/kind bug
After a few days, the etcd-main, etcd-events, and kops-controller pods of Azure kOps clusters fill with 401 errors while trying to access the kOps storage account.
We have seen this in multiple clusters.
After a few more days, the error changes to:
AuthenticationErrorDetail: Lifetime validation failed. The token is expired
Temporary fix: the kops-controller pod can be fixed by deleting it; the new pod comes up fine. For the etcd pods, we have to restart the control-plane machine.
Expected: the token refresh should happen automatically for the system-assigned identity.
W0916 05:23:12.468888 4968 controller.go:161] unexpected error running etcd cluster reconciliation loop: error checking control store: error reading cluster-creation marker file azureblob://cluster-configs/<cluster-key>.eastus2.azure.abc.com/backups/etcd/main/control/etcd-cluster-created: -> sigs.k8s.io/etcdadm/etcd-manager/vendor/github.com/Azure/azure-storage-blob-go/azblob.newStorageError, vendor/github.com/Azure/azure-storage-blob-go/azblob/zc_storage_error.go:42
===== RESPONSE ERROR (ServiceCode=InvalidAuthenticationInfo) =====
Description=Server failed to authenticate the request. Please refer to the information in the www-authenticate header.
RequestId:xxxxx-xxxxx-xxxxx-xxxxxxx
Time:2024-09-16T05:23:12.4718165Z, Details:
AuthenticationErrorDetail: Signature validation failed. Signature key not found.
Code: InvalidAuthenticationInfo
GET https://<kops-storage-account>.blob.core.windows.net/cluster-configs/<cluster-key>.eastus2.azure.abc.com/backups/etcd/main/control/etcd-cluster-created?timeout=61
Authorization: REDACTED
User-Agent: [Azure-Storage/0.15 (go1.19.9; linux)]
X-Ms-Client-Request-Id: [xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxx]
X-Ms-Version: [2020-10-02]
--------------------------------------------------------------------------------
RESPONSE Status: 401 Server failed to authenticate the request. Please refer to the information in the www-authenticate header.
Content-Length: [408]
Content-Type: [application/xml]
Date: [Mon, 16 Sep 2024 05:23:12 GMT]
Server: [Microsoft-HTTPAPI/2.0]
Www-Authenticate: [Bearer authorization_uri=https://login.microsoftonline.com/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxx/oauth2/authorize resource_id=https://storage.azure.com/]
X-Ms-Error-Code: [InvalidAuthenticationInfo]
X-Ms-Request-Id: [xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxx]
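A minimal sketch of the manual workaround described above, as plain commands; the label selector, resource group, scale-set name, and instance id are illustrative assumptions, not something kOps provides:
# kops-controller recovers once its pod is recreated and picks up a fresh token
kubectl -n kube-system delete pod -l k8s-app=kops-controller
# the etcd-manager pods only recover after the control-plane machine itself is restarted,
# e.g. by restarting the VMSS instance backing the control-plane instance group
az vmss restart --resource-group <cluster_key> --name <control-plane-vmss-name> --instance-ids <instance-id>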
1. What kops version are you running? The command kops version, will display
this information.
v1.28.5
2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.
v1.28.11
3. What cloud provider are you using? azure
4. What commands did you run? What is the simplest way to reproduce this issue?
kubectl -n kube-system logs -f etcd-manager-main-control-plane-eastus2-3000005
5. What happened after the commands executed? The stack trace shown above can be seen: 401 errors while accessing the kOps storage account.
6. What did you expect to happen? No error
7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.
apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  name: <cluster_key>.eastus2.azure.abc.com
spec:
  api:
    loadBalancer:
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudConfig:
    azure:
      adminUser: ubuntu
      resourceGroupName: <cluster_key>
      routeTableName: <cluster_key>
      subscriptionId: xxxx
      tenantId: xxxxxx
  cloudLabels:
    cluster-name: <cluster_key>
    k8s.io_cluster-autoscaler_<cluster_key>.eastus2.azure.abc.com: owned
    k8s.io_cluster-autoscaler_enabled: "1"
    k8s.io_cluster-autoscaler_node-template_label: "1"
  cloudProvider: azure
  configBase: azureblob://cluster-configs/<cluster_key>.eastus2.azure.abc.com
  etcdClusters:
  - cpuRequest: 200m
    etcdMembers:
    - instanceGroup: control-plane-eastus2-3
      volumeType: StandardSSD_LRS
      name: etcd-3
    manager:
      backupRetentionDays: 7
    memoryRequest: 100Mi
    name: main
  - cpuRequest: 100m
    etcdMembers:
    - instanceGroup: control-plane-eastus2-3
      volumeType: StandardSSD_LRS
      name: etcd-3
    manager:
      backupRetentionDays: 7
    memoryRequest: 100Mi
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeControllerManager:
    terminatedPodGCThreshold: 1024
  kubeDNS:
    provider: CoreDNS
    nodeLocalDNS:
      enabled: true
  kubelet:
    anonymousAuth: false
    authenticationTokenWebhook: true
    authorizationMode: Webhook
    SerializeImagePulls: true
  kubernetesVersion: 1.28.11
  masterPublicName: api.<cluster-key>.eastus2.azure.abc.com
  networkCIDR: 172.26.240.0/20
  kubeProxy:
    enabled: true
  networking:
    cilium:
      enableNodePort: false
  nonMasqueradeCIDR: 100.64.0.0/10
  subnets:
  - cidr: 172.26.240.0/22
    name: utility-eastus2
    region: eastus2
    type: Public
  - cidr: 172.26.248.0/21
    name: eastus2
    region: eastus2
    type: Private
  topology:
    dns:
      type: None
    masters: private
    nodes: private
  updatePolicy: external
8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.
9. Anything else do we need to know? The kops-controller pod can be fixed by deleting it; the new pod comes up fine, but for the etcd pods we have to restart the control-plane machine.
After a few more days, the log is filled with:
W0916 04:58:08.982238 5081 controller.go:161] unexpected error running etcd cluster reconciliation loop: error checking control store: error reading cluster-creation marker file azureblob://cluster-configs/<cluster-key>.eastus2.azure.abc.com/backups/etcd/main/control/etcd-cluster-created: -> sigs.k8s.io/etcdadm/etcd-manager/vendor/github.com/Azure/azure-storage-blob-go/azblob.newStorageError, vendor/github.com/Azure/azure-storage-blob-go/azblob/zc_storage_error.go:42
===== RESPONSE ERROR (ServiceCode=InvalidAuthenticationInfo) =====
Description=Server failed to authenticate the request. Please refer to the information in the www-authenticate header.
RequestId:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxx
Time:2024-09-16T04:58:08.9828773Z, Details:
AuthenticationErrorDetail: Lifetime validation failed. The token is expired.
Code: InvalidAuthenticationInfo
GET https://<kops-storageaccount>.blob.core.windows.net/cluster-configs/<cluster-key>.eastus2.azure.abc.com/backups/etcd/main/control/etcd-cluster-created?timeout=61
Authorization: REDACTED
User-Agent: [Azure-Storage/0.15 (go1.19.9; linux)]
X-Ms-Client-Request-Id: [xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxx]
X-Ms-Version: [2020-10-02]
--------------------------------------------------------------------------------
RESPONSE Status: 401 Server failed to authenticate the request. Please refer to the information in the www-authenticate header.
Content-Length: [404]
Content-Type: [application/xml]
Date: [Mon, 16 Sep 2024 04:58:08 GMT]
Server: [Microsoft-HTTPAPI/2.0]
Www-Authenticate: [Bearer authorization_uri=https://login.microsoftonline.com/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxx/oauth2/authorize resource_id=https://storage.azure.com]
X-Ms-Error-Code: [InvalidAuthenticationInfo]
X-Ms-Request-Id: [xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxx]
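To see what the VM's system-assigned identity is actually being issued, the token can be fetched from the instance metadata service on the control-plane node; a sketch, assuming the default IMDS endpoint and that jq is available:
# request a storage token via IMDS and print its expiry (Unix timestamp)
curl -s -H "Metadata: true" "http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&resource=https://storage.azure.com/" | jq '.expires_on'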
cc: @hakman
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
/lifecycle frozen
@ajgupta42 I am trying to reproduce this at the moment, but it will take some time to run the cluster. Do you still encounter these issues?
Yes, I'm still facing it. As a workaround, I have scheduled a kops-controller pod deletion every 24 hours.
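A sketch of that scheduled workaround, assuming the kops-controller pods carry a k8s-app=kops-controller label and that a host with cluster credentials runs cron; the label and schedule are assumptions, not part of kOps:
# crontab entry: recreate the kops-controller pods once a day so they start with a fresh token
0 0 * * * kubectl -n kube-system delete pod -l k8s-app=kops-controller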
The etcd pods are still showing these warnings:
W0708 15:44:46.224794 5059 controller.go:161] unexpected error running etcd cluster reconciliation loop: error checking control store: error reading cluster-creation marker file azureblob://cluster-configs/xyz.eastus.azure.abc.com/backups/etcd/main/control/etcd-cluster-created: -> sigs.k8s.io/etcdadm/etcd-manager/vendor/github.com/Azure/azure-storage-blob-go/azblob.newStorageError, vendor/github.com/Azure/azure-storage-blob-go/azblob/zc_storage_error.go:42
===== RESPONSE ERROR (ServiceCode=InvalidAuthenticationInfo) =====
Description=Server failed to authenticate the request. Please refer to the information in the www-authenticate header.
RequestId:xxxxxx-xxxxx-xxxx-xxxx-xxxxx000000
Time:2025-07-08T15:44:46.2256747Z, Details:
AuthenticationErrorDetail: Signature validation failed. Signature key not found.
Code: InvalidAuthenticationInfo
GET https://kopsxyz.blob.core.windows.net/cluster-configs/xyz.eastus.azure.abc.com/backups/etcd/main/control/etcd-cluster-created?timeout=61
Authorization: REDACTED
User-Agent: [Azure-Storage/0.15 (go1.19.9; linux)]
X-Ms-Client-Request-Id: [xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxx]
X-Ms-Version: [2020-10-02]
--------------------------------------------------------------------------------
RESPONSE Status: 401 Server failed to authenticate the request. Please refer to the information in the www-authenticate header.
Content-Length: [408]
Content-Type: [application/xml]
Date: [Tue, 08 Jul 2025 15:44:45 GMT]
Server: [Microsoft-HTTPAPI/2.0]
Www-Authenticate: [Bearer authorization_uri=https://login.microsoftonline.com/xxxxxx-xxxx-xxxx-xxxx-xxxxxxxx/oauth2/authorize resource_id=https://storage.azure.com]
X-Ms-Error-Code: [InvalidAuthenticationInfo]
X-Ms-Request-Id: [xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxx]
If it is only etcd, that makes sense, because in this case it means that azure-sdk-for-go fixed the issue.
I also updated etcd-manager in the past few weeks, and the fixes should be part of kOps v1.33.0-beta.1.
What version of kOps are you running?
I have 2 clusters on Azure; one is on kOps 1.28.5 and the other on 1.28.7.
You might be able to use newer kOps with those k8s versions. But I guess the fixes will probably be mostly in kOps 1.33.
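For reference, a minimal sketch of the usual upgrade flow once a kOps release containing the fixes is available; the cluster name is illustrative:
# after installing the newer kops binary:
kops upgrade cluster --name <cluster_key>.eastus2.azure.abc.com --yes
kops update cluster --name <cluster_key>.eastus2.azure.abc.com --yes
kops rolling-update cluster --name <cluster_key>.eastus2.azure.abc.com --yes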
Yup, waiting for the final release of 1.33.