kops [Azure] After some days, etcd-main, etcd-events and kops-controller pods on Azure kOps clusters fill with 401 errors when accessing the kops storage account

/kind bug

After some days, the logs of the etcd-main, etcd-events and kops-controller pods on Azure kOps clusters fill with 401 errors when trying to access the kops storage account. I have seen this on multiple clusters. At first the error is AuthenticationErrorDetail: Signature validation failed. Signature key not found. After some more days, it changes to AuthenticationErrorDetail: Lifetime validation failed. The token is expired.

Temporary fix: the kops-controller pod can be fixed by deleting it, and the new pod comes up fine, but for the etcd pods we have to restart the control-plane machine. Expected: token refresh should happen automatically for the system identity.

W0916 05:23:12.468888    4968 controller.go:161] unexpected error running etcd cluster reconciliation loop: error checking control store: error reading cluster-creation marker file azureblob://cluster-configs/<cluster-key>.eastus2.azure.abc.com/backups/etcd/main/control/etcd-cluster-created: -> sigs.k8s.io/etcdadm/etcd-manager/vendor/github.com/Azure/azure-storage-blob-go/azblob.newStorageError, vendor/github.com/Azure/azure-storage-blob-go/azblob/zc_storage_error.go:42
===== RESPONSE ERROR (ServiceCode=InvalidAuthenticationInfo) =====
Description=Server failed to authenticate the request. Please refer to the information in the www-authenticate header.
RequestId:xxxxx-xxxxx-xxxxx-xxxxxxx
Time:2024-09-16T05:23:12.4718165Z, Details:
   AuthenticationErrorDetail: Signature validation failed. Signature key not found.
   Code: InvalidAuthenticationInfo
   GET https://<kops-storage-account>.blob.core.windows.net/cluster-configs/<cluster-key>.eastus2.azure.abc.com/backups/etcd/main/control/etcd-cluster-created?timeout=61
   Authorization: REDACTED
   User-Agent: [Azure-Storage/0.15 (go1.19.9; linux)]
   X-Ms-Client-Request-Id: [xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxx]
   X-Ms-Version: [2020-10-02]
   --------------------------------------------------------------------------------
   RESPONSE Status: 401 Server failed to authenticate the request. Please refer to the information in the www-authenticate header.
   Content-Length: [408]
   Content-Type: [application/xml]
   Date: [Mon, 16 Sep 2024 05:23:12 GMT]
   Server: [Microsoft-HTTPAPI/2.0]
   Www-Authenticate: [Bearer authorization_uri=https://login.microsoftonline.com/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxx/oauth2/authorize resource_id=https://storage.azure.com/]
   X-Ms-Error-Code: [InvalidAuthenticationInfo]
   X-Ms-Request-Id: [xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxx]

1. What kops version are you running? The command kops version will display this information.
v1.28.5

2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.
v1.28.11

3. What cloud provider are you using?
azure

4. What commands did you run? What is the simplest way to reproduce this issue?
kubectl -n kube-system logs -f etcd-manager-main-control-plane-eastus2-3000005

5. What happened after the commands executed?
The stack trace shown above appears in the logs: a 401 while accessing the kops storage account.

6. What did you expect to happen?
No errors.

7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest. You may want to remove your cluster name and other sensitive information.

apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  name: <cluster_key>.eastus2.azure.abc.com
spec:
  api:
    loadBalancer:
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudConfig:
    azure:
      adminUser: ubuntu
      resourceGroupName: <cluster_key>
      routeTableName: <cluster_key>
      subscriptionId: xxxx
      tenantId: xxxxxx
  cloudLabels:
    cluster-name: <cluster_key>
    k8s.io_cluster-autoscaler_<cluster_key>.eastus2.azure.abc.com: owned
    k8s.io_cluster-autoscaler_enabled: "1"
    k8s.io_cluster-autoscaler_node-template_label: "1"
  cloudProvider: azure
  configBase: azureblob://cluster-configs/<cluster_key>.eastus2.azure.abc.com
  etcdClusters:
  - cpuRequest: 200m
    etcdMembers:
    - instanceGroup: control-plane-eastus2-3
      volumeType: StandardSSD_LRS
      name: etcd-3
    manager:
      backupRetentionDays: 7
    memoryRequest: 100Mi
    name: main
  - cpuRequest: 100m
    etcdMembers:
    - instanceGroup: control-plane-eastus2-3
      volumeType: StandardSSD_LRS
      name: etcd-3
    manager:
      backupRetentionDays: 7
    memoryRequest: 100Mi
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeControllerManager:
    terminatedPodGCThreshold: 1024
  kubeDNS:
    provider: CoreDNS
    nodeLocalDNS:
      enabled: true
  kubelet:
    anonymousAuth: false
    authenticationTokenWebhook: true
    authorizationMode: Webhook
    SerializeImagePulls: true
  kubernetesVersion: 1.28.11
  masterPublicName: api.<cluster-key>.eastus2.azure.abc.com
  networkCIDR: 172.26.240.0/20
  kubeProxy:
    enabled: true
  networking:
    cilium:
      enableNodePort: false
  nonMasqueradeCIDR: 100.64.0.0/10
  subnets:
  - cidr: 172.26.240.0/22
    name: utility-eastus2
    region: eastus2
    type: Public
  - cidr: 172.26.248.0/21
    name: eastus2
    region: eastus2
    type: Private
  topology:
    dns:
      type: None
    masters: private
    nodes: private
  updatePolicy: external

8. Please run the commands with most verbose logging by adding the -v 10 flag. Paste the logs into this report, or in a gist and provide the gist link here.

9. Anything else do we need to know? The kops-controller pod can be fixed by deleting it, and the new pod comes up fine, but for the etcd pods we have to restart the control-plane machine.

After some more days, the logs fill with:

W0916 04:58:08.982238    5081 controller.go:161] unexpected error running etcd cluster reconciliation loop: error checking control store: error reading cluster-creation marker file azureblob://cluster-configs/<cluster-key>.eastus2.azure.abc.com/backups/etcd/main/control/etcd-cluster-created: -> sigs.k8s.io/etcdadm/etcd-manager/vendor/github.com/Azure/azure-storage-blob-go/azblob.newStorageError, vendor/github.com/Azure/azure-storage-blob-go/azblob/zc_storage_error.go:42
===== RESPONSE ERROR (ServiceCode=InvalidAuthenticationInfo) =====
Description=Server failed to authenticate the request. Please refer to the information in the www-authenticate header.
RequestId:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxx
Time:2024-09-16T04:58:08.9828773Z, Details:
   AuthenticationErrorDetail: Lifetime validation failed. The token is expired.
   Code: InvalidAuthenticationInfo
   GET https://<kops-storageaccount>.blob.core.windows.net/cluster-configs/<cluster-key>.eastus2.azure.abc.com/backups/etcd/main/control/etcd-cluster-created?timeout=61
   Authorization: REDACTED
   User-Agent: [Azure-Storage/0.15 (go1.19.9; linux)]
   X-Ms-Client-Request-Id: [xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxx]
   X-Ms-Version: [2020-10-02]
   --------------------------------------------------------------------------------
   RESPONSE Status: 401 Server failed to authenticate the request. Please refer to the information in the www-authenticate header.
   Content-Length: [404]
   Content-Type: [application/xml]
   Date: [Mon, 16 Sep 2024 04:58:08 GMT]
   Server: [Microsoft-HTTPAPI/2.0]
   Www-Authenticate: [Bearer authorization_uri=https://login.microsoftonline.com/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxx/oauth2/authorize resource_id=https://storage.azure.com]
   X-Ms-Error-Code: [InvalidAuthenticationInfo]
   X-Ms-Request-Id: [xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxx]
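Since the two phases differ only in the AuthenticationErrorDetail line, a small helper can show which phase a given pod is in. The following Go sketch tallies those lines from a captured log; CountAuthErrorDetails is a hypothetical name, not part of kOps or etcd-manager, and it assumes the log has been saved as text (for example from kubectl logs).

```go
package main

import "strings"

// detailPrefix is the marker that precedes the error detail in the logs above.
const detailPrefix = "AuthenticationErrorDetail:"

// CountAuthErrorDetails returns how often each AuthenticationErrorDetail
// message occurs in the given log text.
func CountAuthErrorDetails(log string) map[string]int {
	counts := make(map[string]int)
	for _, line := range strings.Split(log, "\n") {
		idx := strings.Index(line, detailPrefix)
		if idx < 0 {
			continue // not an error-detail line
		}
		detail := strings.TrimSpace(line[idx+len(detailPrefix):])
		counts[detail]++
	}
	return counts
}
```

On the logs in this report, the result would have one key for "Signature validation failed. Signature key not found." and one for "Lifetime validation failed. The token is expired."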

cc: @hakman

Sep 16 '24 07:09 ajgupta42

/lifecycle stale

Dec 19 '24 17:12 k8s-triage-robot

/remove-lifecycle stale

Jan 07 '25 16:01 ajgupta42

/lifecycle frozen

Jan 07 '25 16:01 hakman

@ajgupta42 I am trying to reproduce this these days, but it will take some time to run the cluster. Do you still encounter such issues?

Jul 08 '25 14:07 hakman

Yes, I'm still facing it. As a workaround, I have scheduled a kops-controller pod deletion every 24 hours.

Jul 08 '25 16:07 ajgupta42

The etcd pods are still logging these warnings:

W0708 15:44:46.224794    5059 controller.go:161] unexpected error running etcd cluster reconciliation loop: error checking control store: error reading cluster-creation marker file azureblob://cluster-configs/xyz.eastus.azure.abc.com/backups/etcd/main/control/etcd-cluster-created: -> sigs.k8s.io/etcdadm/etcd-manager/vendor/github.com/Azure/azure-storage-blob-go/azblob.newStorageError, vendor/github.com/Azure/azure-storage-blob-go/azblob/zc_storage_error.go:42
===== RESPONSE ERROR (ServiceCode=InvalidAuthenticationInfo) =====
Description=Server failed to authenticate the request. Please refer to the information in the www-authenticate header.
RequestId:xxxxxx-xxxxx-xxxx-xxxx-xxxxx000000
Time:2025-07-08T15:44:46.2256747Z, Details:
   AuthenticationErrorDetail: Signature validation failed. Signature key not found.
   Code: InvalidAuthenticationInfo
   GET https://kopsxyz.blob.core.windows.net/cluster-configs/xyz.eastus.azure.abc.com/backups/etcd/main/control/etcd-cluster-created?timeout=61
   Authorization: REDACTED
   User-Agent: [Azure-Storage/0.15 (go1.19.9; linux)]
   X-Ms-Client-Request-Id: [xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxx]
   X-Ms-Version: [2020-10-02]
   --------------------------------------------------------------------------------
   RESPONSE Status: 401 Server failed to authenticate the request. Please refer to the information in the www-authenticate header.
   Content-Length: [408]
   Content-Type: [application/xml]
   Date: [Tue, 08 Jul 2025 15:44:45 GMT]
   Server: [Microsoft-HTTPAPI/2.0]
   Www-Authenticate: [Bearer authorization_uri=https://login.microsoftonline.com/xxxxxx-xxxx-xxxx-xxxx-xxxxxxxx/oauth2/authorize resource_id=https://storage.azure.com]
   X-Ms-Error-Code: [InvalidAuthenticationInfo]
   X-Ms-Request-Id: [xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxx]

Jul 08 '25 16:07 ajgupta42

Only etcd, because in this case it means that azure-sdk-for-go fixed the issue. I updated etcd-manager also in the past weeks, and the fixes should be part of kOps v1.33.0-beta.1. What version of kOps are you running?

Jul 08 '25 16:07 hakman

I have 2 clusters on Azure: one is on kops 1.28.5 and the other is on 1.28.7.

Jul 08 '25 16:07 ajgupta42

I have 2 clusters on Azure: one is on kops 1.28.5 and the other is on 1.28.7.

You might be able to use newer kOps with those k8s versions. But I guess the fixes will probably be mostly in kOps 1.33.

Jul 08 '25 17:07 hakman

Yup, waiting for the final release of 1.33.

Jul 08 '25 17:07 ajgupta42