
Failed ILB during capz E2E

Open jackfrancis opened this issue 2 years ago • 2 comments

What happened:

While investigating a test flake, I noticed an ILB error that may be interesting:

{ Failure /home/prow/go/src/sigs.k8s.io/cluster-api-provider-azure/test/e2e/azure_test.go:434
Timed out after 900.111s.
Service default/web1zqon9-ilb failed
Service:
{
  "metadata": {
    "name": "web1zqon9-ilb",
    "namespace": "default",
    "uid": "f1eaabf3-cb5d-4859-b786-4f1909e62928",
    "resourceVersion": "1177",
    "creationTimestamp": "2022-04-26T09:30:19Z",
    "annotations": {
      "service.beta.kubernetes.io/azure-load-balancer-internal": "true"
    },
    "finalizers": [
      "service.kubernetes.io/load-balancer-cleanup"
    ],
    "managedFields": [
      {
        "manager": "cloud-controller-manager",
        "operation": "Update",
        "apiVersion": "v1",
        "time": "2022-04-26T09:30:19Z",
        "fieldsType": "FieldsV1",
        "fieldsV1": {
          "f:metadata": {
            "f:finalizers": {
              ".": {},
              "v:\"service.kubernetes.io/load-balancer-cleanup\"": {}
            }
          }
        },
        "subresource": "status"
      },
      {
        "manager": "cluster-api-e2e",
        "operation": "Update",
        "apiVersion": "v1",
        "time": "2022-04-26T09:30:19Z",
        "fieldsType": "FieldsV1",
        "fieldsV1": {
          "f:metadata": {
            "f:annotations": {
              ".": {},
              "f:service.beta.kubernetes.io/azure-load-balancer-internal": {}
            }
          },
          "f:spec": {
            "f:allocateLoadBalancerNodePorts": {},
            "f:externalTrafficPolicy": {},
            "f:internalTrafficPolicy": {},
            "f:ports": {
              ".": {},
              "k:{\"port\":80,\"protocol\":\"TCP\"}": {
                ".": {},
                "f:name": {},
                "f:port": {},
                "f:protocol": {},
                "f:targetPort": {}
              },
              "k:{\"port\":443,\"protocol\":\"TCP\"}": {
                ".": {},
                "f:name": {},
                "f:port": {},
                "f:protocol": {},
                "f:targetPort": {}
              }
            },
            "f:selector": {},
            "f:sessionAffinity": {},
            "f:type": {}
          }
        }
      }
    ]
  },
  "spec": {
    "ports": [
      {
        "name": "http",
        "protocol": "TCP",
        "port": 80,
        "targetPort": 80,
        "nodePort": 32348
      },
      {
        "name": "https",
        "protocol": "TCP",
        "port": 443,
        "targetPort": 443,
        "nodePort": 30461
      }
    ],
    "selector": {
      "app": "web1zqon9"
    },
    "clusterIP": "10.98.217.212",
    "clusterIPs": [
      "10.98.217.212"
    ],
    "type": "LoadBalancer",
    "sessionAffinity": "None",
    "externalTrafficPolicy": "Cluster",
    "ipFamilies": [
      "IPv4"
    ],
    "ipFamilyPolicy": "SingleStack",
    "allocateLoadBalancerNodePorts": true,
    "internalTrafficPolicy": "Cluster"
  },
  "status": {
    "loadBalancer": {}
  }
}
LAST SEEN                      TYPE     REASON                  OBJECT                 MESSAGE
2022-04-26 09:40:34 +0000 UTC  Normal   EnsuringLoadBalancer    service/web1zqon9-ilb  Ensuring load balancer
2022-04-26 09:40:34 +0000 UTC  Warning  SyncLoadBalancerFailed  service/web1zqon9-ilb  Error syncing load balancer: failed to ensure load balancer: reconcileSharedLoadBalancer: failed to list LB: ListManagedLBs: failed to get agent pool vmSet names: GetAgentPoolVMSetNames: failed to execute getAgentPoolScaleSets: not a vmss instance

Full test results are here:

https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-provider-azure-e2e-full-main/1518865402938003456

What you expected to happen:

Normally the ILB test passes without issue.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): v1.22.9, running the out-of-tree (OOT) cloud provider v1.1.13
  • Cloud provider or hardware configuration: Azure
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others:

jackfrancis avatar Apr 26 '22 20:04 jackfrancis

Basically the interesting thing to me is the VMSS failure immediately after the LB is created:

2022-04-26 09:40:34 +0000 UTC  Normal   EnsuringLoadBalancer    service/web1zqon9-ilb  Ensuring load balancer
2022-04-26 09:40:34 +0000 UTC  Warning  SyncLoadBalancerFailed  service/web1zqon9-ilb  Error syncing load balancer: failed to ensure load balancer: reconcileSharedLoadBalancer: failed to list LB: ListManagedLBs: failed to get agent pool vmSet names: GetAgentPoolVMSetNames: failed to execute getAgentPoolScaleSets: not a vmss instance

What might that indicate?
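For illustration only, here is a minimal Go sketch of the kind of name-pattern check that could surface a "not a vmss instance" error: the provider needs to derive an agent pool's scale set name from a node's VM name, and a VM whose name does not match the expected VMSS instance pattern (scale set name followed by an instance index suffix) fails the lookup. The regex, function name, and node names below are hypothetical, not the actual cloud-provider-azure implementation.

```go
package main

import (
	"fmt"
	"regexp"
)

// vmssInstanceRE is a hypothetical pattern for a VMSS instance name:
// a scale set name followed by a six-character hex instance index.
var vmssInstanceRE = regexp.MustCompile(`^(.+)([0-9a-f]{6})$`)

// scaleSetNameFromVMName derives the scale set name from a node's VM
// name, returning an error when the name is not in VMSS instance form.
func scaleSetNameFromVMName(vmName string) (string, error) {
	m := vmssInstanceRE.FindStringSubmatch(vmName)
	if m == nil {
		return "", fmt.Errorf("%s: not a vmss instance", vmName)
	}
	return m[1], nil
}

func main() {
	// A VMSS-backed node name parses cleanly into its scale set name.
	name, err := scaleSetNameFromVMName("capz-e2e-md-0-vmss000000")
	fmt.Println(name, err)

	// A name without the instance suffix fails, producing the kind of
	// "not a vmss instance" error seen in the event log above.
	_, err = scaleSetNameFromVMName("capz-e2e-control-plane-abc")
	fmt.Println(err)
}
```

If something like this is what fired during the test, it would suggest the controller briefly saw a node it did not recognize as a VMSS instance (e.g. a control-plane VM, or a node observed before its provider metadata was populated) while listing agent pool vmSet names.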

cc @feiskyer @CecileRobertMichon

jackfrancis avatar Apr 26 '22 20:04 jackfrancis

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jul 25 '22 20:07 k8s-triage-robot

/remove-lifecycle stale

bridgetkromhout avatar Aug 19 '22 18:08 bridgetkromhout

We haven't seen any similar issues recently. Please reopen and update with more details if you still hit this.

/close

feiskyer avatar Aug 23 '22 01:08 feiskyer

@feiskyer: Closing this issue.

In response to this:

We haven't seen any similar issues recently. Please reopen and update with more details if you still hit this.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Aug 23 '22 01:08 k8s-ci-robot