source-controller Helm OCI repository - Failing to get credential from azure


Open masterphenix opened this issue 1 year ago • 12 comments

Hello, I have created a HelmRepository of type OCI like this:

apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: myhelmrepo
spec:
  type: oci
  provider: azure
  interval: 10m
  url: oci://myhelmrepo.azurecr.io/helm
  timeout: 60s

I get the following error:

$ kubectl -n flux get helmrepository myhelmrepo
myhelmrepo            oci://myhelmrepo.azurecr.io/helm             17h    False   failed to get credential from azure: DefaultAzureCredential: failed to acquire a token....

My kubernetes cluster underneath is an AKS cluster, and the managed Identity assigned to the kubelet does have access to the whole resource group where my registry is stored. There are other container registries in this resource group with standard docker images, and the cluster is able to pull images just fine.

Am I missing something ?

Sep 14 '22 07:09 masterphenix

Thanks for submitting this bug. Would you mind pasting the output of kubectl get helmrepo myhelmrepo -o jsonpath={.status}, please?

Sep 14 '22 08:09 makkes

Here is the status of the repo:

{"conditions":[{"lastTransitionTime":"2022-09-13T15:23:35Z","message":"failed to get credential from azure: DefaultAzureCredential: failed to acquire a token.\nAttempted credentials:\n\tEnvironmentCredential: missing environment variable AZURE_TENANT_ID\n\tManagedIdentityCredential: IMDS token request timed out\n\tAzureCLICredential: Azure CLI not found on path","observedGeneration":4,"reason":"AuthenticationFailed","status":"False","type":"Ready"}],"lastHandledReconcileAt":"2022-09-13T17:00:56.9966799+02:00","observedGeneration":4}

Sep 14 '22 08:09 masterphenix
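
The full status above is easier to scan when only the message of the Ready condition is printed, since the jsonpath output keeps the line breaks of the error. A minimal sketch, not taken from the thread: the helper name `helmrepo_ready_message` is made up for illustration, and the `flux-system` default namespace is an assumption (the first post uses `flux`).

```shell
# Hypothetical helper: print the message of the Ready condition of a HelmRepository.
# $1 = HelmRepository name, $2 = namespace (defaults to flux-system, an assumption).
helmrepo_ready_message() {
  kubectl -n "${2:-flux-system}" get helmrepository "$1" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].message}'
}
```

For the resource in the first post this would be called as `helmrepo_ready_message myhelmrepo flux`.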

The following documentation explains how to set up contextual login with Azure.

https://fluxcd.io/flux/components/source/helmrepositories/#azure

Sep 14 '22 08:09 souleb

I did read that documentation, but it's unclear to me what to do to use the kubelet managed identity. From what I understand, the aadpodidbinding label is only required if using AAD Pod Identity.

Sep 14 '22 09:09 masterphenix

Hi, in our Azure integration test infrastructure, we use the kubelet managed identity and grant the Kubernetes cluster access to the registry with a role assignment. We use Terraform to do this; here's the code we use: https://github.com/fluxcd/test-infra/blob/65e1a901cbb9b3f9f27ffad7f9a32a6366eae1cc/tf-modules/azure/acr/main.tf#L9-L14 . In case you'd like to see the whole setup configuration, refer to https://github.com/fluxcd/pkg/blob/dbad05cf95b380c6f619a9bf76dc755c6ff6e3cc/oci/tests/integration/terraform/azure/main.tf , which uses the Azure Terraform module from the first link.

In order to make sure this works with flux v0.34.0, I created a fresh AKS cluster using the above terraform configurations and pushed an OCI chart to it. Created a HelmRepository object:

apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: helm-test-repo
  namespace: default
spec:
  interval: 1m0s
  url: oci://fluxtest.azurecr.io/mydemo
  type: oci
  provider: azure

And it just worked:

status:
  conditions:
  - lastTransitionTime: "2022-09-14T11:18:18Z"
    message: Helm repository is ready
    observedGeneration: 1
    reason: Succeeded
    status: "True"
    type: Ready
  observedGeneration: 1

HelmRepo is ready. Also created a HelmChart from it:

apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmChart
metadata:
  name: demo-chart
  namespace: default
spec:
  interval: 5m0s
  chart: demo
  reconcileStrategy: ChartVersion
  sourceRef:
    kind: HelmRepository
    name: helm-test-repo
  version: '0.1.*'

And it too succeeded:

status:
  artifact:
    checksum: 8fcd85b0daeb12f1d7622b6c2574825567b88a1d759250fc6f02f73eefb322fd
    lastUpdateTime: "2022-09-14T11:19:17Z"
    path: helmchart/default/demo-chart/demo-0.1.0.tgz
    revision: 0.1.0
    size: 3750
    url: http://source-controller.flux-system.svc.cluster.local./helmchart/default/demo-chart/demo-0.1.0.tgz
  conditions:
  - lastTransitionTime: "2022-09-14T11:19:17Z"
    message: pulled 'demo' chart with version '0.1.0'
    observedGeneration: 1
    reason: ChartPullSucceeded
    status: "True"
    type: Ready
  - lastTransitionTime: "2022-09-14T11:19:17Z"
    message: pulled 'demo' chart with version '0.1.0'
    observedGeneration: 1
    reason: ChartPullSucceeded
    status: "True"
    type: ArtifactInStorage
  observedChartName: demo
  observedGeneration: 1
  url: http://source-controller.flux-system.svc.cluster.local./helmchart/default/demo-chart/latest.tar.gz

For kubelet managed identity, there's no other configuration needed if the role assignment is right and the HelmRepo has provider: azure set.

> managed Identity assigned to the kubelet does have access to the whole resource group where my registry is stored

I'm not very familiar with azure permissions. Maybe you should try role assignment like I showed above and see if that works.

Sep 14 '22 11:09 darkowlzz
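
For readers who don't use Terraform, the same kind of role assignment can be sketched with the Azure CLI. This is not the test-infra code, only an approximation of it under stated assumptions: the helper name `grant_kubelet_acrpull` is made up, and it assumes the cluster exposes a kubelet identity under `identityProfile.kubeletidentity` and that the caller is allowed to create role assignments on the registry.

```shell
# Hypothetical helper: grant the AKS kubelet identity the AcrPull role on a registry.
# $1 = resource group of the cluster, $2 = AKS cluster name, $3 = ACR name.
grant_kubelet_acrpull() {
  # Object ID of the user-assigned identity the kubelet runs as.
  kubelet_oid=$(az aks show --resource-group "$1" --name "$2" \
    --query identityProfile.kubeletidentity.objectId --output tsv) || return 1
  # Full resource ID of the registry, used as the scope of the assignment.
  acr_id=$(az acr show --name "$3" --query id --output tsv) || return 1
  az role assignment create \
    --assignee-object-id "$kubelet_oid" \
    --assignee-principal-type ServicePrincipal \
    --role AcrPull \
    --scope "$acr_id"
}
```

Scoping the assignment to the registry itself, rather than to the whole resource group, mirrors what the linked Terraform snippet does.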

Thank you @darkowlzz for your thorough reply. I have double-checked my existing configuration:

I have connected on a cluster node, and confirmed in /etc/kubernetes/azure.json the identity I am using, and also that the tenantId is correct:

    "aadClientId": "msi",
    "aadClientSecret": "msi",
    "tenantId": "xxxxxxxx-xxxxxx-xxxxxx-xxxx-xxxxxxxxxxxxx",
   [...]
    "userAssignedIdentityID": "xxxxxxx-yyyyyyy-zzzzz-zzzz-yyyyyyyy",

That same identity is shown in the result of the 'ps' command:

/usr/local/bin/kubelet [...] kubernetes.azure.com/kubelet-identity-client-id=xxxxx

That identity is assigned the AcrPull role in the ACR registry's role assignments.
All flux controllers are up to date with latest version (0.29.0 for the source controller).
AKS cluster has version 1.24.3

Sep 14 '22 13:09 masterphenix

This sounds like a difference in AKS cluster configuration. All the parameters used in our test cluster are listed here: https://github.com/fluxcd/test-infra/blob/65e1a901cbb9b3f9f27ffad7f9a32a6366eae1cc/tf-modules/azure/aks/main.tf#L6-L25. @masterphenix can you check how it differs from your cluster? It's possible that we are missing some common configuration that people usually use. The cluster I created used the default AKS version 1.23.8.

Sep 14 '22 13:09 darkowlzz

I removed the role assignment to see the error and the error I get seems to be different from the partial error string that you've shared:

failed to get credential from azure: error exchanging token: unexpected status code 401 from exchange request

Your error:

failed to get credential from azure: DefaultAzureCredential: failed to acquire a token....

I had to delete the pod to see the change in effect.

We use the DefaultAzureCredential because it attempts to authenticate via various means as documented in https://github.com/Azure/azure-sdk-for-go/blob/sdk/azidentity/v1.1.0/sdk/azidentity/default_azure_credential.go#L31-L36. There's some more detail about the error you shared in https://github.com/Azure/azure-sdk-for-go/blob/main/sdk/azidentity/TROUBLESHOOTING.md#troubleshoot-defaultazurecredential-authentication-issues .

Since I wanted to understand why it's working on my cluster before looking more into the code, I created a new role assignment of type App and assigned it to the System-assigned managed identity kubernetes service with the AcrPull role. I restarted the source-controller (SC) pod, but it didn't work.

I then checked the role assignment created by the Terraform config that works: it creates a role assignment of type User-assigned Managed Identity for the managed identity member <cluster-name>-agentpool. I tried creating the same one myself, and the authentication failures in the SC logs went away and login started working. @masterphenix Can you check what type of role assignment you have set?

Sep 14 '22 14:09 darkowlzz

The only notable difference from your config that I can see is the network_plugin we use, which is "azure", with network_policy "azure" also. Here is an extract of our terraform config:

resource "azurerm_kubernetes_cluster" "aks-cluster" {
  name                            = var.name
  location                        = azurerm_resource_group.aks-rg.location
  resource_group_name             = azurerm_resource_group.aks-rg.name
  dns_prefix                      = var.aks_cluster_name
  kubernetes_version              = var.aks_version

  default_node_pool {
    name                 = "default"
    vm_size              = var.default.aks_node_size
    node_count           = 1
    type                 = "VirtualMachineScaleSets"
    max_pods             = var.default.max_pods_per_node
    os_disk_size_gb      = var.default.os_disk_size
  }

  network_profile {
    network_plugin     = "azure"
    network_policy     = "azure"
    outbound_type      = "loadBalancer"
  }

  role_based_access_control {
    enabled = true
    azure_active_directory {
      managed                = true
      client_app_id          = null
      server_app_id          = null
      server_app_secret      = null
    }
  }

  identity {
    type = "SystemAssigned"
  }
}

I also confirm that the role assignment I have is of type "User-assigned Managed Identity".

Sep 14 '22 15:09 masterphenix

@masterphenix I didn't notice that your second comment has the full error:

DefaultAzureCredential: failed to acquire a token.
Attempted credentials:
    EnvironmentCredential: missing environment variable AZURE_TENANT_ID
    ManagedIdentityCredential: IMDS token request timed out
    AzureCLICredential: Azure CLI not found on path

Based on https://github.com/Azure/azure-sdk-for-go/blob/main/sdk/azidentity/TROUBLESHOOTING.md#azure-virtual-machine-managed-identity, this matches the third error case:

No response received from the managed identity endpoint.

Description:

No response was received for the request to IMDS or the request timed out.

Mitigation:

Ensure the VM is configured for managed identity as described in managed identity documentation.

Verify the IMDS endpoint is reachable on the VM. See below for instructions.

Sep 15 '22 11:09 darkowlzz

Thank you kindly for your investigations, they help narrow down the issue. Following the mitigation provided, I executed this on the node:

$ curl 'http://169.254.169.254/metadata/identity/oauth2/token?resource=https://management.core.windows.net&api-version=2018-02-01' -H "Metadata: true"

The response suggests that the issue is linked to the fact that the node has several user-assigned identities (UAIs) assigned to it:

{"error":"invalid_request","error_description":"Multiple user assigned identities exist, please specify the clientId / resourceId of the identity in the token request"}

I thought it was due to the aci_connector that was active on this cluster, and which has its own UAI, but I have disabled it, and I still have the same error. Below is the aci_connector terraform config that we use, by the way:

  addon_profile {
    aci_connector_linux {
     enabled = true
     subnet_name = var.subnet
    }
  }

Now, we also have AAD Pod Identity deployed for other workloads. So it seems that when AAD Pod Identity is present, the kubelet identity cannot be used by default, because DefaultAzureCredential does not default to the kubelet identity first. I will try using AAD Pod Identity, but I was hoping to avoid that and just use the kubelet identity instead.

Sep 15 '22 13:09 masterphenix
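
The IMDS error asks for a clientId or resourceId in the token request. The token endpoint accepts a `client_id` query parameter for this, so the same check can be repeated for one specific identity. A minimal sketch, assuming the client ID of the kubelet identity is known (for example from the kubelet-identity-client-id value shown earlier); the helper names `imds_token_url` and `imds_token` are made up for illustration.

```shell
# Hypothetical helper: build the IMDS token URL for one specific user-assigned identity.
# $1 = client ID of the identity, $2 = token audience (defaults to the one used in the thread).
imds_token_url() {
  printf 'http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&resource=%s&client_id=%s\n' \
    "${2:-https://management.core.windows.net}" "$1"
}

# Hypothetical helper: request a token for that identity; meant to be run on a node.
imds_token() {
  curl -s -H "Metadata: true" "$(imds_token_url "$@")"
}
```

If `imds_token <client-id>` returns a token while the request without `client_id` fails, the ambiguity between several identities is the likely cause rather than a missing role assignment.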

I have explicitly used the kubelet identity by applying the AAD Pod Identity label on the source-controller, and it works this way 👍

Sep 21 '22 11:09 masterphenix
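
The exact change is not shown in the thread. One way it could look is a patch that adds the aadpodidbinding label to the pod template of the source-controller deployment. This is a sketch under assumptions: the helper name `label_source_controller` is made up, the label value must match the selector of an AzureIdentityBinding that already exists and points at the kubelet identity, and `flux-system` is only a default namespace.

```shell
# Hypothetical helper: add the aadpodidbinding label to the source-controller pod template.
# $1 = selector of an existing AzureIdentityBinding, $2 = namespace (defaults to flux-system).
label_source_controller() {
  kubectl -n "${2:-flux-system}" patch deployment source-controller --type merge \
    -p "{\"spec\":{\"template\":{\"metadata\":{\"labels\":{\"aadpodidbinding\":\"$1\"}}}}}"
}
```

Changing the pod template triggers a new rollout, so the controller pod is recreated with the label without a manual restart.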

@masterphenix, it seems that your issue is solved. Can we close this?

Nov 23 '22 16:11 souleb

Yes, it is solved, thank you. Sorry, I forgot to confirm and close.

Nov 24 '22 08:11 masterphenix

Hi all, @makkes @souleb @masterphenix

I know this issue is already closed but we are currently running into the same issues mentioned and thought it might be helpful to share my findings.

The root cause of this should be the "Multiple user assigned identities exist, please specify the clientId" error mentioned in:

{"error":"invalid_request","error_description":"Multiple user assigned identities exist, please specify the clientId / resourceId of the identity in the token request"}

With System-assigned Managed Identity you can only have one identity. With User-assigned Managed Identity (UAI) up to 20 are possible. In our case 6 UAIs (node pool identity, Azure Key vault integration, Azure Policy integration, Azure Monitor integration, AKS GitOps extension (based on Flux) ...) are attached to the AKS nodes.

In this case, you will have to tell Azure which one to use.

The .NET SDK documentation provides a bit more detail on this (it's not mentioned in the Go SDK): https://github.com/Azure/azure-sdk-for-net/blob/main/sdk/identity/Azure.Identity/README.md#specify-a-user-assigned-managed-identity-with-defaultazurecredential

In my opinion, this could be done with the following options:

1. Provide an option to define the UAI and expose it via the environment variable AZURE_CLIENT_ID. More details: https://github.com/Azure/azure-sdk-for-go/tree/main/sdk/azidentity#specify-a-user-assigned-managed-identity-for-defaultazurecredential
2. Grab the node pool UAI from /etc/kubernetes/azure.json (security-wise this might not be the best idea).

Based on my current understanding this should fix the issue.

Short follow-up on Azure AAD Pod Identity: This is deprecated. The successor is Azure AD Workload Identity (https://azure.github.io/azure-workload-identity/docs/)

Happy to discuss this further.

Update: I did a quick POC to verify this and I was able to get it working by adding the above environment variable to the source controller deployment.

Nov 26 '22 08:11 nmeisenzahl
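
The POC itself is not shown in the thread. A minimal sketch of what adding the variable could look like, assuming the deployment is named source-controller; the helper name `pin_azure_client_id` is made up and `flux-system` is only a default namespace.

```shell
# Hypothetical helper: set AZURE_CLIENT_ID on the source-controller deployment so that
# DefaultAzureCredential asks for one specific user-assigned identity.
# $1 = client ID of the identity, $2 = namespace (defaults to flux-system).
pin_azure_client_id() {
  kubectl -n "${2:-flux-system}" set env deployment/source-controller "AZURE_CLIENT_ID=$1"
}
```

On a cluster where Flux manages its own manifests, a direct edit like this is likely to be reverted at the next reconciliation, so a lasting change would probably have to go into those manifests as a patch instead.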
