Helm OCI repository - Failing to get credential from azure
Hello, I have created a HelmRepository of type OCI like this:
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: myhelmrepo
spec:
  type: oci
  provider: azure
  interval: 10m
  url: oci://myhelmrepo.azurecr.io/helm
  timeout: 60s
I get the following error:
$ kubectl -n flux get helmrepository myhelmrepo
myhelmrepo oci://myhelmrepo.azurecr.io/helm 17h False failed to get credential from azure: DefaultAzureCredential: failed to acquire a token....
My Kubernetes cluster underneath is an AKS cluster, and the managed identity assigned to the kubelet does have access to the whole resource group where my registry is stored. There are other container registries in this resource group with standard Docker images, and the cluster is able to pull images just fine.
Am I missing something?
Thanks for submitting this bug. Would you mind pasting the output of kubectl get helmrepo myhelmrepo -o jsonpath={.status}, please?
Here is the status of the repo:
{"conditions":[{"lastTransitionTime":"2022-09-13T15:23:35Z","message":"failed to get credential from azure: DefaultAzureCredential: failed to acquire a token.\nAttempted credentials:\n\tEnvironmentCredential: missing environment variable AZURE_TENANT_ID\n\tManagedIdentityCredential: IMDS token request timed out\n\tAzureCLICredential: Azure CLI not found on path","observedGeneration":4,"reason":"AuthenticationFailed","status":"False","type":"Ready"}],"lastHandledReconcileAt":"2022-09-13T17:00:56.9966799+02:00","observedGeneration":4}
There is the following documentation on how to set up contextual login with Azure:
https://fluxcd.io/flux/components/source/helmrepositories/#azure
I did read that documentation, but it's unclear to me what to do to use the kubelet managed identity. From what I understand, the aadpodidbinding label is only required when using AAD Pod Identity.
Hi, in our Azure integration test infrastructure, we use the kubelet managed identity and grant the Kubernetes cluster access to the registry with a role assignment. We use Terraform to do this; here's the code we use: https://github.com/fluxcd/test-infra/blob/65e1a901cbb9b3f9f27ffad7f9a32a6366eae1cc/tf-modules/azure/acr/main.tf#L9-L14. In case you'd like to see the whole setup configuration, refer to https://github.com/fluxcd/pkg/blob/dbad05cf95b380c6f619a9bf76dc755c6ff6e3cc/oci/tests/integration/terraform/azure/main.tf, which uses the Azure Terraform module from the first link.
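For reference, a minimal sketch of what that role assignment boils down to (the resource names here are illustrative placeholders, not the actual test-infra names):
resource "azurerm_role_assignment" "acr_pull" {
  # Grant the cluster's kubelet (agent pool) identity pull access on the registry
  scope                = azurerm_container_registry.this.id
  role_definition_name = "AcrPull"
  principal_id         = azurerm_kubernetes_cluster.this.kubelet_identity[0].object_id
}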
To make sure this works with Flux v0.34.0, I created a fresh AKS cluster using the above Terraform configurations and pushed an OCI chart to the registry. Then I created a HelmRepository object:
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: helm-test-repo
  namespace: default
spec:
  interval: 1m0s
  url: oci://fluxtest.azurecr.io/mydemo
  type: oci
  provider: azure
And it just worked:
status:
  conditions:
  - lastTransitionTime: "2022-09-14T11:18:18Z"
    message: Helm repository is ready
    observedGeneration: 1
    reason: Succeeded
    status: "True"
    type: Ready
  observedGeneration: 1
The HelmRepo is ready. I also created a HelmChart from it:
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmChart
metadata:
  name: demo-chart
  namespace: default
spec:
  interval: 5m0s
  chart: demo
  reconcileStrategy: ChartVersion
  sourceRef:
    kind: HelmRepository
    name: helm-test-repo
  version: '0.1.*'
And it too succeeded:
status:
  artifact:
    checksum: 8fcd85b0daeb12f1d7622b6c2574825567b88a1d759250fc6f02f73eefb322fd
    lastUpdateTime: "2022-09-14T11:19:17Z"
    path: helmchart/default/demo-chart/demo-0.1.0.tgz
    revision: 0.1.0
    size: 3750
    url: http://source-controller.flux-system.svc.cluster.local./helmchart/default/demo-chart/demo-0.1.0.tgz
  conditions:
  - lastTransitionTime: "2022-09-14T11:19:17Z"
    message: pulled 'demo' chart with version '0.1.0'
    observedGeneration: 1
    reason: ChartPullSucceeded
    status: "True"
    type: Ready
  - lastTransitionTime: "2022-09-14T11:19:17Z"
    message: pulled 'demo' chart with version '0.1.0'
    observedGeneration: 1
    reason: ChartPullSucceeded
    status: "True"
    type: ArtifactInStorage
  observedChartName: demo
  observedGeneration: 1
  url: http://source-controller.flux-system.svc.cluster.local./helmchart/default/demo-chart/latest.tar.gz
For kubelet managed identity, there's no other configuration needed if the role assignment is right and the HelmRepo has provider: azure set.
"the managed identity assigned to the kubelet does have access to the whole resource group where my registry is stored"
I'm not very familiar with Azure permissions. Maybe you should try a role assignment like the one I showed above and see if that works.
Thank you @darkowlzz for your thorough reply. I have double-checked my existing configuration:
- I have connected to a cluster node and confirmed in /etc/kubernetes/azure.json the identity I am using, and also that the tenantId is correct (see also the CLI cross-check sketch after this list):
"aadClientId": "msi",
"aadClientSecret": "msi",
"tenantId": "xxxxxxxx-xxxxxx-xxxxxx-xxxx-xxxxxxxxxxxxx",
[...]
"userAssignedIdentityID": "xxxxxxx-yyyyyyy-zzzzz-zzzz-yyyyyyyy",
- That same identity is shown in the output of the 'ps' command:
/usr/local/bin/kubelet [...] kubernetes.azure.com/kubelet-identity-client-id=xxxxx
- That identity is assigned the AcrPull role in the ACR registry's role assignments.
- All Flux controllers are up to date with the latest version (0.29.0 for the source controller).
- The AKS cluster is on version 1.24.3.
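For reference, the kubelet identity can also be cross-checked from outside the node with the Azure CLI; a sketch, with resource group and cluster names as placeholders:
$ az aks show -g <resource-group> -n <cluster-name> --query identityProfile.kubeletidentity -o json
The clientId in the output should match the userAssignedIdentityID from azure.json, and the objectId is the principal that needs the AcrPull assignment.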
Sounds like some AKS cluster configuration-related differences. All the parameters used in our test cluster are here: https://github.com/fluxcd/test-infra/blob/65e1a901cbb9b3f9f27ffad7f9a32a6366eae1cc/tf-modules/azure/aks/main.tf#L6-L25. @masterphenix can you check how it differs from your cluster? It's possible that we are missing some common configuration that people usually use. The cluster I created used the default AKS version, 1.23.8.
I removed the role assignment to see the error, and the one I get seems to be different from the partial error string you shared:
failed to get credential from azure: error exchanging token: unexpected status code 401 from exchange request
Your error:
failed to get credential from azure: DefaultAzureCredential: failed to acquire a token....
I had to delete the pod to see the change in effect.
We use DefaultAzureCredential because it attempts to authenticate via various means, as documented in https://github.com/Azure/azure-sdk-for-go/blob/sdk/azidentity/v1.1.0/sdk/azidentity/default_azure_credential.go#L31-L36. There's some more detail about the error you shared in https://github.com/Azure/azure-sdk-for-go/blob/main/sdk/azidentity/TROUBLESHOOTING.md#troubleshoot-defaultazurecredential-authentication-issues.
Since I wanted to understand why it's working on my cluster before looking more into the code, I created a new role assignment of type App and assigned it to the system-assigned managed identity of the Kubernetes service with the AcrPull role. I restarted the source-controller (SC) pod, but it didn't work.
Checking the role assignment created by the Terraform config that works: it creates a role assignment of type User-assigned Managed Identity for the managed identity member <cluster-name>-agentpool. I tried creating the same myself, and the failing authentication logs in SC went away and login started working.
@masterphenix Can you check what type of role assignment you have set?
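A sketch of one way to inspect this with the Azure CLI (the registry name is a placeholder): list who actually holds AcrPull on the registry, then compare the principalId values against the kubelet identity's objectId:
$ az role assignment list \
    --scope $(az acr show -n <registry-name> --query id -o tsv) \
    --role AcrPull -o json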
The only notable difference from your config that I can see is the network_plugin we use, which is "azure", with network_policy also "azure". Here is an extract of our Terraform config:
resource "azurerm_kubernetes_cluster" "aks-cluster" {
name = var.name
location = azurerm_resource_group.aks-rg.location
resource_group_name = azurerm_resource_group.aks-rg.name
dns_prefix = var.aks_cluster_name
kubernetes_version = var.aks_version
default_node_pool {
name = "default"
vm_size = var.default.aks_node_size
node_count = 1
type = "VirtualMachineScaleSets"
max_pods = var.default.max_pods_per_node
os_disk_size_gb = var.default.os_disk_size
}
network_profile {
network_plugin = "azure"
network_policy = "azure"
outbound_type = "loadBalancer"
}
role_based_access_control {
enabled = true
azure_active_directory {
managed = true
client_app_id = null
server_app_id = null
server_app_secret = null
}
}
identity {
type = "SystemAssigned"
}
}
I also confirm that the role assignment I have is of type "User-assigned Managed Identity".
@masterphenix I didn't notice that your second comment has the full error:
DefaultAzureCredential: failed to acquire a token.
Attempted credentials:
EnvironmentCredential: missing environment variable AZURE_TENANT_ID
ManagedIdentityCredential: IMDS token request timed out
AzureCLICredential: Azure CLI not found on path
Based on https://github.com/Azure/azure-sdk-for-go/blob/main/sdk/azidentity/TROUBLESHOOTING.md#azure-virtual-machine-managed-identity, the third error case applies:
No response received from the managed identity endpoint.
Description:
No response was received for the request to IMDS or the request timed out.
Mitigation:
- Ensure the VM is configured for managed identity as described in managed identity documentation.
- Verify the IMDS endpoint is reachable on the VM. See below for instructions.
Thank you kindly for your investigations; they do help narrow down the issue. Following the mitigation provided, I executed this on the node:
$ curl 'http://169.254.169.254/metadata/identity/oauth2/token?resource=https://management.core.windows.net&api-version=2018-02-01' -H "Metadata: true"
The response suggests that the issue is linked to the fact that the node has several UAIs assigned to it:
{"error":"invalid_request","error_description":"Multiple user assigned identities exist, please specify the clientId / resourceId of the identity in the token request"}
I thought it was due to the aci_connector that was active on this cluster, which has its own UAI, but I have disabled it and I still get the same error. Below is the aci_connector Terraform config that we use, by the way:
addon_profile {
  aci_connector_linux {
    enabled     = true
    subnet_name = var.subnet
  }
}
Now, we also have AAD Pod Identity deployed for other workloads, so it seems that when AAD Pod Identity is present, the kubelet identity cannot be used by default, because the call to DefaultAzureCredential does not default to the kubelet identity first. I will try using AAD Pod Identity, but I was hoping to be able to avoid that and just use the kubelet identity instead.
I have explicitly used the kubelet identity by applying the AAD Pod Identity label on the source-controller, and it works this way 👍
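For anyone else landing here, a sketch of that label patch (the binding selector is a placeholder for whatever your AzureIdentityBinding selects on; note that if Flux manages its own manifests, kustomize-controller may revert a manual patch):
$ kubectl -n flux-system patch deployment source-controller --type merge \
    -p '{"spec":{"template":{"metadata":{"labels":{"aadpodidbinding":"<binding-selector>"}}}}}'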
@masterphenix, it seems that your issue is solved. Can we close this?
Yes, it is solved, thank you. Sorry, I forgot to confirm and close.
Hi all, @makkes @souleb @masterphenix
I know this issue is already closed, but we are currently running into the same issues mentioned here, and I thought it might be helpful to share my findings.
The root cause of this should be the "Multiple user assigned identities exist, please specify the clientId" error mentioned in:
{"error":"invalid_request","error_description":"Multiple user assigned identities exist, please specify the clientId / resourceId of the identity in the token request"}
With a system-assigned managed identity you can only have one identity; with user-assigned managed identities (UAI), up to 20 are possible. In our case, 6 UAIs (node pool identity, Azure Key Vault integration, Azure Policy integration, Azure Monitor integration, AKS GitOps extension (based on Flux), ...) are attached to the AKS nodes.
In this case, you will have to tell Azure which one to use.
The .NET SDK provides a bit more detail on this (it's not mentioned in the Go SDK): https://github.com/Azure/azure-sdk-for-net/blob/main/sdk/identity/Azure.Identity/README.md#specify-a-user-assigned-managed-identity-with-defaultazurecredential
In my opinion, this could be done with the following options:
- provide an option to define the UAI and expose it via the environment variable AZURE_CLIENT_ID. More details: https://github.com/Azure/azure-sdk-for-go/tree/main/sdk/azidentity#specify-a-user-assigned-managed-identity-for-defaultazurecredential
- grab the node pool UAI from /etc/kubernetes/azure.json (security-wise this might not be the best idea)
Based on my current understanding this should fix the issue.
Short follow-up on AAD Pod Identity: it is deprecated. The successor is Azure AD Workload Identity (https://azure.github.io/azure-workload-identity/docs/).
Happy to discuss this further.
Update: I did a quick POC to verify this, and I was able to get it working by adding the above environment variable to the source-controller deployment.
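For completeness, a sketch of how that environment variable can be set via a kustomize patch on a bootstrapped flux-system (the client ID value is a placeholder; this assumes the standard gotk-components.yaml layout):
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  - patch: |
      # Point DefaultAzureCredential at a specific user-assigned identity
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: source-controller
        namespace: flux-system
      spec:
        template:
          spec:
            containers:
              - name: manager
                env:
                  - name: AZURE_CLIENT_ID
                    value: "<uai-client-id>"
    target:
      kind: Deployment
      name: source-controller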