cloud-provider-azure
cloud-controller-manager on RKE2 K8s throwing certificate errors in connecting to management.azure.com
What happened:
I was trying to deploy the cloud controller manager using the example here https://github.com/kubernetes-sigs/cloud-provider-azure/blob/master/examples/out-of-tree/cloud-controller-manager.yaml I had updated azure.json with the proper values from my environment, but the pods fail with this error:
F0513 07:50:25.509193 1 controllermanager.go:311] Cloud provider azure could not be initialized: could not init cloud provider azure: Retriable: true, RetryAfter: 0s, HTTPStatusCode: -1, RawError: Get "https://management.azure.com/subscriptions/<subscription-id>/providers?api-version=2020-06-01": x509: certificate signed by unknown authority
I tried fetching a token with a manual curl call to the Azure subscription from the same VM, and it succeeds without any certificate errors:
curl -X POST -H 'Content-Type: application/x-www-form-urlencoded' https://login.microsoftonline.com/<tenant-id>/oauth2/v2.0/token -d 'client_id=<service-principal-id>' -d 'grant_type=client_credentials' -d 'scope=2ff814a6-3304-4ab8-85cb-cd0e6f879c1d%2F.default' -d 'client_secret=<client-secret>'
Can someone please help me understand when these pods throw x509: certificate signed by unknown authority and how we can fix it?
What you expected to happen:
Expected the cloud controller to fetch a token from Azure properly.
How to reproduce it (as minimally and precisely as possible):
Create an RKE2 (https://docs.rke2.io) cluster with Kubernetes version 1.21.4 and deploy the cloud controller manager using the example given here https://github.com/kubernetes-sigs/cloud-provider-azure/blob/master/examples/out-of-tree/cloud-controller-manager.yaml
Anything else we need to know?:
N/A
Environment:
- Kubernetes version (use kubectl version): 1.21.4
- Cloud provider or hardware configuration: Azure (Standard D48s v3 (48 vcpus, 192 GiB memory))
- OS (e.g. cat /etc/os-release): RHEL 8.2
- Kernel (e.g. uname -a): Linux sfdev2628626-f755be1b-node0 4.18.0-193.75.1.el8_2.x86_64
- Install tools: N/A
- Network plugin and version (if this is a network-related bug): Cilium
- Others: N/A
@MartinForReal @lzhecheng @jwtty @nearora-msft @feiskyer @nilo19 sorry for tagging you directly. Can one of you please look into this? We are not able to use the library because of the issue described above.
It looks like the Azure cloud controller image itself (the one published by the project) doesn't contain a proper root cert bundle. That's the most likely reason it would complain about not trusting the cert that Azure is presenting.
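One way to test the missing-bundle theory is to check which CA bundle files exist in the filesystem the controller actually sees. The sketch below assumes the candidate list matches the locations Go's crypto/x509 package probes on Linux (to my knowledge it also honors the SSL_CERT_FILE and SSL_CERT_DIR environment variables). The function name list_ca_bundles and its ROOT argument are hypothetical and not part of cloud-provider-azure.

```shell
#!/bin/sh
# Candidate CA bundle files. Assumption: this mirrors the locations Go's
# crypto/x509 checks on Linux; the list is not taken from the thread.
CA_BUNDLE_CANDIDATES="
/etc/ssl/certs/ca-certificates.crt
/etc/pki/tls/certs/ca-bundle.crt
/etc/ssl/ca-bundle.pem
/etc/pki/tls/cacert.pem
/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem
/etc/ssl/cert.pem
"

# list_ca_bundles [ROOT]
# Prints every candidate that is a readable, non-empty file under ROOT.
# ROOT (hypothetical parameter) defaults to the live filesystem; it can also
# point at an unpacked copy of a container image.
# Returns 0 if at least one bundle was found, 1 otherwise.
list_ca_bundles() {
    root="${1:-}"
    found=1
    for f in $CA_BUNDLE_CANDIDATES; do
        # -r follows symlinks, so a dangling link counts as missing.
        if [ -r "$root$f" ] && [ -s "$root$f" ]; then
            printf '%s\n' "$f"
            found=0
        fi
    done
    return "$found"
}
```

If the published image has no shell, the function can be pointed at an unpacked copy of the image filesystem, for example list_ca_bundles /tmp/ccm-rootfs (path hypothetical). No output and a non-zero exit status would be consistent with the theory above.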
@feiskyer @nilo19 @lzhecheng Do you think it is a good idea to switch base image to gcr.io/distroless/static-debian11?
Any update here? It's weird that the image published by MS doesn't trust Azure endpoints.
@MartinForReal why don't we see such errors on CAPZ clusters?
I think it is because /etc/ca-certificates is mounted from the host node.
Hello, using /etc/ca-certificates as the mount path instead of /etc/ssl worked for me. The host path remains the same; it depends on your host OS.
I was using RHEL as the host OS, so /etc/ssl should work. I used a RHEL-based base image here https://github.com/kubernetes-sigs/cloud-provider-azure/blob/3b92f160f4fd3e0c0272ee5cf10e0f4a950a37cc/Dockerfile#L32, built the azure-cloud-controller-manager image, and it worked without changing any other configuration in the deployment YAML.
Are there separate instructions for which image to use on each host OS? Thanks for your help.
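A possible explanation for why the mount path matters: a hostPath mount only helps if the files the container sees actually resolve. On RHEL-family hosts the entries under /etc/ssl are, as far as I know, symlinks into /etc/pki, so they could dangle in a container that has /etc/ssl mounted but no /etc/pki. This is an assumption, not something confirmed in the thread. The sketch below lists symlinks under a directory whose targets do not resolve; the function name broken_links is hypothetical.

```shell
#!/bin/sh
# broken_links DIR
# Prints symlinks under DIR whose targets cannot be resolved.
# With -L, find follows symlinks, so anything still reported as type "l"
# is a link that could not be followed (a dangling link).
broken_links() {
    find -L "$1" -type l 2>/dev/null
}
```

Running broken_links /etc/ssl on the host itself may print nothing, because the targets exist there. What matters is the view from inside the container, where only the mounted path is present, so the check is more telling when run against a copy of the mounted directory placed somewhere without the rest of /etc.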
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
no activities for a long time /close
@feiskyer: Closing this issue.
The same thing seems to happen when deploying Azure CCM on Flatcar: https://github.com/kubernetes-sigs/cluster-api-provider-azure/issues/3387#issuecomment-1495684214
The same issue and error also reproduce on Mariner OS (CAPZ image).