cloud-controller-manager on RKE2 K8s throwing certificate errors when connecting to management.azure.com

Open aniket202 opened this issue 3 years ago • 7 comments

What happened:

I was trying to deploy the cloud controller manager using the example at https://github.com/kubernetes-sigs/cloud-provider-azure/blob/master/examples/out-of-tree/cloud-controller-manager.yaml. I had updated azure.json with the proper values from my environment, but the pods fail with this error:

F0513 07:50:25.509193       1 controllermanager.go:311] Cloud provider azure could not be initialized: could not init cloud provider azure: Retriable: true, RetryAfter: 0s, HTTPStatusCode: -1, RawError: Get "https://management.azure.com/subscriptions/<subscription-id>/providers?api-version=2020-06-01": x509: certificate signed by unknown authority

I tried fetching a token for the Azure subscription with a manual curl call from the same VM, and it succeeds without any certificate errors:

curl -X POST -H 'Content-Type: application/x-www-form-urlencoded' https://login.microsoftonline.com/<tenant-id>/oauth2/v2.0/token -d 'client_id=<service-principal-id>' -d 'grant_type=client_credentials' -d 'scope=2ff814a6-3304-4ab8-85cb-cd0e6f879c1d%2F.default' -d 'client_secret=<client-secret>'

Can someone please help me understand when these pods throw x509: certificate signed by unknown authority and how we can fix it?

What you expected to happen:

I expected the cloud controller manager to fetch a token from Azure successfully.

How to reproduce it (as minimally and precisely as possible):

Create an RKE2 (https://docs.rke2.io) cluster with Kubernetes version 1.21.4 and deploy the cloud controller manager using the example at https://github.com/kubernetes-sigs/cloud-provider-azure/blob/master/examples/out-of-tree/cloud-controller-manager.yaml

Anything else we need to know?:

N/A

Environment:

  • Kubernetes version (use kubectl version): 1.21.4
  • Cloud provider or hardware configuration: Azure (Standard D48s v3 (48 vcpus, 192 GiB memory))
  • OS (e.g: cat /etc/os-release): RHEL 8.2
  • Kernel (e.g. uname -a): Linux sfdev2628626-f755be1b-node0 4.18.0-193.75.1.el8_2.x86_64
  • Install tools: N/A
  • Network plugin and version (if this is a network-related bug): Cilium
  • Others: N/A

aniket202, May 13 '22 07:05

@MartinForReal @lzhecheng @jwtty @nearora-msft @feiskyer @nilo19 sorry for tagging you directly. Could one of you please look into this? We are not able to use the library because of the issue described above.

It looks like the Azure cloud controller image itself (the one published by the project) doesn't contain the proper root cert bundle. That's the most likely reason it would complain about not trusting the certificate that Azure presents.

rajivml, May 20 '22 08:05
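
To make the root-bundle explanation above concrete: the controller manager is a Go binary, and on Linux Go's crypto/x509 package builds its system trust pool from a short list of well-known bundle files (or from SSL_CERT_FILE when that variable is set). If none of them is readable inside the container, TLS verification fails with "certificate signed by unknown authority". The sketch below prints the first of those bundle paths that exists and is non-empty under a given filesystem root. The function name first_ca_bundle and its optional root-prefix argument are illustrative assumptions, not part of the project, and the path list should be checked against the Go version in use.

```shell
#!/bin/sh
# Bundle files that Go's crypto/x509 tries on Linux, in order
# (list taken from the Go standard library; verify against your Go version).
GO_CA_BUNDLES="
/etc/ssl/certs/ca-certificates.crt
/etc/pki/tls/certs/ca-bundle.crt
/etc/ssl/ca-bundle.pem
/etc/pki/tls/cacert.pem
/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem
/etc/ssl/cert.pem
"

# first_ca_bundle [ROOT]
# Hypothetical helper: prints the first candidate that exists and is non-empty
# under ROOT (default: the live filesystem). Returns 1 if none is found.
first_ca_bundle() {
    root=${1:-}
    for candidate in $GO_CA_BUNDLES; do
        # -s follows symlinks, so a dangling link counts as missing.
        if [ -s "$root$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}
```

Running first_ca_bundle with no argument on the node shows what the host offers. Pointing ROOT at an unpacked copy of the image filesystem shows what the container would see before any hostPath mount is applied.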

@feiskyer @nilo19 @lzhecheng Do you think it is a good idea to switch the base image to gcr.io/distroless/static-debian11?

MartinForReal, May 20 '22 09:05

Any update here? It's weird that the image published by MS doesn't trust Azure endpoints.

rajivml, May 23 '22 06:05

@MartinForReal why don't we see such errors on CAPZ clusters?

feiskyer, May 23 '22 06:05

I think it is because /etc/ca-certificates is mounted from the host node.

MartinForReal, May 23 '22 07:05

Hello, using /etc/ca-certificates as the mount path instead of /etc/ssl worked for me. The host path stays the same; it depends on your host OS.

dariaserkova, May 25 '22 09:05

I was using RHEL as the host OS, so /etc/ssl should work. I used the RHEL-based base image here https://github.com/kubernetes-sigs/cloud-provider-azure/blob/3b92f160f4fd3e0c0272ee5cf10e0f4a950a37cc/Dockerfile#L32, built the azure-cloud-controller-manager image, and it worked without changing any other configuration in the deployment YAML.

Are there separate instructions on which image to use for each host OS? Thanks for your help.

aniket202, May 25 '22 14:05
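
One possible explanation for the different results reported above, not confirmed in this thread: on RHEL-family hosts /etc/ssl/certs is normally a symlink into /etc/pki, so a hostPath mount of /etc/ssl alone hands the container a link whose target only exists if the image ships its own /etc/pki. That would be consistent with a RHEL-based image working unchanged. A minimal sketch for checking this on a node follows. The function name certs_escape_mount is hypothetical, and the code assumes a readlink that supports -f (GNU coreutils or BusyBox).

```shell
#!/bin/sh
# certs_escape_mount MOUNT_ROOT CERTS_DIR
# Hypothetical helper: returns 0 when CERTS_DIR resolves to a location outside
# MOUNT_ROOT (the link would dangle inside a container that only mounts
# MOUNT_ROOT), 1 when it stays inside, and 2 if a path cannot be resolved.
certs_escape_mount() {
    mount_root=$(readlink -f "$1") || return 2
    certs_real=$(readlink -f "$2") || return 2
    case "$certs_real/" in
        "$mount_root"/*) return 1 ;;  # resolved path is still under the mount
        *) return 0 ;;                # resolved path lives elsewhere on the host
    esac
}

# Example: check the directory that the upstream manifest mounts from the host.
if certs_escape_mount /etc/ssl /etc/ssl/certs; then
    echo "/etc/ssl/certs points outside /etc/ssl; mount its real location too"
fi
```

If the check reports an escape, mounting the resolved directory as well, or choosing an image that carries its own CA bundle, are the two options the comments above arrive at.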

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot, Aug 23 '22 14:08

no activities for a long time /close

feiskyer, Aug 26 '22 08:08

@feiskyer: Closing this issue.

In response to this:

no activities for a long time /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot, Aug 26 '22 08:08

The same thing seems to be happening when deploying Azure CCM on Flatcar: https://github.com/kubernetes-sigs/cluster-api-provider-azure/issues/3387#issuecomment-1495684214

invidian, Apr 04 '23 10:04

The same issue and error also reproduce on Mariner OS (CAPZ image).

Jasstkn, Jan 22 '24 13:01