
[Bug] Head node autoscaler container fails to communicate to the kubernetes api with a 401 in Azure Kubernetes 1.30

Open · plytro opened this issue 1 year ago

Search before asking

  • [X] I searched the issues and found no similar issues.

KubeRay Component

Others

What happened + What you expected to happen

What happened:

After upgrading to AKS version 1.30 we noted that our head node pods worked for approximately 1 hour, after which the autoscaler container in the pod starts to get HTTP 401 responses when querying the Kubernetes API. This causes the pod's readiness probe to fail, resulting in the loss of access to the head node via the LoadBalancer.

Through troubleshooting we found the pod definition included this projected volume for the service account token for api access indicating the token has a lifetime of 3607 seconds.

As noted in the AKS 1.30 release notes, service account tokens are no longer given an extended lifetime; by default a token now expires after its requested lifetime, which here is 3607 seconds (roughly one hour).

I'm not 100% positive I'm reading this code correctly, but it seems like the HTTP client is instantiated once, reads the token at instantiation, and never re-reads it to account for token expiration. We found that if we restart the head node pod, API communication begins working again and HTTP calls to the k8s API succeed for another hour.

- name: kube-api-access-shgh8
  projected:
    defaultMode: 420
    sources:
    - serviceAccountToken:
        expirationSeconds: 3607
        path: token

Logs

The Ray head is ready. Starting the autoscaler.
  File "/opt/app-root/.conda/envs/env/bin/ray", line 11, in <module>
    sys.exit(main())
             ^^^^^^
  File "/opt/app-root/.conda/envs/env/lib/python3.11/site-packages/ray/scripts/scripts.py", line 2615, in main
    return cli()
           ^^^^^
  File "/opt/app-root/.conda/envs/env/lib/python3.11/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/.conda/envs/env/lib/python3.11/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/opt/app-root/.conda/envs/env/lib/python3.11/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/.conda/envs/env/lib/python3.11/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/.conda/envs/env/lib/python3.11/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/.conda/envs/env/lib/python3.11/site-packages/ray/scripts/scripts.py", line 2338, in kuberay_autoscaler
    run_kuberay_autoscaler(cluster_name, cluster_namespace)
  File "/opt/app-root/.conda/envs/env/lib/python3.11/site-packages/ray/autoscaler/_private/kuberay/run_autoscaler.py", line 86, in run_kuberay_autoscaler
    ).run()
      ^^^^^
  File "/opt/app-root/.conda/envs/env/lib/python3.11/site-packages/ray/autoscaler/_private/monitor.py", line 584, in run
    self._run()
  File "/opt/app-root/.conda/envs/env/lib/python3.11/site-packages/ray/autoscaler/_private/monitor.py", line 389, in _run
    self.autoscaler.update()
  File "/opt/app-root/.conda/envs/env/lib/python3.11/site-packages/ray/autoscaler/_private/autoscaler.py", line 384, in update
    raise e
  File "/opt/app-root/.conda/envs/env/lib/python3.11/site-packages/ray/autoscaler/_private/autoscaler.py", line 377, in update
    self._update()
  File "/opt/app-root/.conda/envs/env/lib/python3.11/site-packages/ray/autoscaler/_private/autoscaler.py", line 400, in _update
    self.non_terminated_nodes = NonTerminatedNodes(self.provider)
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/.conda/envs/env/lib/python3.11/site-packages/ray/autoscaler/_private/autoscaler.py", line 124, in __init__
    self.all_node_ids = provider.non_terminated_nodes({})
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/.conda/envs/env/lib/python3.11/site-packages/ray/autoscaler/batching_node_provider.py", line 162, in non_terminated_nodes
    self.node_data_dict = self.get_node_data()
                          ^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/.conda/envs/env/lib/python3.11/site-packages/ray/autoscaler/_private/kuberay/node_provider.py", line 333, in get_node_data
    self._raycluster = self._get(f"rayclusters/{self.cluster_name}")
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/.conda/envs/env/lib/python3.11/site-packages/ray/autoscaler/_private/kuberay/node_provider.py", line 519, in _get
    return self.k8s_api_client.get(path)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/.conda/envs/env/lib/python3.11/site-packages/ray/autoscaler/_private/kuberay/node_provider.py", line 273, in get
    result.raise_for_status()
  File "/opt/app-root/.conda/envs/env/lib/python3.11/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://kubernetes.default:443/apis/ray.io/v1/namespaces/sandbox-plytro/rayclusters/1v1faq9sg7fsf2pcg4sqxs4er-0-raycluster-h662v

What you expected to happen

The cluster autoscaler should not lose the ability to communicate with the kube API when the token in the projected volume expires and is replaced with a valid token.

Tagging @andrewsykim @kevin85421 per a discussion in the ray slack.

Reproduction script

I'm working with our dev team to get a code sample that we use to create the RayCluster object that gets sent into the cluster. As this is injected into the definition, I'm not sure how useful it may be for this issue.

Anything else

Notes on token lifetime: https://github.com/kubernetes/enhancements/blob/master/keps/sig-auth/1205-bound-service-account-tokens/README.md
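The shortened lifetime can be confirmed from the token itself: a service account token is a JWT whose payload carries `iat` and `exp` claims. A small helper (not part of Ray or KubeRay; it inspects claims without validating the signature) to compute the lifetime:

```python
import base64
import json


def token_lifetime_seconds(token: str) -> int:
    """Return exp - iat from a JWT's payload.

    Note: this decodes the payload WITHOUT verifying the signature;
    it is only useful for inspecting the token's claimed lifetime.
    """
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped base64 padding
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    return claims["exp"] - claims["iat"]
```

Running this against the file at /var/run/secrets/kubernetes.io/serviceaccount/token on an affected AKS 1.30 node should report roughly 3607 seconds, matching the projected volume's `expirationSeconds`.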

These lines of code and the stack trace led me to the python code referenced above: https://github.com/ray-project/kuberay/blob/master/ray-operator/controllers/ray/common/pod.go#L120 https://github.com/ray-project/kuberay/blob/master/ray-operator/controllers/ray/common/pod.go#L396 https://github.com/ray-project/kuberay/blob/master/ray-operator/controllers/ray/common/pod.go#L454

Are you willing to submit a PR?

  • [ ] Yes I am willing to submit a PR!

plytro avatar Aug 21 '24 16:08 plytro

We should consider adopting the InClusterConfigLoader in the official Kubernetes Python client: https://github.com/kubernetes-client/python/blob/master/kubernetes/base/config/incluster_config.py

Or at least something similar that automatically refreshes the token: https://github.com/kubernetes-client/python/blob/392a8c1d0767ce534b121b3b0553e5b1297e430e/kubernetes/base/config/incluster_config.py#L95-L109

andrewsykim avatar Aug 22 '24 16:08 andrewsykim

The PR is reverted.

kevin85421 avatar Jan 22 '25 22:01 kevin85421

Is this issue still active? More specifically, is someone working on (re)merging the PR?

We have an ugly hack for this: creating a long-lived token from a service account and injecting it using Kyverno. But it would be nice to have a better solution.
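For reference, the long-lived-token workaround can be expressed as a legacy service-account token Secret, which Kubernetes populates with a non-expiring token (all names below are hypothetical; the namespace is taken from the error URL earlier in this issue):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: ray-autoscaler-long-lived-token   # hypothetical name
  namespace: sandbox-plytro               # namespace from the error URL above
  annotations:
    kubernetes.io/service-account.name: ray-head-sa  # hypothetical service account
type: kubernetes.io/service-account-token
```

This works but reintroduces the security drawbacks that bound service account tokens were designed to remove, which is why token refresh in the client is the better fix.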

David2011Hernandez avatar Mar 06 '25 15:03 David2011Hernandez

It looks like the PR you're mentioning has been merged and is part of Ray 2.42: https://github.com/ray-project/ray/releases/tag/ray-2.42.0

danc avatar Mar 07 '25 09:03 danc

It eventually didn't get reverted. https://github.com/ray-project/ray/pull/50013

kevin85421 avatar Mar 26 '25 17:03 kevin85421

Just to add a bit of confirmation to this: this update resolved the issue we saw on our Azure AKS managed cluster with OIDC issuers enabled.

Takadimi avatar Mar 26 '25 18:03 Takadimi