[k8s][ux] Auto-exclude stale Kubernetes cloud

Open romilbhardwaj opened this issue 2 years ago • 4 comments

I often terminate a Kubernetes cluster externally using the cloud console or CLI (e.g., gcloud container clusters delete <cluster-name> --region us-central1-c), but then forget to run sky check to update the list of enabled clouds.

As a result, the next sky launch fails:

sky.exceptions.ResourcesUnavailableError: Timed out when trying to get node info from Kubernetes cluster. Please check if the cluster is healthy and retry.

We should consider printing a warning and continuing by either:

  1. Excluding Kubernetes from the list of clouds considered by the optimizer
  2. Removing Kubernetes from the list of enabled clouds stored in global user state.

Option 1 is less aggressive and doesn't require the user to re-run sky check if the failure turns out to be transient.
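
A rough sketch of what option 1 could look like, assuming a per-cloud health check is available. The function and parameter names below (filter_reachable_clouds, health_checks) are hypothetical and are not SkyPilot's actual API; the point is only that an unreachable cloud is skipped with a warning while the stored list of enabled clouds is left untouched.

```python
import logging
from typing import Callable, Dict, List

logger = logging.getLogger(__name__)


def filter_reachable_clouds(
    enabled_clouds: List[str],
    health_checks: Dict[str, Callable[[], None]],
) -> List[str]:
    """Return the enabled clouds whose health check passes.

    Hypothetical helper, not part of SkyPilot. A health check signals an
    unreachable cloud by raising TimeoutError or ConnectionError.
    The input list is not modified, so a transient failure does not
    require re-running `sky check`.
    """
    reachable = []
    for cloud in enabled_clouds:
        check = health_checks.get(cloud)
        if check is not None:
            try:
                check()
            except (TimeoutError, ConnectionError) as e:
                # Warn and skip this cloud for the current launch only.
                logger.warning(
                    'Skipping %s for this launch: %s. '
                    'Run `sky check` if the cluster was deleted.', cloud, e)
                continue
        reachable.append(cloud)
    return reachable
```

The optimizer would then only consider the returned list for this launch, and the next launch would probe the cluster again.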

romilbhardwaj avatar Nov 21 '23 02:11 romilbhardwaj

This is also related to #3013

Michaelvll avatar Feb 05 '24 06:02 Michaelvll

Going to self-assign and work on this!

kbrgl avatar Feb 24 '24 20:02 kbrgl

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.

github-actions[bot] avatar Jun 24 '24 01:06 github-actions[bot]

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.

github-actions[bot] avatar Oct 23 '24 01:10 github-actions[bot]

This issue was closed because it has been stalled for 10 days with no activity.

github-actions[bot] avatar Nov 03 '24 02:11 github-actions[bot]

I'm running into this after having renewed my k8s cert in my kube config. I can see the pods as unhealthy; there might have been another issue.

However, I'm unable to start new clusters on said k8s due to this error.

Update

It seems like the error is actually swallowing a real error (BAD_BASE64_DECODE in my case), which I can only see when executing the purge command.
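
To illustrate the kind of swallowing I mean, here is a minimal sketch of how the underlying error could be kept visible with exception chaining. This is not SkyPilot's code: load_cert is a hypothetical helper, and ResourcesUnavailableError is redefined locally as a stand-in for the exception in the traceback.

```python
import base64
import binascii


class ResourcesUnavailableError(Exception):
    """Local stand-in for the exception type seen in the traceback."""


def load_cert(cert_b64: str) -> bytes:
    """Decode a base64-encoded cert, keeping the root cause on failure.

    Hypothetical helper for illustration only.
    """
    try:
        return base64.b64decode(cert_b64, validate=True)
    except binascii.Error as e:
        # Include the original message and chain the exception so the
        # real cause shows up in the traceback instead of a generic timeout.
        raise ResourcesUnavailableError(
            f'Failed to get node info from Kubernetes cluster: {e}') from e
```

With `raise ... from e`, the decode failure is attached as `__cause__` and printed alongside the higher-level error.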

chris-aeviator avatar Dec 29 '24 00:12 chris-aeviator

Thanks for the report @chris-aeviator - that sounds bad. Can you share the full output log and the commands you ran so I can reproduce it?

romilbhardwaj avatar Dec 29 '24 01:12 romilbhardwaj

Nvm @chris-aeviator, I can reproduce this. Looks like a recent regression from #4443. Being fixed in https://github.com/skypilot-org/skypilot/pull/4514 - can you give that branch a try and see if it fixes your issue too?

romilbhardwaj avatar Dec 29 '24 02:12 romilbhardwaj

Looking into this.

aylei avatar Feb 08 '25 03:02 aylei