Deleting a failed inference workspace using kubectl gets stuck.
Describe the bug
The inference pod was in a bad state, so I tried to delete it by deleting its workspace (`kubectl delete workspace workspace-custom-llm`). The command neither succeeded nor failed, and I had no insight into whether it was stuck or simply taking a long time to finish.
Steps To Reproduce
1. Create an AKS cluster.
2. Create a GPU node pool.
3. Install Kaito via Helm.
4. Deploy the YAML file below with a preferred node and a model ID that does not exist.
5. Delete the failed workspace using kubectl.
6. The command gets stuck and can only be cancelled with Ctrl-C (see the diagnostic sketch after these steps).
7. Deleting the pod directly from the Portal succeeds.
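A quick way to check whether the delete is blocked on a finalizer rather than just slow (generic kubectl, nothing Kaito-specific):

```bash
# If deletionTimestamp is set but finalizers remain, a controller is still blocking the delete
kubectl get workspace workspace-custom-llm -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.finalizers}{"\n"}'

# Events and conditions usually show what the controller is waiting on
kubectl describe workspace workspace-custom-llm
```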
YAML
```yaml
apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: workspace-custom-llm
resource:
  labelSelector:
    matchLabels:
      apps: llm-inference
  preferredNodes:
    - aks-gputest-23010365-vmss000000
inference:
  template:
    spec:
      containers:
        - name: custom-llm-container
          image: ghcr.io/kaito-project/kaito/llm-reference-preset:latest
          command: ["accelerate"]
          args:
            - "launch"
            - "--num_processes"
            - "1"
            - "--num_machines"
            - "1"
            - "--gpu_ids"
            - "all"
            - "inference_api.py"
            - "--pipeline"
            - "text-generation"
            - "--trust_remote_code"
            - "--allow_remote_files"
            - "--pretrained_model_name_or_path"
            - "Phi-3.5-mini-instruct" # Replace <MODEL_ID> with the specific HuggingFace model identifier
            - "--torch_dtype"
            - "float16" # Set to "float16" for compatibility with V100 GPUs; use "bfloat16" for A100, H100 or newer GPUs
          volumeMounts:
            - name: dshm
              mountPath: /dev/shm
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory
```
Expected behavior
kubectl delete succeeds.
Logs
Environment
- Kubernetes version (use `kubectl version`): 1.30.9
- OS (e.g. `cat /etc/os-release`): Ubuntu
- Install tools: kubectl
- Others:
Additional context
@hungry1526 Is this error reproducible, and do you have the kaito-pod logs? I couldn’t reproduce it following your steps.
A couple of things to check:
- Are you adding the label from `resourceSpec` to the node (in your case `llm-inference`)? If not, Kaito won't match the pod to `preferredNode` and will try to create a new one (see the label-command sketch after this list).
- In this case, when you start and delete a workspace, `nodeclaim.WaitForPendingNodeClaims` waits 4 minutes for machine creation, followed by `ensureNodePlugins`, which times out after 1 minute, so in the worst case deletion should happen within 5 minutes.
- But since you're using a pre-created machine, labeling it correctly should mean you don't hit any of these delays.
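If the label is what's missing, something along these lines should add it (a sketch using the node name and label from your manifest; adjust if your selector differs):

```bash
# Label the pre-created GPU node so the workspace's labelSelector (apps: llm-inference) matches it
kubectl label node aks-gputest-23010365-vmss000000 apps=llm-inference

# Confirm the label is present
kubectl get node aks-gputest-23010365-vmss000000 --show-labels
```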
Also, I noticed no GPU claim in your container spec—assuming you need one. I’ve updated the docs to help: https://github.com/kaito-project/kaito/pull/935
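For reference, a minimal sketch of what a GPU request could look like in the container spec (assuming the NVIDIA device plugin exposes `nvidia.com/gpu` on the node; see the linked PR for the exact recommended spec):

```yaml
containers:
  - name: custom-llm-container
    image: ghcr.io/kaito-project/kaito/llm-reference-preset:latest
    resources:
      limits:
        nvidia.com/gpu: 1  # request one GPU from the node via the device plugin
```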
Thank you, @ishaansehgal99. I will let you know when I hit the issue next time and share the logs from both pods.
@hungry1526 any update on this?