
Deleting failed inference workspace using kubectl is stuck.


Describe the bug

If the inference pod is in a bad state, I try to delete it by deleting the workspace (kubectl delete workspace workspace-custom-llm). The command neither succeeds nor fails, and I have no insight into whether it is stuck or just taking too long to finish.

Steps To Reproduce

1. Create an AKS cluster.
2. Create a GPU node pool.
3. Install Kaito via Helm.
4. Deploy the YAML file below with a preferred node and a model ID that does not exist.
5. Delete the failed workspace using kubectl.
6. The command hangs; it can only be cancelled with Ctrl-C.
7. Deleting the pod directly from the Portal succeeds.
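
A kubectl delete that hangs like this usually means a finalizer on the workspace object is not being cleared by the controller. A minimal diagnostic sketch before force-cancelling; the controller namespace and deployment name below are assumptions based on a default Helm install, so adjust them to your setup:

```shell
# Check whether the delete is blocked on a finalizer and what the controller reports.
kubectl get workspace workspace-custom-llm -o jsonpath='{.metadata.deletionTimestamp} {.metadata.finalizers}'
kubectl describe workspace workspace-custom-llm

# Inspect the Kaito controller logs for the reconcile loop that should clear the finalizer.
# Namespace and deployment name are assumptions; adjust to your Helm install.
kubectl logs -n kaito-workspace deployment/kaito-workspace --tail=100
```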

YAML

```yaml
apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: workspace-custom-llm
resource:
  labelSelector:
    matchLabels:
      apps: llm-inference
  preferredNodes:
    - aks-gputest-23010365-vmss000000
inference:
  template:
    spec:
      containers:
        - name: custom-llm-container
          image: ghcr.io/kaito-project/kaito/llm-reference-preset:latest
          command: ["accelerate"]
          args:
            - "launch"
            - "--num_processes"
            - "1"
            - "--num_machines"
            - "1"
            - "--gpu_ids"
            - "all"
            - "inference_api.py"
            - "--pipeline"
            - "text-generation"
            - "--trust_remote_code"
            - "--allow_remote_files"
            - "--pretrained_model_name_or_path"
            - "Phi-3.5-mini-instruct" # Replace <MODEL_ID> with the specific HuggingFace model identifier
            - "--torch_dtype"
            - "float16" # Set to "float16" for compatibility with V100 GPUs; use "bfloat16" for A100, H100 or newer GPUs
          volumeMounts:
            - name: dshm
              mountPath: /dev/shm
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory
```

Expected behavior

kubectl delete succeeds.

Logs

Environment

  • Kubernetes version (use kubectl version): 1.30.9
  • OS (e.g: cat /etc/os-release): Ubuntu
  • Install tools: kubectl
  • Others:

Additional context

hungry1526 • Mar 18 '25

@hungry1526 Is this error reproducible, and do you have the kaito-pod logs? I couldn’t reproduce it following your steps.

A couple of things to check:

  • Are you adding the label from resourceSpec to the node (in your case apps: llm-inference)? If not, Kaito won't match the pod to preferredNode and will try to create a new one (see the sketch after this list).
  • In this case, when you start and delete a workspace, nodeclaim.WaitForPendingNodeClaims waits 4 minutes for machine creation, followed by ensureNodePlugins, which times out after 1 minute—so worst case, deletion should happen within 5 minutes.
  • But since you're using a pre-created machine, labeling correctly should mean you don't have any delays.
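
A minimal sketch of the label check from the first bullet, using the node name and matchLabels taken from the reported YAML:

```shell
# Verify whether the preferred node already carries the label the workspace selects on.
kubectl get node aks-gputest-23010365-vmss000000 --show-labels | grep apps=llm-inference

# If it does not, add it so Kaito matches the existing node instead of provisioning a new one.
kubectl label node aks-gputest-23010365-vmss000000 apps=llm-inference
```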

Also, I noticed no GPU claim in your container spec—assuming you need one. I’ve updated the docs to help: https://github.com/kaito-project/kaito/pull/935
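
For reference, a sketch of what the GPU claim could look like inside the container spec; the single-GPU limit is an assumption and presumes the NVIDIA device plugin is running on the node:

```yaml
# Add under inference.template.spec.containers[0]:
resources:
  limits:
    nvidia.com/gpu: "1"   # request one GPU from the node; the count is an assumption
```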

ishaansehgal99 • Mar 19 '25

Thank you, @ishaansehgal99. I will let you know when I hit the issue next time and share the logs from both pods.

hungry1526 • Mar 21 '25

@hungry1526 any update on this?

zhuangqh • Apr 28 '25