
UX/backend: Lower-case GPU name succeeds in launch, but blocks forever in exec

Open concretevitamin opened this issue 2 years ago • 1 comment

Adapted from a programmatic use case from Erick. Here's a CLI repro:

# Succeeds. Because we allow lower-case GPUs in launching.
sky launch -c myclus --gpus v100 ''

# This blocks forever. `sky queue` will show PENDING.
# This is because the requested [v100:1] does not fit the cluster's [V100:1]:
# the two names differ in case.
sky exec myclus --gpus v100 -- echo hi

# Works.
sky exec myclus --gpus V100 -- echo hi

An immediate fix is to canonicalize the GPU string during exec, the same way launch already accepts lower-case names.
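A minimal sketch of what such canonicalization could look like, assuming a case-insensitive lookup against a list of known accelerator names. `KNOWN_ACCELERATORS`, `canonicalize_accelerator`, and `parse_gpus` are hypothetical names for illustration, not SkyPilot's actual API or catalog:

```python
# Hypothetical list of canonical accelerator names (NOT SkyPilot's real catalog).
KNOWN_ACCELERATORS = ("V100", "A100", "T4", "K80")


def canonicalize_accelerator(name: str) -> str:
    """Return the canonical spelling of `name`, matched case-insensitively."""
    by_lower = {acc.lower(): acc for acc in KNOWN_ACCELERATORS}
    try:
        return by_lower[name.lower()]
    except KeyError:
        raise ValueError(f"Unknown accelerator: {name!r}") from None


def parse_gpus(spec: str) -> dict:
    """Parse a --gpus style string such as 'v100' or 'v100:2' into {name: count}."""
    name, _, count = spec.partition(":")
    return {canonicalize_accelerator(name.strip()): int(count) if count else 1}
```

With this in place, `parse_gpus("v100")` and `parse_gpus("V100")` produce the same dictionary, so a later comparison against the cluster's resources no longer depends on how the user typed the name.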

A bigger item is probably to programmatically check whether a task requirement can never be satisfied, e.g., `exec --gpus some_other_gpu`, and fail immediately.

concretevitamin, Jul 19 '22 16:07

#1075 will fix this bug.

> A bigger item is probably to programmatically check whether a task requirement can never be satisfied, e.g., `exec --gpus some_other_gpu`, and fail immediately.

Actually, `sky exec --gpus some_other_gpus` already fails immediately because of the `less_demanding_than` test. However, we may still need to investigate in more detail whether something other than the `--gpus` argument can make a task requirement unsatisfiable.
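To illustrate the idea of failing fast, here is a rough sketch of a "does the request fit the cluster" check. It is not the actual `less_demanding_than` implementation; `fits` and `check_or_fail` are hypothetical helpers, and resources are modeled as plain `{accelerator_name: count}` dictionaries:

```python
def fits(requested: dict, available: dict) -> bool:
    """True if every requested accelerator exists on the cluster in sufficient quantity."""
    return all(available.get(name, 0) >= count for name, count in requested.items())


def check_or_fail(requested: dict, available: dict) -> None:
    """Raise immediately instead of queueing a job that can never be scheduled."""
    if not fits(requested, available):
        raise ValueError(
            f"Requested {requested} can never be satisfied by "
            f"cluster resources {available}."
        )


cluster = {"V100": 1}
print(fits({"V100": 1}, cluster))            # exact match fits
print(fits({"some_other_gpu": 1}, cluster))  # different accelerator does not fit
print(fits({"v100": 1}, cluster))            # case mismatch does not fit either
```

Under this model, a request that differs from the cluster only in case is rejected just like a request for a different accelerator, which is why comparing names case-insensitively (or normalizing them first) matters before any such check runs.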

WoosukKwon, Aug 14 '22 08:08