skypilot
[Ray Autoscaler] `ray status`, Accelerator Placement Group
- Launch P3.8x with 4 V100s.
- Submit a job with 2 V100s.
- `ray status` on the head node seems to show incorrect output.
[Screenshot: Screen Shot 2022-04-21 at 6 40 59 PM]
This could be a Ray bug.
Did you launch the task with `sky launch` and the generated Ray program? Ray's resource allocation is a bookkeeping system. Even with `num_gpus=2` for a placement group (pg), you may also need to specify `resources={"V100": 2}` in the pg. (reference)
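To illustrate the bookkeeping point, here is a minimal toy model in plain Python. It is not Ray's implementation and does not call Ray; the `ResourceLedger` class and its methods are hypothetical names made up for this sketch. It assumes a node that advertises 4 units of `GPU` and 4 units of a custom `V100` label, and shows that a request naming only `GPU` leaves the `V100` counter untouched, which is roughly why a status report can look wrong when the custom resource is not requested explicitly.

```python
from typing import Dict, Tuple


class ResourceLedger:
    """Toy ledger of logical resources (hypothetical; not Ray's actual code)."""

    def __init__(self, total: Dict[str, float]) -> None:
        self.total = dict(total)
        self.used = {name: 0.0 for name in total}

    def reserve(self, request: Dict[str, float]) -> None:
        # Validate the whole request first so a failure changes nothing.
        for name, amount in request.items():
            if name not in self.total:
                raise ValueError(f"unknown resource: {name}")
            if self.used[name] + amount > self.total[name]:
                raise ValueError(f"not enough {name}")
        # Only the resource names present in the request are debited.
        for name, amount in request.items():
            self.used[name] += amount

    def status(self) -> Dict[str, Tuple[float, float]]:
        # Map each resource name to (used, total).
        return {name: (self.used[name], self.total[name]) for name in self.total}


# Assumed node: 4 GPUs, also advertised under the custom label "V100".
ledger = ResourceLedger({"GPU": 4, "V100": 4})

# Request that names only GPU: the V100 counter is not debited.
ledger.reserve({"GPU": 2})
gpu_only = ledger.status()

# Request that names both: both counters are debited.
ledger.reserve({"GPU": 2, "V100": 2})
both = ledger.status()

print(gpu_only)
print(both)
```

In this model the two counters are independent, so the accelerator label only shows as used when the request names it.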
Yes, this was launching two jobs with `custom_resource={"V100": 2}`.
Verified on master.
Is this a problem that we need to fix for on-prem? Namely, do we explicitly ask on-prem users/admins to use `ray status`? If so, we can use the patch in https://github.com/sky-proj/sky/issues/807#issuecomment-1120114156 to fix this. (This patch fixes `accelerator_type:V100` but not `V100`.) Good target for bug squash party.
Admins only use `ray status` after they set up the Ray cluster, to check that it is running. Otherwise, I think this fix works.
Can you apply and push that fix? Good target for bug squash!
Can this be closed @michaelzhiluo?