[Ray Autoscaler] `ray status`, Accelerator Placement Group

Open michaelzhiluo opened this issue 2 years ago • 7 comments

  1. Launch a P3.8x instance with 4 V100s.
  2. Submit a job that requests 2 V100s.
  3. `ray status` on the head node seems to show incorrect output.
[Screenshot: 2022-04-21, 6:40:59 PM]

This could be a Ray bug.

michaelzhiluo avatar Apr 22 '22 01:04 michaelzhiluo

Did you launch the task via `sky launch`, using the generated Ray program? Ray's resource allocation is purely a bookkeeping system. Even with num_gpus=2 for a placement group (pg), you may also need to specify resources={"V100": 2} in the pg. (reference)
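To illustrate the bookkeeping point, here is a toy sketch in plain Python. It is not Ray's implementation: `ResourceLedger` and its methods are hypothetical names made up for this example, and it assumes the node advertises both a generic `GPU` count and a custom `V100` count. The idea is only that a ledger updates the counters a request names, so a request for `GPU` alone would leave the `V100` counter untouched.

```python
# Toy model of logical resource bookkeeping. NOT Ray's actual code;
# the class and method names are hypothetical, for illustration only.

class ResourceLedger:
    """Tracks used/total counts for named logical resources."""

    def __init__(self, total):
        self.total = dict(total)
        self.used = {name: 0 for name in total}

    def acquire(self, demand):
        """Reserve the named resources, or raise if any is unavailable."""
        # Check everything first so a failed request reserves nothing.
        for name, amount in demand.items():
            if self.used.get(name, 0) + amount > self.total.get(name, 0):
                raise ValueError(f"not enough {name}")
        for name, amount in demand.items():
            self.used[name] = self.used.get(name, 0) + amount

    def status(self):
        """Return 'used/total name' strings, sorted by resource name."""
        return [f"{self.used[n]}/{self.total[n]} {n}" for n in sorted(self.total)]


# Assumption: a 4x V100 node exposes a generic GPU count and a custom V100 count.
node = {"GPU": 4, "V100": 4}

gpu_only = ResourceLedger(node)
gpu_only.acquire({"GPU": 2})            # only the GPU counter moves
print(gpu_only.status())

both = ResourceLedger(node)
both.acquire({"GPU": 2, "V100": 2})     # both counters move
print(both.status())
```

In this toy model the first ledger reports 2 of 4 `GPU` used but 0 of 4 `V100` used, while the second reports 2 of 4 for both. That would be consistent with needing `"V100": 2` alongside the GPU count in the pg.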

Michaelvll avatar Apr 22 '22 04:04 Michaelvll

Yes, this was launching two jobs with custom_resource={"V100": 2}.

michaelzhiluo avatar Apr 22 '22 07:04 michaelzhiluo

Verified on master.

michaelzhiluo avatar Apr 24 '22 22:04 michaelzhiluo

Is this a problem we need to fix for on-prem? That is, do we explicitly ask on-prem users/admins to use `ray status`? If so, we can use this patch to fix it: https://github.com/sky-proj/sky/issues/807#issuecomment-1120114156 (The patch fixes accelerator_type:V100 but not V100.) Good target for the bug squash party.

concretevitamin avatar Jun 01 '22 00:06 concretevitamin

Admins only use `ray status` after setting up the Ray cluster, to check that it is running. Otherwise, I think this fix works.

michaelzhiluo avatar Jun 02 '22 19:06 michaelzhiluo

Can you apply and push that fix? Good target for bug squash!

concretevitamin avatar Jun 02 '22 21:06 concretevitamin

Can this be closed @michaelzhiluo?

concretevitamin avatar Oct 12 '22 01:10 concretevitamin