
Add host VM - GPU compatibility checks for GCP

Open WoosukKwon opened this issue 2 years ago • 1 comments

This PR checks compatibility between GCP host VMs and accelerators. For example, GPUs (except A100) can only be attached to N1 machines, and each GPU type limits the number of vCPUs and the amount of CPU memory that its host VM can have. This PR hard-codes this information in the GCP catalog and tells users when their requests are invalid.

Tested:

  • sky gpunode --instance-type n1-highmem-16 --gpus K80 -c test (invalid)
  • sky gpunode --instance-type n1-highmem-16 --gpus K80:2 -c test (valid)
  • sky gpunode --instance-type n1-highmem-16 --gpus A10G -c test (invalid)
  • sky gpunode --instance-type a2-highgpu-1g -c test (invalid)
  • sky gpunode --instance-type a2-highgpu-1g --gpus A100:2 -c test (invalid)
  • sky gpunode --instance-type n1-highcpu-16 --gpus A100 -c test (invalid)
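A minimal sketch of how a hard-coded limit table could drive the first two test cases above. This is not the PR's actual code: the table names, the function name `host_fits_accelerator`, and the limit values are assumptions for illustration, and the numbers should be verified against the GCP documentation before use.

```python
from typing import Dict, Tuple

# Hypothetical catalog: (accelerator, count) -> (max vCPUs, max memory in GB).
# Illustrative values only; check the GCP docs for the real limits.
_MAX_HOST_SIZE: Dict[Tuple[str, int], Tuple[int, float]] = {
    ('K80', 1): (8, 52),
    ('K80', 2): (16, 104),
}

# Hypothetical host specs: instance type -> (vCPUs, memory in GB).
_HOST_SPECS: Dict[str, Tuple[int, float]] = {
    'n1-highmem-16': (16, 104),
}


def host_fits_accelerator(instance_type: str, acc: str, count: int) -> bool:
    """Return True if the host stays within the accelerator's limits."""
    vcpus, memory = _HOST_SPECS[instance_type]
    max_vcpus, max_memory = _MAX_HOST_SIZE[(acc, count)]
    return vcpus <= max_vcpus and memory <= max_memory


# Mirrors the first two tested commands under the assumed limits.
print(host_fits_accelerator('n1-highmem-16', 'K80', 1))  # too many vCPUs for one K80
print(host_fits_accelerator('n1-highmem-16', 'K80', 2))  # within the two-K80 limits
```

Keeping the limits in a table rather than in branching logic means a new accelerator only needs a new entry, not a new code path.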

WoosukKwon avatar Jul 18 '22 06:07 WoosukKwon

@concretevitamin Thanks for your review! While I addressed all of your comments, I found that this PR breaks sky exec and sky launch -c existing-cluster. For existing clusters, we only need to check if the resource request is less demanding than what the cluster has. Thus, the check_host_accelerator_availability function should be called only when a new cluster is launched.
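The existing-cluster rule described above could be sketched roughly as follows. The function name `is_less_demanding` and the dictionary representation of accelerators are assumptions for illustration, not SkyPilot's actual API.

```python
from typing import Dict


def is_less_demanding(requested: Dict[str, float],
                      available: Dict[str, float]) -> bool:
    """Return True if every requested accelerator fits in what the cluster has.

    Host vCPU and memory limits are deliberately not consulted here: the
    cluster already exists, so only the request size matters.
    """
    return all(
        name in available and count <= available[name]
        for name, count in requested.items()
    )


cluster = {'V100': 1}
print(is_less_demanding({'V100': 0.01}, cluster))  # fractional request fits
print(is_less_demanding({'V100': 2}, cluster))     # asks for more than exists
```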

I found that such a compatibility check is also needed for other clouds and filed issue #1025.

WoosukKwon avatar Jul 30 '22 07:07 WoosukKwon

@concretevitamin I changed the compatibility check so that it is invoked by the optimizer. This PR no longer breaks sky launch and sky exec on existing clusters. One slight downside of this implementation is that in sky spot launch, the compatibility check does not run until the spot controller runs the optimizer. I think we can address this in a future PR. PTAL.

WoosukKwon avatar Aug 29 '22 21:08 WoosukKwon

I changed the implementation substantially. The PR now consists of two new functions: check_host_accelerator_compatibility and check_accelerator_attachable_to_host.

The first function, check_host_accelerator_compatibility, is invoked when Resources objects are created. It only checks that accelerators are used with N1 machines; it does NOT check the accelerator's maximum vCPU count or maximum memory limits, because any Resources such as GCP(n1-highmem-64, {'V100': 0.01}) are allowed for sky exec.

The second function, check_accelerator_attachable_to_host, checks the CPU and memory limits. It is invoked by the optimizer, so sky exec does not execute it.
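To illustrate the split, here is a simplified sketch of the two stages. It is not the merged code: `check_at_creation`, `check_in_optimizer`, `IncompatibleError`, and the `max_vcpus` argument are hypothetical stand-ins, and the vCPU limit used in the example is a placeholder rather than a real GCP figure.

```python
from typing import Dict


class IncompatibleError(ValueError):
    """Raised when a host VM and an accelerator cannot be combined."""


def check_at_creation(instance_type: str,
                      accelerators: Dict[str, float]) -> None:
    """Stage 1, run when a resource request is built: machine family only."""
    for name in accelerators:
        if name != 'A100' and not instance_type.startswith('n1-'):
            raise IncompatibleError(
                f'{name} needs an N1 host, got {instance_type}')


def check_in_optimizer(host_vcpus: int, max_vcpus: int) -> None:
    """Stage 2, run only when planning a new cluster: size limits."""
    if host_vcpus > max_vcpus:
        raise IncompatibleError(
            f'host has {host_vcpus} vCPUs, limit is {max_vcpus}')


# A fractional request on a large N1 host passes stage 1, which is what
# running a job on an existing cluster needs.
check_at_creation('n1-highmem-64', {'V100': 0.01})

# Stage 2 rejects the same host for a new launch when the assumed limit
# (12 here, a placeholder value) is exceeded.
try:
    check_in_optimizer(host_vcpus=64, max_vcpus=12)
except IncompatibleError as e:
    print(f'rejected: {e}')
```

Splitting the checks this way keeps the cheap, always-valid rule close to object construction and defers the size limits to the one code path that actually provisions new VMs.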

@concretevitamin Could you please take another look?

WoosukKwon avatar Aug 31 '22 05:08 WoosukKwon

I've checked that this PR does not break any smoke test.

WoosukKwon avatar Aug 31 '22 19:08 WoosukKwon

@concretevitamin If you don't have any more concerns about this PR, I'll merge it.

WoosukKwon avatar Aug 31 '22 21:08 WoosukKwon

Let’s ship it!

concretevitamin avatar Aug 31 '22 22:08 concretevitamin

Let’s ship it!

concretevitamin avatar Oct 11 '22 06:10 concretevitamin