skypilot
Add host VM - GPU compatibility checks for GCP
This PR checks compatibility between GCP host VMs and accelerators. For example, GPUs (except A100) can only be attached to N1 machines, and each GPU limits the number of vCPUs and the amount of CPU memory that its host VM can have. This PR hard-codes this information in the GCP catalog and tells users when their requests are invalid.
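As a rough illustration of the kind of hard-coded lookup described above, here is a minimal sketch. The table names, the helper find_incompatibility, and the numeric limits are assumptions made for this example, not the code in this PR; the K80 limits and the n1-highmem-16 size should be confirmed against the GCP documentation.

```python
from typing import Dict, Optional, Tuple

# Hypothetical catalog data: (vCPUs, memory in GB) of each host VM type.
_HOST_SPECS: Dict[str, Tuple[int, float]] = {
    'n1-highmem-16': (16, 104),
}

# Hypothetical catalog data: for each GPU and GPU count, the largest host
# (max vCPUs, max memory in GB) it can be attached to. Illustrative values.
_MAX_HOST_SIZE: Dict[str, Dict[int, Tuple[int, float]]] = {
    'K80': {1: (8, 52), 2: (16, 104), 4: (32, 208), 8: (64, 416)},
}


def find_incompatibility(instance_type: str, acc_name: str,
                         acc_count: int) -> Optional[str]:
    """Returns a reason string if the request is invalid, else None."""
    if acc_name not in _MAX_HOST_SIZE:
        return f'{acc_name} is not available in this catalog.'
    if not instance_type.startswith('n1-'):
        return f'{acc_name} can only be attached to N1 machines.'
    limits = _MAX_HOST_SIZE[acc_name].get(acc_count)
    if limits is None:
        return f'{acc_name}:{acc_count} is not a supported GPU count.'
    host = _HOST_SPECS.get(instance_type)
    if host is None:
        return f'{instance_type} is not in this catalog.'
    (vcpus, memory), (max_vcpus, max_memory) = host, limits
    if vcpus > max_vcpus or memory > max_memory:
        return (f'{acc_name}:{acc_count} supports hosts with at most '
                f'{max_vcpus} vCPUs and {max_memory} GB of memory; '
                f'{instance_type} has {vcpus} vCPUs and {memory} GB.')
    return None
```

With these assumed values, n1-highmem-16 with K80 is rejected (16 vCPUs exceeds the limit of 8), while n1-highmem-16 with K80:2 is accepted, which matches the first two cases in the test list below.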
Tested:
- sky gpunode --instance-type n1-highmem-16 --gpus K80 -c test (invalid)
- sky gpunode --instance-type n1-highmem-16 --gpus K80:2 -c test (valid)
- sky gpunode --instance-type n1-highmem-16 --gpus A10G -c test (invalid)
- sky gpunode --instance-type a2-highgpu-1g -c test (invalid)
- sky gpunode --instance-type a2-highgpu-1g --gpus A100:2 -c test (invalid)
- sky gpunode --instance-type n1-highcpu-16 --gpus A100 -c test (invalid)
@concretevitamin Thanks for your review! While I addressed all of your comments, I found that this PR breaks sky exec and sky launch -c existing-cluster. For existing clusters, we only need to check if the resource request is less demanding than what the cluster has. Thus, the check_host_accelerator_availability function should be called only when a new cluster is launched.
I found that such a compatibility check is also needed for other clouds and filed the issue #1025.
@concretevitamin I moved the compatibility check so that it is invoked by the optimizer. Now this PR does not break sky launch and sky exec on existing clusters. However, a slight downside of this implementation is that in sky spot launch the compatibility check is not made until the spot controller runs the optimizer. I think we can address this in a future PR. PTAL.
I changed the implementation substantially. The PR now consists of two new functions: check_host_accelerator_compatibility and check_accelerator_attachable_to_host.
The first function, check_host_accelerator_compatibility, is invoked when Resources objects are created. It only checks that accelerators are used with N1 machines, and does NOT check the maximum vCPU count and maximum memory limits for the accelerator, because any Resources like GCP(n1-highmem-64, {'V100': 0.01}) are allowed for sky exec.
The second function, check_accelerator_attachable_to_host, checks the CPU and memory limits. It is invoked by the optimizer, so sky exec will not execute this function.
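To make the split concrete, here is a simplified sketch of the two stages. The function names come from this PR, but the signatures, bodies, error type, and the numeric values in the tables are assumptions for illustration only; the real functions live in the GCP catalog code and may differ.

```python
from typing import Dict, Tuple

# Assumed data for this sketch; verify the numbers against GCP docs.
_HOST_SPECS: Dict[str, Tuple[int, float]] = {'n1-highmem-64': (64, 416)}
_MAX_HOST_SIZE: Dict[str, Dict[int, Tuple[int, float]]] = {
    'V100': {1: (12, 78)},
}


def check_host_accelerator_compatibility(
        instance_type: str, accelerators: Dict[str, float]) -> None:
    """Stage 1: run when a Resources object is created.

    Only checks the machine family. Accelerator counts are ignored, so
    fractional requests used with sky exec still pass.
    """
    for name in accelerators:
        # A100 follows different rules and is not modeled in this sketch.
        if name != 'A100' and not instance_type.startswith('n1-'):
            raise ValueError(
                f'{name} can only be attached to N1 machines, '
                f'got {instance_type}.')


def check_accelerator_attachable_to_host(
        instance_type: str, accelerators: Dict[str, int]) -> None:
    """Stage 2: run by the optimizer, i.e. only for new clusters.

    Checks the host's vCPU count and memory against the GPU's limits.
    """
    vcpus, memory = _HOST_SPECS[instance_type]
    for name, count in accelerators.items():
        max_vcpus, max_memory = _MAX_HOST_SIZE[name][count]
        if vcpus > max_vcpus or memory > max_memory:
            raise ValueError(
                f'{instance_type} is too large for {name}:{count} '
                f'(max {max_vcpus} vCPUs, {max_memory} GB).')


# A fractional request against an existing cluster passes stage 1.
check_host_accelerator_compatibility('n1-highmem-64', {'V100': 0.01})
```

In this sketch, asking the optimizer path to provision a new n1-highmem-64 with V100:1 would raise in stage 2, because 64 vCPUs exceeds the assumed limit of 12, while the same host paired with a fractional V100 request passes stage 1 untouched.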
@concretevitamin Could you please take another look?
I've checked that this PR does not break any smoke test.
@concretevitamin If you don't have any more concerns about this PR, I'll merge it.
Let’s ship it!