skypilot icon indicating copy to clipboard operation
skypilot copied to clipboard

[Core] Fix partial GPUs when region specified

Open Michaelvll opened this issue 2 years ago • 4 comments

Fixes #1258

Tested with the reproducible code in #1258

Michaelvll avatar Oct 16 '22 23:10 Michaelvll

Tagging @WoosukKwon @infwinston - can you take a look as well?

Is there a potential issue where, say K80:4 is the only K80 config available in a region, and users do sky launch --gpus K80:3 --region .... In this case, should we let it succeed (which changes our resource fulfillment semantics to >=; currently it is exactly satisfiable)?

concretevitamin avatar Oct 16 '22 23:10 concretevitamin

Is there a potential issue where, say K80:4 is the only K80 config available in a region, and users do sky launch --gpus K80:3 --region .... In this case, should we let it succeed (which changes our resource fulfillment semantics to >=; currently it is exactly satisfiable)?

The launch-ability will be checked in the optimizer. The following is our error message for that:

> sky launch -c test-gpu --gpus K80:3 --region us-central1 --cloud gcp                    
I 10-16 17:14:56 optimizer.py:880] No resource satisfying {'K80': 3} on [GCP].
I 10-16 17:14:56 optimizer.py:883] Did you mean: ['K80:4', 'K80:8']
sky.exceptions.ResourcesUnavailableError: No launchable resource found for task sky-cmd. To fix: relax its resource requirements.
Hint: 'sky show-gpus --all' to list available accelerators.
      'sky check' to check the enabled clouds.

Michaelvll avatar Oct 17 '22 00:10 Michaelvll

Thanks @Michaelvll for quickly fixing the bug! I've checked that this PR resolves the issue.

However, I have a bit fundamental question: do we really need to call accelerator_in_region_or_zone (or any other validation check) for sky exec? To my understanding, comparing the user's resource requirement with the cluster's resource (by less_demanding_than) should be enough for sky exec. I believe all the validation checks in Resources are only for sky launch, and we need to differentiate the two in order to avoid confusion.

WoosukKwon avatar Oct 17 '22 10:10 WoosukKwon

However, I have a bit fundamental question: do we really need to call accelerator_in_region_or_zone (or any other validation check) for sky exec? To my understanding, comparing the user's resource requirement with the cluster's resource (by less_demanding_than) should be enough for sky exec. I believe all the validation checks in Resources are only for sky launch, and we need to differentiate the two in order to avoid confusion.

Great point @WoosukKwon! The resources should only be checked for availability during launching, instead of creation. I modified the PR thoroughly to move the resource validation to optimization and added several tests to make sure the modification is correct. PTAL.

Tested:

  • [x] tests/run_smoke_tests.sh

Michaelvll avatar Oct 17 '22 22:10 Michaelvll

This has been fixed in the master branch. Closing this PR for now.

Michaelvll avatar May 09 '23 20:05 Michaelvll

Sorry, this issue still exists by the following commands:

sky gpunode --gpus K80:4 --region us-central1 --cloud gcp -c test-gpu
sky exec --gpus K80:3 --cloud gcp --region us-central1 test-gpu 'echo hi'

The error message will be:

ValueError: Accelerator "K80" is not available in "us-central1".

The error is quite confusing.

Michaelvll avatar May 09 '23 20:05 Michaelvll

A kind bump for this PR @WoosukKwon

Michaelvll avatar May 09 '23 22:05 Michaelvll

I further refactored the code in optimizer and clouds to reduce the redundant codes. Please feel free to take another look. @WoosukKwon Tested:

  • [x] pytest tests/test_optimizer_dryrun.py with the A100 test included.
  • [x] pytest tests/test_cli.py with the job execution on existing cluster tests included.
  • [x] sky launch --gpus A100:8 --region us-west-1 --cloud aws should fail before the confirmation
  • [x] pytest tests/test_smoke.py

To reproduce the issue for failing to execute job with GPUs on an existing cluster:

  1. go to the master branch
  2. git cherry-pick ced2878
  3. pytest tests/test_jobs.py

Michaelvll avatar Jul 05 '23 21:07 Michaelvll

Thanks a lot for the review @WoosukKwon!! I am testing the smoke tests before merging:

  • [x] pytest tests/test_smoke.py (with #2241 merged)

Michaelvll avatar Jul 14 '23 04:07 Michaelvll