skypilot
skypilot copied to clipboard
[Core] Fix partial GPUs when region specified
Fixes #1258
Tested with the reproducible code in #1258
Tagging @WoosukKwon @infwinston - can you take a look as well?
Is there a potential issue where, say K80:4 is the only K80 config available in a region, and users do sky launch --gpus K80:3 --region ...
. In this case, should we let it succeed (which changes our resource fulfillment semantics to >=
; currently it is exactly satisfiable)?
Is there a potential issue where, say K80:4 is the only K80 config available in a region, and users do
sky launch --gpus K80:3 --region ...
. In this case, should we let it succeed (which changes our resource fulfillment semantics to>=
; currently it is exactly satisfiable)?
The launch-ability will be checked in the optimizer. The following is our error message for that:
> sky launch -c test-gpu --gpus K80:3 --region us-central1 --cloud gcp
I 10-16 17:14:56 optimizer.py:880] No resource satisfying {'K80': 3} on [GCP].
I 10-16 17:14:56 optimizer.py:883] Did you mean: ['K80:4', 'K80:8']
sky.exceptions.ResourcesUnavailableError: No launchable resource found for task sky-cmd. To fix: relax its resource requirements.
Hint: 'sky show-gpus --all' to list available accelerators.
'sky check' to check the enabled clouds.
Thanks @Michaelvll for quickly fixing the bug! I've checked that this PR resolves the issue.
However, I have a bit fundamental question: do we really need to call accelerator_in_region_or_zone
(or any other validation check) for sky exec
? To my understanding, comparing the user's resource requirement with the cluster's resource (by less_demanding_than
) should be enough for sky exec
. I believe all the validation checks in Resources
are only for sky launch
, and we need to differentiate the two in order to avoid confusion.
However, I have a bit fundamental question: do we really need to call
accelerator_in_region_or_zone
(or any other validation check) forsky exec
? To my understanding, comparing the user's resource requirement with the cluster's resource (byless_demanding_than
) should be enough forsky exec
. I believe all the validation checks inResources
are only forsky launch
, and we need to differentiate the two in order to avoid confusion.
Great point @WoosukKwon! The resources should only be checked for availability during launching, instead of creation. I modified the PR thoroughly to move the resource validation to optimization and added several tests to make sure the modification is correct. PTAL.
Tested:
- [x]
tests/run_smoke_tests.sh
This has been fixed in the master branch. Closing this PR for now.
Sorry, this issue still exists by the following commands:
sky gpunode --gpus K80:4 --region us-central1 --cloud gcp -c test-gpu
sky exec --gpus K80:3 --cloud gcp --region us-central1 test-gpu 'echo hi'
The error message will be:
ValueError: Accelerator "K80" is not available in "us-central1".
The error is quite confusing.
A kind bump for this PR @WoosukKwon
I further refactored the code in optimizer and clouds to reduce the redundant codes. Please feel free to take another look. @WoosukKwon Tested:
- [x]
pytest tests/test_optimizer_dryrun.py
with the A100 test included. - [x]
pytest tests/test_cli.py
with the job execution on existing cluster tests included. - [x]
sky launch --gpus A100:8 --region us-west-1 --cloud aws
should fail before the confirmation - [x]
pytest tests/test_smoke.py
To reproduce the issue for failing to execute job with GPUs on an existing cluster:
- go to the master branch
-
git cherry-pick ced2878
-
pytest tests/test_jobs.py
Thanks a lot for the review @WoosukKwon!! I am testing the smoke tests before merging:
- [x]
pytest tests/test_smoke.py
(with #2241 merged)