[Spot] Auto-translated bucket leakage if the spot job is not submitted correctly
Currently, the buckets generated by the automatic file mounts translation are leaked if the spot job is not submitted correctly: we rely on the persistent: False option in SkyPilot storage, which only takes effect once the spot job has been successfully submitted to the controller and is managed by it.
Related to #1280
This has been raised by a user, who has run into bucket limit errors due to too many zombie buckets not being cleaned up (skypilot-filemounts-files-<user>-<hash>).
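To make the ordering concrete, here is a minimal sketch of the failure window, using hypothetical stand-in functions (create_bucket, submit_to_controller) rather than SkyPilot's actual code:

```python
# Hypothetical stand-ins, not SkyPilot's real implementation; the point is the
# ordering. The translation step creates buckets eagerly, while the
# persistent: False cleanup is only honored once the controller manages the
# job, so a failure between the two steps leaves the bucket with no owner.

def create_bucket(name: str) -> None:
    # Stand-in for the cloud API call made during the file mounts translation.
    print(f'Created bucket {name}')


def submit_to_controller(task_yaml: str) -> None:
    # Stand-in for launching the controller and submitting the job; this is
    # the step that can fail or be interrupted with Ctrl-C.
    raise KeyboardInterrupt('Ctrl-C while launching the spot controller')


def launch_spot_job(task_yaml: str) -> None:
    create_bucket('skypilot-filemounts-files-user-abcd1234')  # step 1
    submit_to_controller(task_yaml)                           # step 2: fails
    # Only after step 2 would the controller track the bucket and delete it
    # (persistent=False) when the job finishes.


try:
    launch_spot_job('test-acc.yaml')
except KeyboardInterrupt:
    pass  # the CLI exits; nothing ever deletes the bucket, so it is leaked
```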
While working on #1280, I'd like to understand what this related issue is about and have some questions @concretevitamin @Michaelvll.
auto file mounts translation
- This is referring to execution.py/_maybe_translate_local_file_mounts_and_sync_up(), right? (See the conceptual sketch after these questions.)
spot job is not submitted correctly
- What does it mean for a spot job to not be submitted correctly?
too many zombie buckets not being cleaned up
- Is this the same problem as #1313?
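For context, a rough conceptual sketch of what that auto file mounts translation does, with hypothetical names (the real logic lives in the function referenced above): local workdir and file mount sources are replaced by auto-generated bucket sources so the spot controller can fetch them without access to the user's machine.

```python
import hashlib


def translate_local_mounts(user: str, workdir: str, file_mounts: dict) -> dict:
    """Rewrites local sources to (hypothetical) bucket URLs, mirroring the
    skypilot-<kind>-<user>-<hash> naming seen in the leaked buckets."""

    def bucket_for(path: str, kind: str) -> str:
        digest = hashlib.md5(f'{user}-{path}'.encode()).hexdigest()[:8]
        return f'gs://skypilot-{kind}-{user}-{digest}'

    translated = {}
    if workdir is not None:
        translated['~/sky_workdir'] = bucket_for(workdir, 'workdir')
    for dst, src in file_mounts.items():
        if src.startswith(('gs://', 's3://', 'r2://')):
            translated[dst] = src  # already bucket-backed, nothing to upload
        else:
            translated[dst] = bucket_for(src, 'filemounts-files')
    return translated


print(translate_local_mounts('zongheng', 'llm/axolotl',
                             {'/data': './dataset', '/ckpt': 'gs://my-bucket'}))
```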
This has been raised by a user, who has run into bucket limit errors due to too many zombie buckets not being cleaned up (skypilot-filemounts-files-<user>-<hash>).
This same issue has been raised by users again.
With #2322, I think this should be much easier to solve, since we now store exactly which storage objects are attached to the cluster. cc @landscapepainter
Will look into this :)
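To illustrate the direction suggested above, here is a hedged sketch with a made-up schema (not SkyPilot's actual state store): once the buckets created for a job are recorded somewhere, a cleanup pass can delete exactly those buckets instead of guessing from bucket names.

```python
import sqlite3

# SkyPilot keeps similar state locally; the schema here is invented for illustration.
conn = sqlite3.connect(':memory:')
conn.execute('''CREATE TABLE IF NOT EXISTS job_storage (
                    job_name TEXT,
                    bucket_name TEXT,
                    persistent INTEGER)''')


def record_bucket(job_name: str, bucket_name: str, persistent: bool = False) -> None:
    # Called right after the translation step creates a bucket.
    conn.execute('INSERT INTO job_storage VALUES (?, ?, ?)',
                 (job_name, bucket_name, int(persistent)))
    conn.commit()


def buckets_to_clean(job_name: str) -> list:
    # Ephemeral (persistent=False) buckets for a job that never reached the controller.
    rows = conn.execute(
        'SELECT bucket_name FROM job_storage WHERE job_name = ? AND persistent = 0',
        (job_name,))
    return [name for (name,) in rows]


record_bucket('sky-aa5a-zongheng', 'skypilot-workdir-zongheng-413dc16d')
print(buckets_to_clean('sky-aa5a-zongheng'))
```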
This needs some thought. For example, if a user Ctrl-C's the following launch after the workdir/file mounts have been uploaded to buckets but before any cluster is launched:
» sky spot launch test-acc.yaml
Task from YAML spec: test-acc.yaml
Managed spot job 'sky-aa5a-zongheng' will be launched on (estimated):
I 11-22 15:21:49 optimizer.py:694] == Optimizer ==
I 11-22 15:21:49 optimizer.py:706] Target: minimizing cost
I 11-22 15:21:49 optimizer.py:717] Estimated cost: $0.0 / hour
I 11-22 15:21:49 optimizer.py:717]
I 11-22 15:21:49 optimizer.py:841] Considered resources (1 node):
I 11-22 15:21:49 optimizer.py:910] ----------------------------------------------------------------------------------------------------------------------------
I 11-22 15:21:49 optimizer.py:910] CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN
I 11-22 15:21:49 optimizer.py:910] ----------------------------------------------------------------------------------------------------------------------------
I 11-22 15:21:49 optimizer.py:910] GCP n2-standard-8[Spot] 8 32 - northamerica-northeast2-a 0.04 ✔
I 11-22 15:21:49 optimizer.py:910] OCI VM.Standard.E4.Flex$_8_32[Spot] 8 32 - eJQi:US-SANJOSE-1-AD-1 0.07
I 11-22 15:21:49 optimizer.py:910] AWS m6i.2xlarge[Spot] 8 32 - ap-northeast-3b 0.09
I 11-22 15:21:49 optimizer.py:910] ----------------------------------------------------------------------------------------------------------------------------
I 11-22 15:21:49 optimizer.py:910]
Launching the spot job 'sky-aa5a-zongheng'. Proceed? [Y/n]:
I 11-22 15:21:51 controller_utils.py:324] Translating workdir to SkyPilot Storage...
I 11-22 15:21:51 controller_utils.py:349] Workdir 'llm/axolotl' will be synced to cloud storage 'skypilot-workdir-zongheng-413dc16d'.
I 11-22 15:21:51 controller_utils.py:422] Uploading sources to cloud storage. See: sky storage ls
I 11-22 15:21:53 storage.py:1782] Created GCS bucket skypilot-workdir-zongheng-413dc16d in US-CENTRAL1 with storage class STANDARD
Launching managed spot job 'sky-aa5a-zongheng' from spot controller...
Launching spot controller...
then the bucket is leaked.
Failure to submit a spot job is the current hypothesis on why zombie buckets kept accumulating. We should figure out when to clean up such leaked buckets.
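One hedged sketch of a cleanup direction, assuming a periodic garbage-collection sweep is acceptable: list buckets matching the auto-generated name prefixes and delete any that no live job references (the is_referenced check is a placeholder; it could consult a tracking table like the one sketched earlier). This uses the google-cloud-storage client; the other direction would be to wrap the translation plus controller launch in a try/except and delete freshly created buckets on failure or Ctrl-C.

```python
from google.cloud import storage  # pip install google-cloud-storage

# Prefixes of buckets created by the auto translation, per the names in this issue.
AUTO_PREFIXES = ('skypilot-filemounts-files-', 'skypilot-workdir-')


def sweep_zombie_buckets(project: str, is_referenced) -> None:
    """Deletes auto-generated buckets that no job references anymore."""
    client = storage.Client(project=project)
    for bucket in client.list_buckets():
        if not bucket.name.startswith(AUTO_PREFIXES):
            continue  # leave user-created buckets alone
        if is_referenced(bucket.name):
            continue  # still attached to a live job / cluster
        print(f'Deleting zombie bucket: {bucket.name}')
        bucket.delete(force=True)  # force=True also deletes objects (small buckets only)


# Example (illustration only -- this treats every auto-generated bucket as a zombie):
# sweep_zombie_buckets('my-gcp-project', is_referenced=lambda name: False)
```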