[Spot] Auto-translated bucket leakage if the spot job is not submitted correctly

Open Michaelvll opened this issue 3 years ago • 8 comments

Currently, buckets generated by the auto file mounts translation are leaked if the spot job is not submitted correctly. We rely on the persistent: False option in SkyPilot storage, which only takes effect once the spot job has been successfully submitted to the controller and is managed by it.

Michaelvll avatar Oct 11 '22 19:10 Michaelvll

Related to #1280

Michaelvll avatar Oct 20 '22 21:10 Michaelvll

This has been raised by a user, who has run into bucket limit errors due to too many zombie buckets not being cleaned up (skypilot-filemounts-files-<user>-<hash>).
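For triage, one way to spot candidate zombie buckets is to filter a bucket listing by the auto-generated name pattern. The sketch below is illustrative only: `find_candidate_zombie_buckets` is a hypothetical helper (not part of SkyPilot), the list of bucket names is assumed to come from whatever cloud CLI or SDK is in use, and treating `<hash>` as a lowercase hex string is an assumption.

```python
import re

# Assumption: auto-translated buckets are named
# skypilot-filemounts-files-<user>-<hash>, with <hash> a lowercase hex string.
_FILEMOUNTS_PATTERN = re.compile(
    r"^skypilot-filemounts-files-(?P<user>.+)-(?P<hash>[0-9a-f]+)$"
)


def find_candidate_zombie_buckets(bucket_names, user=None):
    """Return bucket names that match the auto-generated file-mounts pattern.

    A match only means the bucket looks auto-generated; it may still
    belong to a live job, so check before deleting anything.
    """
    matches = []
    for name in bucket_names:
        m = _FILEMOUNTS_PATTERN.match(name)
        if m is None:
            continue
        # Optionally restrict the result to a single user's buckets.
        if user is not None and m.group("user") != user:
            continue
        matches.append(name)
    return matches
```

The result is only a list of candidates; whether a bucket is actually orphaned still has to be confirmed against the jobs that are currently running.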

concretevitamin avatar Apr 11 '23 16:04 concretevitamin

While working on #1280, I'd like to understand what this related issue is about and have some questions @concretevitamin @Michaelvll.

auto file mounts translation

  1. This is referring to execution.py/_maybe_translate_local_file_mounts_and_sync_up(), right?

spot job is not submitted correctly

  1. What does it mean for a spot job to not be submitted correctly?

too many zombie buckets not being cleaned up

  1. Is this the same problem as #1313?

landscapepainter avatar May 22 '23 02:05 landscapepainter

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.

github-actions[bot] avatar Sep 20 '23 02:09 github-actions[bot]

This has been raised by a user, who has run into bucket limit errors due to too many zombie buckets not being cleaned up (skypilot-filemounts-files-<user>-<hash>).

This same issue has been raised by users again.

concretevitamin avatar Nov 21 '23 21:11 concretevitamin

With #2322, I think this should be much easier to solve since we now store exactly which storage objects are attached to the cluster. cc @landscapepainter

romilbhardwaj avatar Nov 22 '23 02:11 romilbhardwaj

Will look into this :)

landscapepainter avatar Nov 22 '23 02:11 landscapepainter

This needs some thought. For example, suppose a user hits Ctrl-C during the following, after the workdir/file mounts have been uploaded to buckets but before any cluster is launched:

» sky spot launch test-acc.yaml
Task from YAML spec: test-acc.yaml
Managed spot job 'sky-aa5a-zongheng' will be launched on (estimated):
I 11-22 15:21:49 optimizer.py:694] == Optimizer ==
I 11-22 15:21:49 optimizer.py:706] Target: minimizing cost
I 11-22 15:21:49 optimizer.py:717] Estimated cost: $0.0 / hour
I 11-22 15:21:49 optimizer.py:717]
I 11-22 15:21:49 optimizer.py:841] Considered resources (1 node):
I 11-22 15:21:49 optimizer.py:910] ----------------------------------------------------------------------------------------------------------------------------
I 11-22 15:21:49 optimizer.py:910]  CLOUD   INSTANCE                          vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE                 COST ($)   CHOSEN
I 11-22 15:21:49 optimizer.py:910] ----------------------------------------------------------------------------------------------------------------------------
I 11-22 15:21:49 optimizer.py:910]  GCP     n2-standard-8[Spot]               8       32        -              northamerica-northeast2-a   0.04          ✔
I 11-22 15:21:49 optimizer.py:910]  OCI     VM.Standard.E4.Flex$_8_32[Spot]   8       32        -              eJQi:US-SANJOSE-1-AD-1      0.07
I 11-22 15:21:49 optimizer.py:910]  AWS     m6i.2xlarge[Spot]                 8       32        -              ap-northeast-3b             0.09
I 11-22 15:21:49 optimizer.py:910] ----------------------------------------------------------------------------------------------------------------------------
I 11-22 15:21:49 optimizer.py:910]
Launching the spot job 'sky-aa5a-zongheng'. Proceed? [Y/n]:
I 11-22 15:21:51 controller_utils.py:324] Translating workdir to SkyPilot Storage...
I 11-22 15:21:51 controller_utils.py:349] Workdir 'llm/axolotl' will be synced to cloud storage 'skypilot-workdir-zongheng-413dc16d'.
I 11-22 15:21:51 controller_utils.py:422] Uploading sources to cloud storage. See: sky storage ls
I 11-22 15:21:53 storage.py:1782] Created GCS bucket skypilot-workdir-zongheng-413dc16d in US-CENTRAL1 with storage class STANDARD
Launching managed spot job 'sky-aa5a-zongheng' from spot controller...
Launching spot controller...

The bucket is now leaked.

Failure to submit a spot job is the current hypothesis for why zombie buckets keep accumulating. We should figure out when to clean up such leaked buckets.
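One possible direction, sketched under stated assumptions: track the buckets created on the client side and delete them if submission does not complete, including on Ctrl-C. Everything here is hypothetical and self-contained. `InMemoryBucketStore` stands in for a real object store, and `submit_with_cleanup` and `submit_fn` are illustrative names, not SkyPilot APIs.

```python
class InMemoryBucketStore:
    """Stand-in for a cloud object store; not a SkyPilot class."""

    def __init__(self):
        self.buckets = set()

    def create(self, name):
        self.buckets.add(name)

    def delete(self, name):
        self.buckets.discard(name)


def submit_with_cleanup(store, bucket_names, submit_fn):
    """Create buckets, then submit; delete the buckets if submission fails.

    After a successful submit, cleanup is left to the controller
    (the persistent: False path described in this issue).
    """
    created = []
    try:
        for name in bucket_names:
            store.create(name)
            created.append(name)
        submit_fn()
    except BaseException:
        # BaseException also covers KeyboardInterrupt (Ctrl-C),
        # which a plain `except Exception` would miss.
        for name in created:
            store.delete(name)
        raise
```

This only covers interruptions that Python can observe. A killed process or a lost connection would still leak buckets, so some periodic garbage collection of unreferenced buckets would likely still be needed.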

concretevitamin avatar Nov 22 '23 23:11 concretevitamin

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.

github-actions[bot] avatar Jul 21 '24 01:07 github-actions[bot]