
[Managed Spot] Features required in the managed spot

Open · Michaelvll opened this issue 2 years ago • 9 comments

Future TODOs:

  • [x] #799
  • [ ] #800
  • [x] #865
  • [ ] Better logging for the spot controller outputs
  • [x] Examples: Test real resnet training with spot launch
  • [ ] Reduce the number of jobs shown in sky spot status by default.
  • [x] Design tests
  • [x] #845
  • [ ] Multi-node spot job.
  • [ ] Currently, we can only launch 2x CPU in the spot controller instance, as our controllers are submitted as sky jobs (each takes 0.5 CPU by default); see the rough sketch after this list.
  • [x] There will be a worker launched after the head node is killed even for the single-node case. It may be related to the ray autoscaling. (This should have been solved by #780)
  • [x] #790
  • [x] #774
  • [x] #777
  • [x] #778
  • [x] Error out on cancel if the controller has already abnormally exited its control loop. (https://github.com/sky-proj/sky/issues/771#issuecomment-1118124565) (#784)
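
(A rough, illustrative sketch of the concurrency math behind the "2x CPU" item above, assuming the controller instance has 2 vCPUs and each per-job controller is a sky job reserving the 0.5 CPU default. The numbers and names are illustrative, not SkyPilot code.)

CONTROLLER_VCPUS = 2          # assumption: vCPUs on the spot controller instance
CPU_PER_CONTROLLER_JOB = 0.5  # default CPU reservation per submitted sky job

# Each managed spot job needs one controller process, so the concurrency
# cap is simply the CPU budget divided by the per-job reservation.
max_concurrent_spot_jobs = int(CONTROLLER_VCPUS / CPU_PER_CONTROLLER_JOB)
print(max_concurrent_spot_jobs)  # -> 4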

Requested features

  • [ ] Show detailed provision/setup logs for the user in sky spot logs. (https://github.com/sky-proj/sky/pull/798#issuecomment-1120221669) (We can use archive logs #800)
  • [x] Remove STARTED column and add complete resource str in sky spot status -a (https://github.com/sky-proj/sky/issues/771#issuecomment-1120115106) (Fixed by #823)
  • [ ] Make RECOVERIES column: 0 (1 attempt), 2 (2 attempts) (https://github.com/sky-proj/sky/issues/771#issuecomment-1118124565)
  • [x] #828
  • [ ] Show region and resources information after the job has finished. (https://github.com/sky-proj/sky/issues/771#issuecomment-1126542448)
  • [x] #861 (https://github.com/sky-proj/sky/issues/771#issuecomment-1118124565)
  • [x] #1073

Michaelvll · Apr 29 '22 02:04

Seeing the following issue. The controller is managing a running spot job. I ran sky autostop --all -i 1. Then the controller was stopped, despite the job still running (see the info printed after recovery, also verified in the console):

» sky spot status                                                                                                          1 ↵
Fetching managed spot job statuses...
Spot controller sky-spot-controller is STOPPED.
To view the latest job table: sky spot status --refresh

Cached job status table [last updated: 56 secs ago]:
ID  NAME               RESOURCES    SUBMITTED  TOT. DURATION  STARTED   JOB DURATION  #RECOVERIES  STATUS
1   sky-6c90-zongheng  1x [V100:1]  1 hr ago   1h 32m 53s     1 hr ago  1h 28m 53s    0            RUNNING


» sky spot status --refresh
Fetching managed spot job statuses...
Spot controller sky-spot-controller is STOPPED.

Restarting controller for latest status...
I 04-30 20:26:51 cloud_vm_ray_backend.py:761] To view detailed progress: tail -n100 -f /Users/zongheng/sky_logs/sky-2022-04-30-20-26-48-561882/provision.log
I 04-30 20:26:54 cloud_vm_ray_backend.py:608] Cluster 'sky-spot-controller' (status: STOPPED) was previously launched in AWS (us-east-1). Relaunching in that region.
I 04-30 20:26:55 cloud_vm_ray_backend.py:935] Launching on AWS us-east-1 (us-east-1a,us-east-1b,us-east-1c,us-east-1d,us-east-1e,us-east-1f)
I 04-30 20:27:48 log_utils.py:45] Head node is up.
I 04-30 20:28:21 cloud_vm_ray_backend.py:835] Successfully provisioned or found existing VM.
Managed spot jobs:
ID  NAME               RESOURCES    SUBMITTED  TOT. DURATION  STARTED    JOB DURATION  #RECOVERIES  STATUS
1   sky-6c90-zongheng  1x [V100:1]  4 hrs ago  4h 24m 39s     4 hrs ago  4h 20m 39s    0            RUNNING

» sky spot status -a
Fetching managed spot job statuses...
Managed spot jobs:
ID  NAME               RESOURCES    SUBMITTED  TOT. DURATION  STARTED    JOB DURATION  #RECOVERIES  STATUS   CLUSTER                                REGION    
1   sky-6c90-zongheng  1x [V100:1]  4 hrs ago  4h 25m 4s      4 hrs ago  4h 21m 4s     0            RUNNING  1x AWS(p3.2xlarge[Spot], {'V100': 1})  us-west-2 

concretevitamin · May 01 '22 03:05

Seeing the following issue. The controller is managing a running spot job. I ran sky autostop --all -i 1. Then, the controller has been stopped, despite the job still running (see the info printed after recovery, also verified in console)

This is very weird... It may be related to the autostop feature, but I think we will not stop the instance if there is any job in the job queue that is still running. By checking the console, do you mean the controller actually stopped?

Michaelvll · May 02 '22 03:05

By checking the console, do you mean the controller actually stopped?

I meant that I checked in the EC2 console and saw the spot instance is still running; the Spot Requests page also showed the request has been live since ~2 days ago, which means it didn't experience any interruption. Since the controller is stopped, the spot job is leaked.

Are you able to reproduce this? Here are the two table entries from the restarted controller:

$ sqlite3 ~/.sky/jobs.db
...
sqlite> select * from jobs;
1|sky-6c90-zongheng|zongheng|1651359821|CANCELLED|sky-2022-04-30-16-00-24-440757|1651359825|1651362263|1x [CPU:1]
sqlite>

$ sqlite3 ~/.sky/spot_jobs.db
...
sqlite> select * from spot;
1|sky-6c90-zongheng|1x [V100:1]|1651359826.74029|RUNNING|sky-2022-04-30-23-03-46-740245|1651360066.70621||1651360066.70621|0|0.0
sqlite>

Status after refresh

Managed spot jobs:
ID  NAME               RESOURCES    SUBMITTED  TOT. DURATION      STARTED    JOB DURATION       #RECOVERIES  STATUS
1   sky-6c90-zongheng  1x [V100:1]  1 day ago  1 day 16h 28m 57s  1 day ago  1 day 16h 24m 57s  0            RUNNING

Now that it's restarted, I ran sky spot cancel 1 and waited a few minutes; status --refresh still shows RUNNING. Here we may want to error out on cancel if the controller has already abnormally exited its control loop.
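
A minimal sketch of that error-out-on-cancel idea, assuming a hypothetical controller_process_alive() check; the names below are illustrative stand-ins, not SkyPilot's actual internals:

class SpotControllerGoneError(Exception):
    """Raised when the per-job controller can no longer act on a request."""

def controller_process_alive(job_id: int) -> bool:
    # Stand-in: in reality this would check whether the controller job for
    # `job_id` is still running in the controller's job queue.
    return False

def cancel_spot_job(job_id: int) -> None:
    if not controller_process_alive(job_id):
        # The control loop already exited abnormally, so a cancel request
        # would be silently ignored; surface that to the user instead of
        # leaving the job (and possibly its cluster) leaked.
        raise SpotControllerGoneError(
            f'Cannot cancel spot job {job_id}: its controller process has '
            'already exited its control loop.')
    # ... otherwise, signal the controller process to cancel the job ...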

concretevitamin · May 02 '22 15:05

UX proposal: Shall we change #RECOVERIES to RECOVERIES and have it show, e.g., 0 (1 attempt), 2 (2 attempts), etc.?

The reason is that it's a little weird to see a spot job run for a long time, end up in the FAILED state, and still have the current #RECOVERIES showing 0.
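
A quick sketch of the proposed cell format (a hypothetical helper, not existing SkyPilot code; it takes both counts so it stays agnostic about how attempts relate to recoveries):

def format_recoveries(num_recoveries: int, num_attempts: int) -> str:
    """Render the proposed RECOVERIES cell, e.g. '0 (1 attempt)'."""
    unit = 'attempt' if num_attempts == 1 else 'attempts'
    return f'{num_recoveries} ({num_attempts} {unit})'

print(format_recoveries(0, 1))  # -> 0 (1 attempt)
print(format_recoveries(2, 2))  # -> 2 (2 attempts)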


UX proposal, needs designing: It might be good to split FAILED into user code failure, recovery failure, and potentially other failures. I don't feel this is very urgent, though, until we hear user complaints.

concretevitamin · May 05 '22 03:05

UX comments for sky spot status:

  1. SUBMITTED and STARTED become the same very soon (e.g., 7 hrs ago). One way is to show detailed timestamps (2022-05-05 18:02 vs. 2022-05-05 18:03). Another way is to remove STARTED. This might be ok, as after a few recoveries the STARTED column becomes less meaningful.

  2. I mostly use -a because I’d like to see the cloud-region. Maybe we should show the Resources str by default, just like sky status? Or some other better treatment.

concretevitamin · May 07 '22 02:05

UX comments for sky spot status:

  1. SUBMITTED and STARTED become the same very soon (e.g., 7 hrs ago). One way is to show detailed timestamps (2022-05-05 18:02 vs. 2022-05-05 18:03). Another way is to remove STARTED. This might be ok, as after a few recoveries the STARTED column becomes less meaningful.
  2. I mostly use -a because I’d like to see the cloud-region. Maybe we should show the Resources str by default, just like sky status? Or some other better treatment.

Good point! Thank you!

  1. I think it is fine to remove the STARTED column.
  2. For the resources, I used the simplified str to fit the serverless concept, where the user does not need to know about the cluster being launched. Do you think we can show the simplified str in the normal spot status, but use the complete str for -a?

Michaelvll · May 07 '22 04:05

It seems that the job duration is at least 1 minute due to our polling frequency, even for jobs that should finish much sooner. This showed up as failures in some smoke tests, where we grep for SUCCEEDED for spot jobs after sleeping for just a few seconds.

To reproduce:

sky spot launch 'echo this should show job duration 1s' -n one-sec

We see its job duration is 1m 1s:

ID  NAME                                RESOURCES   SUBMITTED    TOT. DURATION  STARTED      JOB DURATION  #RECOVERIES  STATUS
6   one-sec                             1x [CPU:1]  3 mins ago   3m 22s         1 min ago    1m 1s         0            SUCCEEDED

I'd say this is minor if it's nontrivial to fix.
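
For what it's worth, here's a tiny sketch of why a coarse status poll puts a ~1-minute floor on the reported duration, assuming a 60-second polling interval (the interval and mechanism are my assumptions, not taken from the controller code):

import math

JOB_STATUS_POLL_INTERVAL = 60  # assumed polling interval, matching the ~1-minute floor seen above

def observed_duration(actual_duration_s: float) -> float:
    # If completion is only noticed at the next poll tick, the recorded end
    # time rounds up to a multiple of the polling interval, so even a
    # 1-second job reports roughly one full interval.
    ticks = max(1, math.ceil(actual_duration_s / JOB_STATUS_POLL_INTERVAL))
    return ticks * JOB_STATUS_POLL_INTERVAL

print(observed_duration(1))   # -> 60
print(observed_duration(90))  # -> 120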

concretevitamin · May 11 '22 16:05

Found during the large-scale launch test:

Can we show CLUSTER and REGION even after the job has entered a terminal state? This seems helpful for the user to know where the job "last" resided.

Currently they are* only shown when a job is in a non-terminal state, because these 2 columns are derived from the job's handle. After the job's cluster is torn down, the handle is gone, so we show dashes.

(*) There seems to be a bug here: roughly 1 out of 100 jobs, even after it has succeeded, still shows these two columns:

...
204  n3-19    1x [V100:1]    18m 30s ago  5m 54s         1s              0            SUCCEEDED  12m 36s ago    -                                        -
206  n3-33    1x [V100:1]    18m 30s ago  5m 53s         1s              0            SUCCEEDED  12m 37s ago    1x GCP(n1-highmem-8[Spot], {'V100': 1})  us-west1
207  n3-61    1x [V100:1]    18m 30s ago  5m 53s         3s              0            SUCCEEDED  12m 39s ago    -                                        -
...

concretevitamin · May 13 '22 22:05

One issue found in test-inline-spot-env:

This smoke test failed on

sky spot status | grep test-inline-spot-env-zongheng-4edc | grep SUCCEEDED

I ran the first grep and saw

407 test-inline-spot-env-zongheng-4edc 1x [CPU:0.5] 21 mins ago 3m 18s - 0 FAILED_NO_RESOURCE

which indicates FAILED_NO_RESOURCE.

I then did sky logs sky-spot-controller 407 and found it's failing due to some other error:

...
(test-inline-spot-env-zongheng-4edc pid=3475) Traceback (most recent call last):
(test-inline-spot-env-zongheng-4edc pid=3475)   File "/home/ubuntu/.local/lib/python3.9/site-packages/sky/execution.py", line 136, in _execute
(test-inline-spot-env-zongheng-4edc pid=3475)     handle = backend.provision(task,
(test-inline-spot-env-zongheng-4edc pid=3475)   File "/home/ubuntu/.local/lib/python3.9/site-packages/sky/utils/timeline.py", line 141, in _record
(test-inline-spot-env-zongheng-4edc pid=3475)     return f(*args, **kwargs)
(test-inline-spot-env-zongheng-4edc pid=3475)   File "/home/ubuntu/.local/lib/python3.9/site-packages/sky/backends/cloud_vm_ray_backend.py", line 1322, in provision
(test-inline-spot-env-zongheng-4edc pid=3475)     backend_utils.check_cluster_name_is_valid(cluster_name)
(test-inline-spot-env-zongheng-4edc pid=3475)   File "/home/ubuntu/.local/lib/python3.9/site-packages/sky/backends/backend_utils.py", line 1441, in check_cluster_name_is_valid
(test-inline-spot-env-zongheng-4edc pid=3475)     raise ValueError(
(test-inline-spot-env-zongheng-4edc pid=3475) ValueError: Cluster name 'test-inline-spot-env-zongheng-4edc-407' has 38 chars; maximum length is 37 chars.
(test-inline-spot-env-zongheng-4edc pid=3475)
...
(test-inline-spot-env-zongheng-4edc pid=3475) I 06-12 19:48:51 recovery_strategy.py:82] Failed to launch the spot cluster.
(test-inline-spot-env-zongheng-4edc pid=3475) E 06-12 19:48:51 controller.py:131] Resources unavailable: Failed to launch the spot cluster after 3 retries.
(test-inline-spot-env-zongheng-4edc pid=3475) I 06-12 19:48:51 spot_state.py:185] Job failed due to failing to find available resources after retries.

which surfaces now because I've accumulated 400+ spot jobs.

Issues:

  • [ ] can we distinguish this kind of error from FAILED_NO_RESOURCE?
  • [ ] orthogonal: as the number of spot jobs increases, we need a better way to ensure spot job cluster names do not exceed the length limit (see the sketch below)
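
For the naming item, one possible direction (a sketch only; the 37-char limit comes from the error above, and the '<task-name>-<job-id>' scheme is my reading of it) is to truncate the task-name prefix so the job-id suffix always fits:

MAX_CLUSTER_NAME_LEN = 37  # from the check_cluster_name_is_valid error above

def spot_cluster_name(task_name: str, job_id: int) -> str:
    # Keep the job id intact (it must stay unique) and trim the task name.
    suffix = f'-{job_id}'
    return task_name[:MAX_CLUSTER_NAME_LEN - len(suffix)] + suffix

print(spot_cluster_name('test-inline-spot-env-zongheng-4edc', 407))
# -> test-inline-spot-env-zongheng-4ed-407 (37 chars)

Appending a short hash of the trimmed-away part could further reduce collision risk between similarly named tasks.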

concretevitamin · Jun 12 '22 20:06