
Spot job stalls at STARTING status

lhqing opened this issue 2 years ago • 5 comments

Hi Sky team,

Problem: I noticed some spot jobs stall at the STARTING status for a long time (>1 hour), and I have to manually cancel them or ssh into the controller and run sky down STUCK_JOB_NAME to shut down that VM.
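Roughly, the manual cleanup I end up doing looks like this (just a sketch; STUCK_JOB_NAME and the controller name are placeholders):

# Try cancelling the stuck spot job by name first.
sky spot cancel -n STUCK_JOB_NAME

# If that does not help, ssh into the spot controller (SkyPilot adds it to
# ~/.ssh/config) and tear down the stuck spot cluster directly.
ssh sky-spot-controller-xxxx
sky down STUCK_JOB_NAME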

This usually happens when I submit the first spot job to a new controller, but it is not always reproducible. Sometimes it also happens for spot jobs submitted later on. The controller seems normal, because I can submit the same kind of job successfully.

Environment: I am using the most up-to-date sky every time by running git pull and pip install ".[gcp]", and my environment is GCP with OS Login enabled.

Logs: Here is the provision.log for the first job submitted to the controller; it stayed stuck for >1 hour after the final SSH failure error: sky_logs_spot_job_provison_stuck.log

The spot job YAML file: sky_logs_spot_job.yaml.txt

There is no setup.log or run.log for this.

lhqing avatar Aug 28 '22 20:08 lhqing

Thanks for the report @lhqing. At the end of the provision log, the command

((ps aux | grep "-m sky.skylet.skylet" | grep -q python3) || nohup python3 -m sky.skylet.skylet >> ~/.sky/skylet.log 2>&1 &); export SKY_NUM_GPUS=0 && which nvidia-smi > /dev/null && SKY_NUM_GPUS=$(nvidia-smi --query-gpu=index,name --format=csv,noheader | wc -l); ray stop; ray start --disable-usage-stats --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml  --num-gpus=$SKY_NUM_GPUS

somehow failed.

This should not have blocked; instead, it should have shown something like:

I 08-29 16:44:05 spot_utils.py:217] INFO: The log is not ready yet, as the spot job is STARTING. Waiting for 20 seconds.
I 08-29 16:44:25 spot_utils.py:246] Logs finished for job 1 (status: FAILED_CONTROLLER).
Shared connection to 34.234.225.0 closed.
I 08-29 09:44:25 cloud_vm_ray_backend.py:1991] Job ID: 1

I failed to reproduce this. When it happens next time, could you log into the controller and use py-spy dump --pid <pid> to see what the stuck controller process is doing?
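Roughly something like this (just a sketch; the controller name, grep pattern, and PID are placeholders and may differ on your setup):

# Log into the spot controller (SkyPilot adds it to ~/.ssh/config).
ssh sky-spot-controller-xxxx

# Find the process handling the stuck job, then dump its Python stack.
ps aux | grep -i spot
pip install py-spy          # if it is not already installed
py-spy dump --pid <pid>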

concretevitamin avatar Aug 29 '22 18:08 concretevitamin

Thanks @concretevitamin

Today I had another spot VM stall at STARTING for more than 1 hour. It seems to happen after preemption.

The run.log and provision.log are attached: stuck-starting-log_sky-2022-09-02-06-09-35-874213_provision.log stuck-starting-log_run.log

And the py-spy output:

py-spy dump --pid 14334
Process 14334: ray::worker-0901-m3c-loop-0
Python v3.7.12 (/opt/conda/bin/python3.7)

Thread 14334 (idle): "MainThread"
    select (selectors.py:468)
    process_subprocess_stream (sky_job_78:66)
    run_with_log (sky_job_78:228)
    run_bash_command_with_log (sky_job_78:308)
    main_loop (ray/worker.py:455)
    <module> (ray/workers/default_worker.py:238)
Thread 14390 (idle): "ray_import_thread"
    wait (threading.py:300)
    _wait_once (grpc/_common.py:106)
    wait (grpc/_common.py:148)
    result (grpc/_channel.py:735)
    _poll_locked (ray/_private/gcs_pubsub.py:249)
    poll (ray/_private/gcs_pubsub.py:385)
    _run (ray/_private/import_thread.py:70)
    run (threading.py:870)
    _bootstrap_inner (threading.py:926)
    _bootstrap (threading.py:890)
Thread 24994 (idle): "Thread-11"
    channel_spin (grpc/_channel.py:1258)
    run (threading.py:870)
    _bootstrap_inner (threading.py:926)
    _bootstrap (threading.py:890)

lhqing avatar Sep 02 '22 06:09 lhqing

In GCP's console, does the cluster worker-0901-m3c-loop-0-78 exist? If so, is it in a stopped state (indicating being preempted)?
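(If the console isn't handy, something like this should also show the instance state from the CLI; the name filter is just based on the cluster name above:)

# List matching GCP instances and their status (RUNNING / TERMINATED / ...).
gcloud compute instances list --filter="name~worker-0901-m3c-loop-0-78"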

concretevitamin avatar Sep 02 '22 15:09 concretevitamin

> In GCP's console, does the cluster worker-0901-m3c-loop-0-78 exist? If so, is it in a stopped state (indicating being preempted)?

I forgot to take a screenshot... I will make sure to do so next time. But the worker-0901-m3c-loop-0-78 status was Running, and the CPU usage was constantly at 1%.

lhqing avatar Sep 02 '22 17:09 lhqing

@Michaelvll Logs for another stuck example; the VM is in Running status: provision.log run.log storage_mounts.log

lhqing avatar Sep 05 '22 21:09 lhqing

This problem should have been mitigated by #1150. I am closing it for now, but feel free to re-open this issue if the problem occurs again.

Michaelvll avatar Oct 20 '22 21:10 Michaelvll

@Michaelvll I think I have a similar issue. Provisioning works well and I can ssh into the spot instance, but the task doesn't seem to be executed and the status is stalled at STARTING.

(sky) sieunpark@Sieuns-MacBook-Pro skypilot-test % ray exec /Users/sieunpark/.sky/generated/sky-spot-controller-67250249.yml 'tail -n 100 -f /tmp/ray/session_latest/logs/monitor*'
2023-07-18 07:17:21,007 WARNING util.py:253 -- Dropping the empty legacy field head_node. head_node is not supported for ray>=2.0.0. It is recommended to remove head_node from the cluster config.
2023-07-18 07:17:21,007 WARNING util.py:253 -- Dropping the empty legacy field worker_nodes. worker_nodes is not supported for ray>=2.0.0. It is recommended to remove worker_nodes from the cluster config.
2023-07-18 07:17:21,008 INFO util.py:376 -- setting max workers for head node type to 0
Loaded cached provider configuration
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
Fetched IP: 3.86.83.180
Warning: Permanently added '3.86.83.180' (ED25519) to the list of known hosts.
tail: cannot open '/tmp/ray/session_latest/logs/monitor*' for reading: No such file or directory
tail: no files remaining
Shared connection to 3.86.83.180 closed.
Error: Command failed:

  ssh -tt -i ~/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_c8c038807e/3a0d425a1e/%C -o ControlPersist=10s -o ConnectTimeout=120s [email protected] bash --login -c -i 'source ~/.bashrc; export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (tail -n 100 -f /tmp/ray/session_latest/logs/monitor*)'

When I run the command on its own,

(sky) sieunpark@Sieuns-MacBook-Pro skypilot-test % ssh -tt -i ~/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_c8c038807e/3a0d425a1e/%C -o ControlPersist=10s -o ConnectTimeout=120s [email protected] bash --login -c -i 'source ~/.bashrc; export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (tail -n 100 -f /tmp/ray/session_latest/logs/monitor*)'
Warning: Permanently added '3.86.83.180' (ED25519) to the list of known hosts.
.bashrc: source: filename argument required
source: usage: source filename [arguments]
tail: cannot open '/tmp/ray/session_latest/logs/monitor*' for reading: No such file or directory
tail: no files remaining

On the spot instance, there actually is no file matching monitor*.

(base) ubuntu@ip-172-31-81-247:/tmp/ray/session_latest/logs$ ls
events  old  python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_23282.log

Config file:

# hello_sky.yaml
name: hello-sky

num_nodes: 1
# Per-node resource requirements.
resources:
  # Optional; if left out, automatically pick the cheapest cloud.
  cloud: aws
  # region: us-west-2
  # 1x NVIDIA V100 GPU
  # accelerators: V100:1

# Working directory (optional) containing the project codebase.
# Its contents are synced to ~/sky_workdir/ on the cluster.
workdir: .

# Typical use: pip install -r requirements.txt
# Invoked under the workdir (i.e., can use its files).
setup: |
  echo "Running setup."

# Typical use: make use of resources, such as running training.
# Invoked under the workdir (i.e., can use its files).
run: |
  echo "Hello, SkyPilot!"
  echo "ls result:"
  ls
  echo "pwd result:" 
  pwd
  conda env list

sieu-n avatar Jul 17 '23 22:07 sieu-n

Probably not much here, but these are the logs I could find:

file_mounts.log provision.log setup-3.86.83.180.log

sieu-n avatar Jul 17 '23 22:07 sieu-n

Thanks @sieu-n for reporting this! Could you create a new issue for this with the version of SkyPilot you are using? Also, it is worth noting that there is a recent problem with a dependency we rely on, pyyaml, which causes issues with job submission: https://github.com/skypilot-org/skypilot/pull/2256. Could you try running your job with the latest master branch to see if that fixes the issue?
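Installing from the master branch should look roughly like this (a sketch; use the extras for your cloud, e.g. [aws] or [gcp]):

git clone https://github.com/skypilot-org/skypilot.git
cd skypilot
pip install -e ".[aws]"   # or ".[gcp]", matching your cloud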

Michaelvll avatar Jul 17 '23 22:07 Michaelvll

Sure, I will reset the environment with the most recent version and raise a new issue if it still occurs. However, this issue has been bothering me for about a week, so I'm not really sure whether that will resolve it.

sieu-n avatar Jul 17 '23 22:07 sieu-n

Was this issue ever resolved? I am having the same problem on AWS right now. But I can't seem to find the solution.

kishb87 avatar Jan 12 '24 02:01 kishb87

I was able to resolve this issue. I needed certain permissions within my policy.

Issues can vary, but to troubleshoot, try sky spot logs --controller {YOUR JOB ID} to see the output of the controller.
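For example (the job ID below is just a placeholder):

# List spot jobs to find the stuck job's ID, then stream the controller-side log.
sky spot queue
sky spot logs --controller 3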

kishb87 avatar Jan 12 '24 04:01 kishb87