FAILED_CONTROLLER after a preemption, no error in logs

Open bruno-hays opened this issue 1 year ago • 2 comments

My job failed due to FAILED_CONTROLLER after a preemption.

sky spot logs --controller 5 doesn't show any error:

(small-yt, pid=45563) I 03-12 15:33:03 spot_utils.py:92] ==================================
(small-yt, pid=45563) I 03-12 15:33:23 spot_utils.py:83] === Checking the job status... ===
(small-yt, pid=45563) I 03-12 15:33:26 spot_utils.py:89] Job status: JobStatus.RUNNING
(small-yt, pid=45563) I 03-12 15:33:26 spot_utils.py:92] ==================================
(small-yt, pid=45563) I 03-12 15:33:46 spot_utils.py:83] === Checking the job status... ===
(small-yt, pid=45563) I 03-12 15:33:49 spot_utils.py:89] Job status: JobStatus.RUNNING
(small-yt, pid=45563) I 03-12 15:33:49 spot_utils.py:92] ==================================
Shared connection to 35.204.42.245 closed
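For anyone wanting to double-check a longer controller log than the excerpt above, here is a minimal sketch that scans the text for anything that is not an info-level line and reports the last job status seen. It assumes the line layout visible in the excerpt, and it assumes the single letter after the pid (here `I`) is the log level. The names `LOG_LINE` and `summarize` are made up for this illustration and are not part of SkyPilot.

```python
import re

# Assumed line layout, inferred only from the excerpt in this issue:
#   (<job name>, pid=<pid>) <level> <MM-DD> <HH:MM:SS> <file>:<line>] <message>
LOG_LINE = re.compile(
    r"\((?P<job>[^,]+), pid=(?P<pid>\d+)\) "
    r"(?P<level>[A-Z]) (?P<date>\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<source>\S+):(?P<lineno>\d+)\] (?P<message>.*)"
)


def summarize(log_text):
    """Return (last_status, non_info_lines, unparsed_lines) for a log dump."""
    last_status = None
    non_info = []   # lines whose level letter is not "I" (assumed to mean INFO)
    unparsed = []   # lines that do not follow the assumed layout at all
    for raw in log_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = LOG_LINE.match(line)
        if match is None:
            unparsed.append(line)
            continue
        if match["level"] != "I":
            non_info.append(line)
        message = match["message"]
        if message.startswith("Job status: "):
            last_status = message[len("Job status: "):]
    return last_status, non_info, unparsed


sample = """\
(small-yt, pid=45563) I 03-12 15:33:23 spot_utils.py:83] === Checking the job status... ===
(small-yt, pid=45563) I 03-12 15:33:26 spot_utils.py:89] Job status: JobStatus.RUNNING
(small-yt, pid=45563) I 03-12 15:33:49 spot_utils.py:92] ==================================
Shared connection to 35.204.42.245 closed
"""

status, non_info, unparsed = summarize(sample)
print("last status:", status)
print("non-INFO lines:", non_info)
print("unparsed lines:", unparsed)
```

On the sample, the only line outside the assumed layout is the closing SSH message, which matches the observation that the log ends without any reported error.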

Please tell me if there is any other info I can share to help understand what may have caused it. I am using GCS and SkyPilot version 0.5.0.

bruno-hays avatar Mar 12 '24 17:03 bruno-hays

Thanks for reporting this @Hubert-Bonisseur! This is quite weird. Possibly the controller process was somehow killed.

Could you share how many spot jobs you were running concurrently and if you have seen any issue with the other spot jobs?

It would be nice if you could share the job task YAML you were running as well : )

Michaelvll avatar Mar 12 '24 17:03 Michaelvll

I was running only one job at that time. I have since launched 2 concurrent spot jobs and they are working fine so far, but there hasn't been a preemption yet. I will post an update when one happens.

Here is the task.yml:

name: small-yt

resources:
  cloud: gcp
  region: europe-west4
  cpus: 12+
  accelerators: A100
  memory: 6+

  disk_size: 500
  disk_tier: 'medium'

file_mounts:
  ~/secret/service_account.json: /Users/datalab/épellations/STT/finetune/secrets/finetuning-414911-4d293f61509f.json

envs:
  COMMIT: b11a10fb86feef059d4798ff883ea719e4169218
  MODEL_ID: small-yt-V2
  NUM_WORKERS: 12

setup: |
  echo "Begin setup."
  sudo apt-get update
  sudo apt-get -y install ffmpeg
  cd ~/sky_workdir
  git clone [email protected]:data-science/speech/finetune.git
  cd finetune_whisper
  git checkout $COMMIT
  pip install -r requirements.txt
  pip install "git+https://github.com/skypilot-org/skypilot.git#egg=sky-callback&subdirectory=sky/callbacks/"
  mkdir ~/checkpoints/
  if gsutil ls gs://finetuning-checkpoints/$MODEL_ID; then
    gsutil -m cp -r gs://finetuning-checkpoints/$MODEL_ID/* ~/checkpoints/
  else
    echo "Remote folder does not exist. Starting a new training run"
  fi
  echo "Setup complete."

run: |
  echo "Beginning task."
  cd finetune
  export GOOGLE_APPLICATION_CREDENTIALS=$(realpath ~/secret/service_account.json)
  export PYTHONPATH=$PWD
  python finetune run configs/training_config_mosaicML.yml

bruno-hays avatar Mar 13 '24 10:03 bruno-hays