FAILED_CONTROLLER after a preemption, no error in logs
My job failed due to FAILED_CONTROLLER after a preemption.
sky spot logs --controller 5 doesn't show any error:
(small-yt, pid=45563) I 03-12 15:33:03 spot_utils.py:92] ==================================
(small-yt, pid=45563) I 03-12 15:33:23 spot_utils.py:83] === Checking the job status... ===
(small-yt, pid=45563) I 03-12 15:33:26 spot_utils.py:89] Job status: JobStatus.RUNNING
(small-yt, pid=45563) I 03-12 15:33:26 spot_utils.py:92] ==================================
(small-yt, pid=45563) I 03-12 15:33:46 spot_utils.py:83] === Checking the job status... ===
(small-yt, pid=45563) I 03-12 15:33:49 spot_utils.py:89] Job status: JobStatus.RUNNING
(small-yt, pid=45563) I 03-12 15:33:49 spot_utils.py:92] ==================================
Shared connection to 35.204.42.245 closed
Please tell me if there is any other info I can share to help understand what may have caused it.
I am using GCS and SkyPilot version 0.5.0.
Thanks for reporting this @Hubert-Bonisseur! This is quite weird. Possibly the controller process was somehow killed.
Could you share how many spot jobs you were running concurrently and if you have seen any issue with the other spot jobs?
It would be nice to share the job task yaml you were running as well : )
I was running only one job at that time. I have since launched 2 concurrent spot jobs and they are working fine so far, but there hasn't been a preemption yet. I will post an update once one occurs.
Here is the task.yml:
name: small-yt

resources:
  cloud: gcp
  region: europe-west4
  cpus: 12+
  accelerators: A100
  memory: 6+
  disk_size: 500
  disk_tier: 'medium'

file_mounts:
  ~/secret/service_account.json: /Users/datalab/épellations/STT/finetune/secrets/finetuning-414911-4d293f61509f.json

envs:
  COMMIT: b11a10fb86feef059d4798ff883ea719e4169218
  MODEL_ID: small-yt-V2
  NUM_WORKERS: 12

setup: |
  echo "Begin setup."
  sudo apt-get update
  sudo apt-get -y install ffmpeg
  cd ~/sky_workdir
  git clone [email protected]:data-science/speech/finetune.git
  cd finetune_whisper
  git checkout $COMMIT
  pip install -r requirements.txt
  pip install "git+https://github.com/skypilot-org/skypilot.git#egg=sky-callback&subdirectory=sky/callbacks/"
  mkdir ~/checkpoints/
  if gsutil ls gs://finetuning-checkpoints/$MODEL_ID; then
    gsutil -m cp -r gs://finetuning-checkpoints/$MODEL_ID/* ~/checkpoints/
  else
    echo "Remote folder does not exist. Starting a new training run"
  fi
  echo "Setup complete."

run: |
  echo "Beginning task."
  cd finetune
  export GOOGLE_APPLICATION_CREDENTIALS=$(realpath ~/secret/service_account.json)
  export PYTHONPATH=$PWD
  python finetune run configs/training_config_mosaicML.yml