maxtext
maxtext copied to clipboard
Fix multihost_job cloud logging in case of maintenance event + recovery
Tested via creating a multihost_job, simulating a maintenance event, and confirming we can see the logs even after recovering from the simulated maintenance event:
Create multihost_job
python3 multihost_job.py --COMMAND="bash setup.sh && python3 MaxText/train.py MaxText/configs/base.yml run_name=mattdavidow-mhj-ops-2 steps=5000 base_output_directory=$bd dataset_path=$dataset" --TPU_TYPE=v4-16 --NUM_SLICES=1 --RUN_NAME=mattdavidow-mhj-ops-2 --BUCKET_NAME=gs://mattdavidow-maxtext-br --ENABLE_AUTOCHECKPOINT=true
Simulate a maintenance event:
worker=t1v-xxx
gcloud compute instances add-metadata $worker --metadata simulate-maintenance-event=TRUE
Cloud logging of auto-checkpoint saving: https://screenshot.googleplex.com/5oUab5nPQiZeFWu
Cloud logging after recovery: https://screenshot.googleplex.com/4CzYr5ywwyMYVZT
Prior to this change, cloud logging would not recover after the maintenace event so would look like https://screenshot.googleplex.com/Bzsif2VUf4uga3d - although we can confirm that training did resume by looking at tensorboard or the GCS log that is sent at the end (but cloud logging stopped working).