Fix multihost_job cloud logging in case of maintenance event + recovery

Open gobbleturk opened this issue 1 year ago • 0 comments

Tested by creating a multihost_job, simulating a maintenance event, and confirming that the logs are still visible after recovery from the simulated event:

Create multihost_job

python3 multihost_job.py --COMMAND="bash setup.sh && python3 MaxText/train.py MaxText/configs/base.yml run_name=mattdavidow-mhj-ops-2 steps=5000 base_output_directory=$bd dataset_path=$dataset" --TPU_TYPE=v4-16 --NUM_SLICES=1 --RUN_NAME=mattdavidow-mhj-ops-2 --BUCKET_NAME=gs://mattdavidow-maxtext-br --ENABLE_AUTOCHECKPOINT=true

Simulate a maintenance event:

worker=t1v-xxx
gcloud compute instances add-metadata $worker --metadata simulate-maintenance-event=TRUE
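
If a job spans several workers, the command above has to be repeated per worker. The following is a minimal sketch of one way to generate those commands, not part of this change: the helper names `maintenance_event_command` and `format_commands` are hypothetical, the worker name is the placeholder from above, and the script only prints the commands rather than running them.

```python
import shlex


def maintenance_event_command(worker: str) -> list[str]:
    """Build the gcloud invocation shown above for a single worker.

    Hypothetical helper for illustration; it only assembles the argument
    list and does not call gcloud.
    """
    return [
        "gcloud", "compute", "instances", "add-metadata", worker,
        "--metadata", "simulate-maintenance-event=TRUE",
    ]


def format_commands(workers: list[str]) -> list[str]:
    """Return one shell-quoted command line per worker."""
    return [shlex.join(maintenance_event_command(w)) for w in workers]


if __name__ == "__main__":
    # "t1v-xxx" is the placeholder worker name used above; substitute real names.
    for line in format_commands(["t1v-xxx"]):
        print(line)
```

Printing the commands instead of executing them keeps the sketch safe to run anywhere; the output can be reviewed and then pasted into a shell.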

Cloud logging of auto-checkpoint saving: https://screenshot.googleplex.com/5oUab5nPQiZeFWu

Cloud logging after recovery: https://screenshot.googleplex.com/4CzYr5ywwyMYVZT

Prior to this change, cloud logging did not recover after the maintenance event, so it looked like https://screenshot.googleplex.com/Bzsif2VUf4uga3d. We could still confirm that training resumed by looking at TensorBoard or the GCS log that is sent at the end, but cloud logging had stopped working.

gobbleturk, Jan 05 '24 03:01