tonyjohnchen
tonyjohnchen
Example usage: ``` python3 multihost_job.py --COMMAND_TYPE=curl --NUM_SLICES=$NUM_SLICES --RUN_NAME=$RUN_NAME --BUCKET_NAME=$BUCKET_NAME --PROJECT=${PROJECT} --ZONE=${ZONE} --TPU_TYPE=$TPU_TYPE --VERSION=$VERSION --COMMAND="" ```
For AOT+Hybridsim integration, we need to pull Hybridsim docker image, so we need to install docker for maxtext base image (https://screenshot.googleplex.com/BD3SMwL57tP5cQY). Tested AOT+Hybridsim e2e on a GKE node: https://screenshot.googleplex.com/5LQwBNBp4p6iyLq This...
Adding new feature `gradient accumulation` to only update weight for every x steps. Example command without using `gradient accumulation`: ``` python3 MaxText/train.py MaxText/configs/base.yml base_output_directory=${MAXTEXT_OUTPUT_PATH} run_name=${RUN_NAME} enable_checkpointing=false async_checkpointing=false per_device_batch_size=1 skip_first_n_steps_for_profiler=5 steps=30...