maxtext
[DON'T MERGE] GCS Checkpointing Testing Workload modification
This is a draft PR for GCS internal members to comment on. It will not be merged to main.
Checkpointing a 64B model through MaxText
- Read and write times are collected and written to GCS buckets; a separate Python program then aggregates them and uploads the results to BQ. I've created b/353631904 to track the improvement of letting each pod write directly to BQ, which is currently blocked on a required node pool recreation.
- A sample YAML file is provided for code review purposes.
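
For reviewers who want a picture of the aggregation step, here is a minimal sketch of what the separate Python program might do once the per-pod timing files have been read from the bucket. It is not the code in this PR. The CSV header (`pod,step,operation,seconds`), the function name `aggregate_timings`, and the summary fields are all hypothetical, and the GCS download and BQ upload are left out so the example stays self-contained.

```python
import csv
import io
import statistics
from collections import defaultdict


def aggregate_timings(csv_blobs):
    """Combine per-pod timing CSVs into one summary row per (step, operation).

    Each element of csv_blobs is the text of one per-pod file, using the
    hypothetical header: pod,step,operation,seconds
    """
    grouped = defaultdict(list)
    for text in csv_blobs:
        for row in csv.DictReader(io.StringIO(text)):
            key = (int(row["step"]), row["operation"])
            grouped[key].append(float(row["seconds"]))

    summary = []
    # Sort so the output order is stable regardless of file order.
    for (step, operation), values in sorted(grouped.items()):
        summary.append({
            "step": step,
            "operation": operation,
            "pods": len(values),
            "mean_seconds": statistics.fmean(values),
            # The slowest pod bounds the wall-clock time of the step.
            "max_seconds": max(values),
        })
    return summary


# Two made-up per-pod files, as they might look after download.
pod0 = "pod,step,operation,seconds\npod-0,0,write,12.0\npod-0,0,read,4.0\n"
pod1 = "pod,step,operation,seconds\npod-1,0,write,14.0\npod-1,0,read,6.0\n"
rows = aggregate_timings([pod0, pod1])
```

Each entry in `rows` is a flat dictionary, one per step and operation, which is a convenient shape for a row-oriented upload. Reporting the maximum alongside the mean is a design choice in this sketch: a checkpoint is only complete when the slowest pod finishes.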