
[DON'T MERGE] GCS Checkpointing Testing Workload modification

Open bernardhan33 opened this issue 7 months ago • 0 comments

This is a draft PR for GCS internal members to comment on. It will not be merged to main.

Checkpointing a 64B model through MaxText

  • Read and write times are collected and sent to GCS buckets; a separate Python program then aggregates them and uploads the results to BQ. I've created b/353631904 to track the improvement of letting each pod write directly to BQ, which is currently blocked by a required nodepool recreation.
  • A sample YAML file is provided for code review purposes.
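
To make the aggregation step concrete, here is a minimal sketch of what the separate Python program might do once the per-pod timing files have been fetched from the bucket. The CSV layout (`operation,seconds`), the function name `aggregate_timings`, and the summary fields are all assumptions for illustration, not the format used in this PR. Reading from GCS and uploading to BQ are deliberately left out.

```python
import csv
import io
import statistics
from collections import defaultdict


def aggregate_timings(csv_texts):
    """Combine per-pod CSV timing files into one summary row per operation.

    Each input is assumed (hypothetically) to be CSV text with the
    header `operation,seconds`, one row per checkpoint read or write.
    """
    durations = defaultdict(list)
    for text in csv_texts:
        for row in csv.DictReader(io.StringIO(text)):
            durations[row["operation"]].append(float(row["seconds"]))

    # One summary row per operation, sorted by name for stable output.
    return [
        {
            "operation": op,
            "count": len(values),
            "mean_seconds": statistics.mean(values),
            "max_seconds": max(values),
        }
        for op, values in sorted(durations.items())
    ]
```

The returned list of dictionaries is shaped so that each entry could become one table row; if pods later write to BQ directly, as tracked in b/353631904, this intermediate step would no longer be needed.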

bernardhan33 · Jul 17 '24 17:07