[DON'T MERGE] GCS Distributed Training Benchmark Infra + File-parallelism + Range-read Parquet files
This is created as a draft PR for GCS internal members to comment on. It will not be merged to main.
File-parallelism + Range-read Parquet files
This PR supports [Benchmark-7: File-Parallelism + Sequential-Read] of the Obidos Storage Benchmarks and builds the infrastructure for adding benchmark-8, benchmark-9, and potential future GCS PyTorch Connector benchmarks that use PyTorch data loading.
Feature Set
- Configurable `epochs`, `local_batch_size`, `prefetch_factor`, `data_loader_num_workers`, and `per_step_computation_time` (which won't block data prefetching) via command-line arguments (see the configuration sketch after this list).
- A sample YAML file that denotes the workload spec, added here to assist with code review. This YAML file will eventually be checked into google3.
- File-parallelism across running pods with range reads of Parquet files. Parquet file lists are padded so that every pod runs the same number of steps, avoiding barrier deadlocks (see the sharding sketch below).
- Metrics aggregation: per-epoch training time, per-step data loading time, and per-step total time. Metrics are collected in in-memory variables and uploaded to the GCS bucket in CSV format after the run completes (see the metrics upload sketch below). A Python program is provided to further aggregate the individual metrics files into a combined CSV file and optionally upload it to BigQuery for efficient and reliable metric analysis.
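As a rough illustration of how the configurable arguments could be wired into PyTorch data loading, here is a minimal sketch. The argument names match the feature list above, but the defaults, the `run` helper, and the dataset handling are assumptions rather than the PR's actual code.

```python
import argparse
import time

from torch.utils.data import DataLoader


def parse_args():
    # Arguments listed in the feature set; defaults here are illustrative only.
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--local_batch_size", type=int, default=32)
    parser.add_argument("--prefetch_factor", type=int, default=2)
    parser.add_argument("--data_loader_num_workers", type=int, default=4)
    parser.add_argument(
        "--per_step_computation_time", type=float, default=0.0,
        help="Simulated compute per step in seconds; the sleep happens on the "
             "main process, so background workers keep prefetching.")
    return parser.parse_args()


def run(args, dataset):
    loader = DataLoader(
        dataset,
        batch_size=args.local_batch_size,
        num_workers=args.data_loader_num_workers,
        prefetch_factor=args.prefetch_factor,
    )
    for _ in range(args.epochs):
        for _batch in loader:
            # Emulate the per-step computation; prefetching continues in the
            # DataLoader worker processes while the main process sleeps.
            time.sleep(args.per_step_computation_time)
```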
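The file-parallelism and padding behavior could look roughly like the following sketch: each pod takes a disjoint, equally sized slice of the Parquet file list, and the list is padded (here by repeating files, an assumed padding strategy) so that every pod sees the same number of files and therefore runs the same number of steps.

```python
import math


def shard_files_for_pod(parquet_files, pod_index, num_pods):
    """Assign each pod a disjoint, equally sized slice of the Parquet file list.

    The list is padded by repeating files from the front (an assumed padding
    strategy) so every pod gets the same number of files, keeping the step
    count identical across pods and avoiding barrier deadlocks.
    """
    files_per_pod = math.ceil(len(parquet_files) / num_pods)
    padded = list(parquet_files)
    while len(padded) < files_per_pod * num_pods:
        padded.append(parquet_files[len(padded) % len(parquet_files)])
    start = pod_index * files_per_pod
    return padded[start:start + files_per_pod]
```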
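The metrics upload could be sketched as below: per-step timings are kept in plain lists and written to a CSV that is uploaded to the metrics bucket at the end of the run. The bucket and object names are placeholders, and the `google-cloud-storage` calls are the standard client API rather than anything specific to this PR.

```python
import csv
import io

from google.cloud import storage


def upload_metrics_csv(step_metrics, bucket_name, object_name):
    """Write per-step metrics to an in-memory CSV and upload it to GCS.

    step_metrics: list of (epoch, step, data_loading_time_s, total_step_time_s)
    tuples. bucket_name and object_name are placeholders for the real
    metrics location.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["epoch", "step", "data_loading_time_s", "total_step_time_s"])
    writer.writerows(step_metrics)

    client = storage.Client()
    blob = client.bucket(bucket_name).blob(object_name)
    blob.upload_from_string(buffer.getvalue(), content_type="text/csv")
```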
TODOs (tickets to be created)
- Currently the benchmark bypasses the listing storm by directly constructing the file names each pod reads from (a sketch follows this list). Optionally, listing could be enabled so users can test that behavior and collect listing metrics.
- Support shuffling between epochs.
- Build a Docker image based on this PR so users only need to use that image unless they want to read or modify the code. The built Docker image will be referenced in the YAML file to be added to the g3doc.
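For reference, the list-storm bypass mentioned in the first TODO can be sketched as follows: instead of each pod issuing a listing call against the bucket, the object names are generated from a fixed naming pattern. The pattern and helper name are hypothetical, not taken from the PR.

```python
def build_file_names(prefix, num_files):
    """Construct Parquet object names directly instead of listing the bucket.

    Assumes a fixed, zero-padded naming pattern (hypothetical). A future option
    could instead obtain the names from an actual listing call so that listing
    metrics can be collected.
    """
    return [f"{prefix}/part-{i:05d}.parquet" for i in range(num_files)]
```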
Tests
All metrics have been uploaded to the distributed-training-metrics bucket, and the BigQuery dataset has been created here. Verified that the BigQuery tables contain the correct number of entries in the expected format.