Adding LWS Integration

Open Edwinhr716 opened this issue 7 months ago • 0 comments

Added integration with LeaderWorkerSet (LWS, https://github.com/kubernetes-sigs/lws) for TPUs, as well as integration of LWS + Pathways.

To run a basic LWS + TPU job:

axlearn gcp launch run --cluster=$CLUSTER \
--runner_name gke_tpu_lws \
--name=$USER \
--instance_type=tpu-v6e-16 \
--bundler_spec=allow_dirty=True \
--bundler_type=artifactregistry --bundler_spec=image=tpu \
--bundler_spec=dockerfile=Dockerfile --bundler_spec=target=tpu \
-- sleep infinity;

To run LWS + Pathways:

axlearn gcp launch run --cluster=$CLUSTER \
--runner_name gke_tpu_lws_pathways \
--name=$USER \
--instance_type=tpu-v6e-16 \
--bundler_spec=allow_dirty=True \
--bundler_type=artifactregistry --bundler_spec=image=tpu \
--bundler_spec=dockerfile=Dockerfile --bundler_spec=target=tpu \
-- sleep infinity;
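
The two commands above differ only in the value passed to --runner_name. As a rough sketch, a small shell helper can make that explicit. The function name lws_launch_cmd is hypothetical and not part of axlearn; it only prints the command (a dry run) using the flags already shown above, and assumes CLUSTER and USER are set in the environment.

```shell
# Hypothetical helper (not part of axlearn): prints the launch command
# for the given runner name instead of executing it.
# Assumes CLUSTER and USER are set in the environment.
lws_launch_cmd() {
  runner="$1"  # gke_tpu_lws or gke_tpu_lws_pathways
  echo "axlearn gcp launch run --cluster=${CLUSTER}" \
    "--runner_name ${runner}" \
    "--name=${USER}" \
    "--instance_type=tpu-v6e-16" \
    "--bundler_spec=allow_dirty=True" \
    "--bundler_type=artifactregistry --bundler_spec=image=tpu" \
    "--bundler_spec=dockerfile=Dockerfile --bundler_spec=target=tpu" \
    "-- sleep infinity"
}

# Print both variants for comparison.
lws_launch_cmd gke_tpu_lws
lws_launch_cmd gke_tpu_lws_pathways
```

Printing rather than running keeps the sketch safe to try; the printed line can be inspected and then executed by hand once it looks right.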

Edwinhr716 · May 12 '25 16:05